[jira] [Commented] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

Natalia Gorchakova (JIRA) Wed, 21 Mar 2018 05:53:15 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407851#comment-16407851
 ]


Natalia Gorchakova commented on SPARK-10848:
--------------------------------------------

As I understand, intent of calling .asNullable on schema was be safe (as there 
is no way to check that field in present in all files) . Don't see any reason 
of that for cases when we have explicit schema in files (for example avro 
files). 

With the current implementation (2.2.x, 2.3.x), dataframe based on avro files 
(with required fields) has all fields nullable. 

Should it be some additional logic ( flag ) to be added to apply nullable only 
for formats without explicit schema?

> Applied JSON Schema Works for json RDD but not when loading json file
> ---------------------------------------------------------------------
>
>                 Key: SPARK-10848
>                 URL: https://issues.apache.org/jira/browse/SPARK-10848
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.5.0
>            Reporter: Miklos Christine
>            Priority: Minor
>
> Using a defined schema to load a json rdd works as expected. Loading the json 
> records from a file does not apply the supplied schema. Mainly the nullable 
> field isn't applied correctly. Loading from a file uses nullable=true on all 
> fields regardless of applied schema. 
> Code to reproduce:
> {code}
> import  org.apache.spark.sql.types._
> val jsonRdd = sc.parallelize(List(
>   """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", 
> "ProductCode": "WQT648", "Qty": 5}""",
>   """{"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", 
> "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, 
> "expressDelivery":true}"""))
> val mySchema = StructType(Array(
>   StructField(name="OrderID"   , dataType=LongType, nullable=false),
>   StructField("CustomerID", IntegerType, false),
>   StructField("OrderDate", DateType, false),
>   StructField("ProductCode", StringType, false),
>   StructField("Qty", IntegerType, false),
>   StructField("Discount", FloatType, true),
>   StructField("expressDelivery", BooleanType, true)
> ))
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> val schema1 = myDF.printSchema
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> val schema2 = dfDFfromFile.printSchema
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": 
> "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16  , "OrderDate": "2015-07-11", "ProductCode": 
> "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-10848) Applied JSON Schema Works for json RDD but not when loading json file

Reply via email to