[
https://issues.apache.org/jira/browse/SPARK-10848?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16407851#comment-16407851
]
Natalia Gorchakova commented on SPARK-10848:
--------------------------------------------
As I understand it, the intent of calling .asNullable on the schema was to be
safe (as there is no way to check that a field is present in all files). I don't
see any reason for that in cases where the files carry an explicit schema (for
example, Avro files).
With the current implementation (2.2.x, 2.3.x), a DataFrame based on Avro files
(with required fields) has all fields marked nullable.
Should some additional logic (a flag) be added so that nullable is applied only
for formats without an explicit schema?
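
To make the question concrete, here is a minimal sketch of what such a flag
could look like. It is only an illustration, not how Spark is implemented: the
names NullabilitySketch, forceNullable, effectiveSchema and
formatHasExplicitSchema are all hypothetical, and forceNullable is a
hand-written stand-in because, as far as I can tell, asNullable itself is
internal to Spark. Only the public types in org.apache.spark.sql.types are
assumed.

```scala
import org.apache.spark.sql.types._

object NullabilitySketch {
  // Hand-written stand-in for the internal asNullable: marks every field,
  // including the contents of nested structs, arrays and maps, as nullable.
  def forceNullable(dt: DataType): DataType = dt match {
    case StructType(fields) =>
      StructType(fields.map { f =>
        f.copy(dataType = forceNullable(f.dataType), nullable = true)
      })
    case ArrayType(elementType, _) =>
      ArrayType(forceNullable(elementType), containsNull = true)
    case MapType(keyType, valueType, _) =>
      MapType(forceNullable(keyType), forceNullable(valueType),
        valueContainsNull = true)
    case other => other
  }

  // Hypothetical flag from the question above: keep the schema untouched when
  // the format carries an explicit schema (e.g. Avro), relax it otherwise.
  def effectiveSchema(userSchema: StructType,
                      formatHasExplicitSchema: Boolean): StructType =
    if (formatHasExplicitSchema) userSchema
    else forceNullable(userSchema).asInstanceOf[StructType]
}

val strict = StructType(Seq(
  StructField("OrderID", LongType, nullable = false),
  StructField("Discount", FloatType, nullable = true)))

// Avro-like source: required fields stay required.
NullabilitySketch.effectiveSchema(strict, formatHasExplicitSchema = true).printTreeString()
// JSON-like source: every field becomes nullable, as happens today.
NullabilitySketch.effectiveSchema(strict, formatHasExplicitSchema = false).printTreeString()
```

The snippet can be pasted into spark-shell. In a real change the decision would
presumably live in the data source itself rather than in a caller-supplied
boolean; the boolean is just the simplest way to show the two behaviours side
by side.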
> Applied JSON Schema Works for json RDD but not when loading json file
> ---------------------------------------------------------------------
>
> Key: SPARK-10848
> URL: https://issues.apache.org/jira/browse/SPARK-10848
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 1.5.0
> Reporter: Miklos Christine
> Priority: Minor
>
> Using a defined schema to load a json rdd works as expected. Loading the json
> records from a file does not apply the supplied schema. Mainly the nullable
> field isn't applied correctly. Loading from a file uses nullable=true on all
> fields regardless of applied schema.
> Code to reproduce:
> {code}
> import org.apache.spark.sql.types._
> val jsonRdd = sc.parallelize(List(
> """{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16",
> "ProductCode": "WQT648", "Qty": 5}""",
> """{"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11",
> "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25,
> "expressDelivery":true}"""))
> val mySchema = StructType(Array(
> StructField(name="OrderID" , dataType=LongType, nullable=false),
> StructField("CustomerID", IntegerType, false),
> StructField("OrderDate", DateType, false),
> StructField("ProductCode", StringType, false),
> StructField("Qty", IntegerType, false),
> StructField("Discount", FloatType, true),
> StructField("expressDelivery", BooleanType, true)
> ))
> val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
> val schema1 = myDF.printSchema
> val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
> val schema2 = dfDFfromFile.printSchema
> {code}
> Orders.json
> {code}
> {"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode":
> "WQT648", "Qty": 5}
> {"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11", "ProductCode":
> "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
> {code}
> The behavior should be consistent.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)