Miklos Christine created SPARK-10848:
----------------------------------------
Summary: Applied JSON Schema Works for json RDD but not when loading json file
Key: SPARK-10848
URL: https://issues.apache.org/jira/browse/SPARK-10848
Project: Spark
Issue Type: Bug
Components: SQL
Affects Versions: 1.5.0
Reporter: Miklos Christine
Priority: Minor
Applying a user-defined schema when loading a JSON RDD works as expected.
Loading the same JSON records from a file does not honor the supplied schema.
In particular, the nullable flag is ignored: loading from a file reports
nullable=true on every field, regardless of the schema that was applied.
Code to reproduce:
{code}
import org.apache.spark.sql.types._
val jsonRdd = sc.parallelize(List(
"""{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16",
"ProductCode": "WQT648", "Qty": 5}""",
"""{"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11",
"ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25,
"expressDelivery":true}"""))
val mySchema = StructType(Array(
StructField(name="OrderID" , dataType=LongType, nullable=false),
StructField("CustomerID", IntegerType, false),
StructField("OrderDate", DateType, false),
StructField("ProductCode", StringType, false),
StructField("Qty", IntegerType, false),
StructField("Discount", FloatType, true),
StructField("expressDelivery", BooleanType, true)
))
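// Schema applied to an RDD[String] of JSON records: nullable flags are kept.
val myDF = sqlContext.read.schema(mySchema).json(jsonRdd)
myDF.printSchema()
// Same schema applied to a JSON file: every field shows nullable = true.
val dfDFfromFile = sqlContext.read.schema(mySchema).json("Orders.json")
dfDFfromFile.printSchema()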
{code}
Orders.json
{code}
{"OrderID": 1, "CustomerID":452 , "OrderDate": "2015-05-16", "ProductCode": "WQT648", "Qty": 5}
{"OrderID": 2, "CustomerID":16 , "OrderDate": "2015-07-11", "ProductCode": "LG4-Z5", "Qty": 10, "Discount":0.25, "expressDelivery":true}
{code}
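The difference can also be checked programmatically rather than by reading the printSchema output. The following is a minimal sketch, assuming it is pasted into the same spark-shell session as the reproduction above, so that myDF, dfDFfromFile and mySchema are already defined. The helper name nullability is hypothetical and not part of Spark.
{code}
import org.apache.spark.sql.types.StructType

// Hypothetical helper: reduce a schema to (field name, nullable flag) pairs.
def nullability(schema: StructType): Seq[(String, Boolean)] =
  schema.fields.map(f => (f.name, f.nullable)).toSeq

// The flags requested in the user-supplied schema.
val expected = nullability(mySchema)

// Compare each DataFrame's reported flags against the requested ones.
println("RDD load matches supplied schema:  " + (nullability(myDF.schema) == expected))
println("File load matches supplied schema: " + (nullability(dfDFfromFile.schema) == expected))
{code}
If the behavior described in this report holds, the two comparisons should disagree, with only the RDD-based load matching the supplied schema.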
The behavior should be consistent: both load paths should honor the supplied
schema, including its nullable flags.
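Until the two paths agree, one possible workaround is to re-apply the schema to the rows that were read from the file. This is only a sketch and is not part of the original report: it assumes the same spark-shell session as the reproduction (sqlContext, mySchema), and the name fixedDF is hypothetical. It restores the schema metadata but does not validate the data, so a null in a field declared nullable=false could still cause failures downstream.
{code}
// Read the file with the supplied schema (nullable flags may come back as true).
val fromFile = sqlContext.read.schema(mySchema).json("Orders.json")

// Rebuild a DataFrame from the underlying RDD[Row] using the intended schema.
// createDataFrame(RDD[Row], StructType) takes the schema as given.
val fixedDF = sqlContext.createDataFrame(fromFile.rdd, mySchema)

// Inspect the flags on the rebuilt DataFrame.
fixedDF.printSchema()
{code}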