[
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael Armbrust updated SPARK-2870:
------------------------------------
Target Version/s: 1.2.0
> Thorough schema inference directly on RDDs of Python dictionaries
> -----------------------------------------------------------------
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
> Issue Type: Improvement
> Components: PySpark, SQL
> Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods.
> They process JSON text directly and infer a schema that covers the entire
> source data set.
> This is very important with semi-structured data like JSON since individual
> elements in the data set are free to have different structures. Matching
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a
> schema by looking at the whole data set. The aforementioned
> {{SQLContext.json...()}} methods do this very well.
> h4. Feature Request
> What we need is for {{SQLContext.inferSchema()}} to do this, too.
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of
> Python dictionaries and does something functionally equivalent to this:
> {code}
> # rdd_of_dicts: an RDD of Python dicts; sqlContext: a SQLContext instance; needs "import json"
> sqlContext.jsonRDD(rdd_of_dicts.map(json.dumps))
> {code}
> As of 1.0.2,
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
> just looks at the first element in the data set. This won't help much when
> the structure of the elements in the target RDD is variable.
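
To make the difference concrete, here is a minimal pure-Python sketch (no Spark required) that contrasts looking only at the first element with merging a schema across every element. The helper names (record_schema, merge_schemas, infer_schema) are hypothetical, and the rule that conflicting types widen to a string is an assumption made for illustration, not a description of how Spark implements inference.

```python
from functools import reduce


def record_schema(record):
    """Map each field of one dict to the name of its Python type."""
    return {key: type(value).__name__ for key, value in record.items()}


def merge_schemas(left, right):
    """Union two schemas; fields whose types conflict fall back to 'str'."""
    merged = dict(left)
    for key, type_name in right.items():
        if key in merged and merged[key] != type_name:
            merged[key] = "str"  # assumed widening rule, for illustration only
        else:
            merged[key] = type_name
    return merged


def infer_schema(records):
    """Fold the per-record schemas of every element into one schema."""
    return reduce(merge_schemas, (record_schema(r) for r in records), {})


records = [{"a": 5}, {"a": "cow", "b": 1.5}]

first_only = record_schema(records[0])  # what a first-element approach sees
whole_set = infer_schema(records)       # what a whole-data-set approach sees
```

With the two records above, the first-element schema knows only an integer field "a", while the merged schema also contains "b" and records that "a" holds mixed types. On an RDD the same fold could in principle be expressed with map() and reduce(), since merge_schemas is associative and commutative.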
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL.
> * You would use one of the {{SQLContext.json...()}} methods, but you need to
> do some filtering on the data first to remove bad elements--basically, some
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}}s and filter out the
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole
> data set.
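
The steps in this use case can be sketched in plain Python as follows. The sample lines, the is_valid() check, and the required "id" field are all hypothetical; with Spark, the list operations would be the corresponding map() and filter() calls on an RDD, and the final JSON strings are what the jsonRDD() workaround from the feature request would consume.

```python
import json

# Hypothetical input: one JSON object per line, one of them missing "id".
raw_lines = ['{"id": 1, "a": 5}', '{"a": "cow"}', '{"id": 2, "a": "cow"}']


def is_valid(record):
    """Hypothetical minimal validation: keep dicts that carry an 'id' field."""
    return isinstance(record, dict) and "id" in record


# Deserialize and filter; on an RDD this would be
# rdd.map(json.loads).filter(is_valid).
records = [r for r in map(json.loads, raw_lines) if is_valid(r)]

# Workaround described in the issue: serialize the surviving dicts back to
# JSON text so that a text-based method can infer a schema over all of them.
json_lines = [json.dumps(r) for r in records]
```

The round trip through JSON text is the overhead that the requested feature would avoid: the dictionaries already exist after filtering, yet they have to be serialized again only to reach whole-data-set schema inference.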