David Courtinot created SPARK-23494:

             Summary: Expose InferSchema's functionalities to the outside
                 Key: SPARK-23494
                 URL: https://issues.apache.org/jira/browse/SPARK-23494
             Project: Spark
          Issue Type: Improvement
          Components: Spark Core, SQL
    Affects Versions: 2.2.1
            Reporter: David Courtinot

I'm proposing that InferSchema's internals (inferring the schema of each record, 
merging two schemata, and canonicalizing the result) be exposed to the outside.
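
To make the three operations concrete, here is a minimal plain-Python sketch. It is not Spark's InferSchema code: the type names, the widening rules, and the functions infer_type, merge and canonicalize are simplifying assumptions chosen only for illustration.

```python
import json
from functools import reduce

NULL = "null"  # placeholder type for values that were only ever seen as null


def infer_type(value):
    """Infer a schema for one parsed JSON value (illustrative rules only)."""
    if value is None:
        return NULL
    if isinstance(value, bool):  # test bool before int: bool subclasses int
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return ("array", reduce(merge, map(infer_type, value), NULL))
    if isinstance(value, dict):
        return {name: infer_type(v) for name, v in value.items()}
    raise TypeError(f"unsupported JSON value: {value!r}")


def merge(a, b):
    """Merge two schemata into one that accepts both."""
    if a == b:
        return a
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        # A field missing on one side is treated as null on that side.
        return {k: merge(a.get(k, NULL), b.get(k, NULL)) for k in a.keys() | b.keys()}
    if isinstance(a, tuple) and isinstance(b, tuple):
        return ("array", merge(a[1], b[1]))
    if isinstance(a, str) and isinstance(b, str) and {a, b} == {"long", "double"}:
        return "double"
    return "string"  # assumed fallback for incompatible types


def canonicalize(t):
    """Sort struct fields and replace leftover null types with string."""
    if t == NULL:
        return "string"
    if isinstance(t, dict):
        return {k: canonicalize(t[k]) for k in sorted(t)}
    if isinstance(t, tuple):
        return ("array", canonicalize(t[1]))
    return t


records = ['{"id": 1, "tags": ["a"]}', '{"id": 2.5, "rare": {"x": null}}']
schema = canonicalize(
    reduce(merge, (infer_type(json.loads(r)) for r in records), NULL)
)
```

In this sketch, schema ends up with the fields id, rare and tags: id is widened to double, and rare.x falls back to string because only null was ever observed for it.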


We continuously produce large amounts of JSON data. The schema is and must be 
very dynamic: fields can appear and disappear from one day to the next, most 
fields are nullable, some fields occur only rarely, etc.

We then consume this data, sample it, and infer the schema using 
Dataset.schema(). From there, we output the data in Parquet for later querying. 
This approach has proved problematic:
 * rare fields can be absent from a sample, and therefore absent from the 
schema. This results in exceptions when trying to query those fields. We have 
had to implement cumbersome fixes for this involving a manually curated set of 
required fields.
 * this is expensive. Going through a sample of the data to infer the schema is 
still a very costly operation for us. Caching the JSON RDD to disk (it doesn't 
fit in memory) proved at least as slow as traversing the sample first and the 
whole data next.


InferSchema is essentially a fold operator. This means a Spark accumulator can 
easily be built on top of it in order to calculate a schema alongside an RDD 
calculation. In the above use-case, it has three main advantages:
 * the schema is inferred on the entire data, therefore contains all possible 
fields.
 * the computational overhead is negligible since it happens at the same time 
as writing the data to an external store rather than by evaluating the RDD for 
the sole purpose of schema inference.
 * after writing the manifest to an external store, we can load the JSON data 
in a Dataset without ever paying the inference cost again (just the conversion 
from JSON to Row).
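
The single-pass idea can be sketched in plain Python, without Spark. The class SchemaAccumulator and the function write_partition are hypothetical names, and the "schema" here is deliberately reduced to a map from top-level field name to the Python type names seen. A real implementation would presumably extend Spark's AccumulatorV2 (which has add and merge methods) and delegate to InferSchema's merge logic.

```python
import io
import json


class SchemaAccumulator:
    """Hypothetical accumulator: top-level field name -> set of type names seen."""

    def __init__(self):
        self.fields = {}

    def add(self, record):
        # Fold one record into the running schema.
        for name, value in record.items():
            self.fields.setdefault(name, set()).add(type(value).__name__)

    def merge(self, other):
        # Combine the partial schema computed on another partition.
        for name, types in other.fields.items():
            self.fields.setdefault(name, set()).update(types)
        return self


def write_partition(lines, sink, acc):
    """Write every record to the sink and update the schema in the same pass."""
    for line in lines:
        record = json.loads(line)
        acc.add(record)
        sink.write(json.dumps(record) + "\n")


# Two made-up "partitions" of JSON lines; the field "rare" appears only once.
partitions = [
    ['{"id": 1}', '{"id": 2, "rare": "x"}'],
    ['{"id": 3, "score": 0.5}'],
]

sink = io.StringIO()  # stands in for the external store
partials = []
for lines in partitions:
    acc = SchemaAccumulator()
    write_partition(lines, sink, acc)
    partials.append(acc)

# Driver side: merge the per-partition results into one schema.
total = SchemaAccumulator()
for partial in partials:
    total.merge(partial)
```

Because the merge is associative and commutative, partial results can be combined in any order, which is what makes the fold suitable for an accumulator. The rare field is picked up even though it occurs in a single record, and the data is traversed only once.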

With such a feature, users can treat their JSON (or any other) data as 
structured data whenever they want, even though the actual schema may vary 
every ten minutes, as long as they record the schema of each portion of data.

