[ 
https://issues.apache.org/jira/browse/SPARK-23494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hyukjin Kwon updated SPARK-23494:
---------------------------------
    Labels: bulk-closed  (was: )

> Expose InferSchema's functionalities to the outside
> ---------------------------------------------------
>
>                 Key: SPARK-23494
>                 URL: https://issues.apache.org/jira/browse/SPARK-23494
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.1
>            Reporter: David Courtinot
>            Priority: Major
>              Labels: bulk-closed
>
> I'm proposing that InferSchema's internals (inferring the schema of each 
> record, merging two schemata, and canonicalizing the result) be exposed to 
> the outside.
> *Use-case*
> My team continuously produces large amounts of JSON data. The schema is, and 
> must be, very dynamic: fields can appear and disappear from one day to the 
> next, most fields are nullable, some fields occur with very low frequency, 
> etc.
> In another job, we download this data, sample it, and infer the schema using 
> Dataset.schema(). From there, we convert the data to Parquet and upload it 
> somewhere for later querying. This approach has proved problematic:
>  * rare fields can be absent from a sample, and therefore absent from the 
> schema. This results in exceptions when trying to query those fields. We have 
> had to implement cumbersome workarounds involving a manually curated set of 
> required fields.
>  * this is expensive. Traversing a sample of the data to infer the schema is 
> still a very costly operation for us. Caching the JSON RDD to disk (it does 
> not fit in memory) turned out to be at least as slow as traversing the 
> sample first and then the whole data.
> *Proposition*
> InferSchema is essentially a fold operator. This means a Spark accumulator 
> can easily be built on top of it in order to compute a schema alongside an 
> RDD computation. In the above use-case, this has three main advantages:
>  * the schema is inferred on the entire data, and therefore contains all 
> possible fields, no matter how low their frequency is.
>  * the computational overhead is negligible since it happens at the same time 
> as writing the data to an external store rather than by evaluating the RDD 
> for the sole purpose of schema inference.
>  * after writing the schema to an external store, we can load the JSON data 
> in a Dataset without ever paying the inference cost again (just the 
> conversion from JSON to Row). We keep the advantages and flexibility of JSON 
> while also benefiting from the powerful features and optimizations available 
> on Datasets or Parquet itself.
> With such a feature, users can decide to treat their JSON (or any other) 
> data as structured data whenever they want, even if the actual schema varies 
> every ten minutes, as long as they record the schema of each portion of the 
> data.
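To make the "fold operator" point concrete, here is a minimal standalone sketch of the two operations the proposal would expose: per-record schema inference and a commutative, associative merge. This is plain Python and deliberately not Spark's actual InferSchema API; the type names and the `infer`/`merge` helpers are illustrative assumptions, but the fold structure is exactly what an accumulator would rely on.

```python
# Sketch of fold-style schema inference, independent of Spark (illustrative
# only). infer() maps one JSON-like record to a schema; merge() combines two
# schemata, widening numeric types and unioning struct fields -- the two
# operations an accumulator built on InferSchema would apply per record.

def infer(value):
    """Infer a schema for a single JSON value."""
    if isinstance(value, bool):          # check bool before int (bool is an int subclass)
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, list):
        elem = "null"
        for v in value:
            elem = merge(elem, infer(v))
        return ("array", elem)
    if isinstance(value, dict):
        return ("struct", {k: infer(v) for k, v in value.items()})
    raise TypeError(f"unsupported value: {value!r}")

def merge(a, b):
    """Merge two schemata (commutative and associative, hence foldable)."""
    if a == b:
        return a
    if a == "null":      # absent/None fields stay compatible with anything
        return b
    if b == "null":
        return a
    if a in ("long", "double") and b in ("long", "double"):
        return "double"  # widen integers to doubles
    if isinstance(a, tuple) and isinstance(b, tuple) and a[0] == b[0]:
        if a[0] == "array":
            return ("array", merge(a[1], b[1]))
        if a[0] == "struct":
            keys = set(a[1]) | set(b[1])
            return ("struct", {k: merge(a[1].get(k, "null"),
                                        b[1].get(k, "null"))
                               for k in keys})
    return "string"      # incompatible types: fall back to string

# Folding over records yields a schema covering every field ever seen,
# including rare ones that a sample could miss.
records = [
    {"id": 1, "tags": ["a"]},
    {"id": 2.5, "extra": {"nested": True}},  # rare field; also widens id
]
schema = "null"
for rec in records:
    schema = merge(schema, infer(rec))
# schema -> ("struct", {"id": "double", "tags": ("array", "string"),
#                       "extra": ("struct", {"nested": "boolean"})})
```

Because `merge` is commutative and associative with `"null"` as the identity, it fits the accumulator contract directly: each partition folds its records locally, and partial schemata are merged on the driver at no extra pass over the data.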



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
