[
https://issues.apache.org/jira/browse/SPARK-23494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Hyukjin Kwon updated SPARK-23494:
---------------------------------
Labels: bulk-closed (was: )
> Expose InferSchema's functionalities to the outside
> ---------------------------------------------------
>
> Key: SPARK-23494
> URL: https://issues.apache.org/jira/browse/SPARK-23494
> Project: Spark
> Issue Type: Improvement
> Components: Spark Core, SQL
> Affects Versions: 2.2.1
> Reporter: David Courtinot
> Priority: Major
> Labels: bulk-closed
>
> I'm proposing that InferSchema's internals (inferring the schema of each record,
> merging two schemata, and canonicalizing the result) be exposed publicly.
> *Use-case*
> My team continuously produces large amounts of JSON data. The schema is, and
> must remain, very dynamic: fields come and go from one day to the next, most
> fields are nullable, and some fields occur only rarely.
> In another job, we download this data, sample it, and infer the schema using
> Dataset.schema(). From there, we convert the data to Parquet and upload it
> somewhere for later querying. This approach has proved problematic:
> * rare fields can be absent from a sample, and therefore absent from the
> schema. This results in exceptions when trying to query those fields. We have
> had to implement cumbersome fixes involving a manually curated set of
> required fields.
> * this is expensive. Even traversing only a sample of the data to infer the
> schema is a very costly operation for us. Caching the JSON RDD to disk (it
> doesn't fit in memory) turned out to be at least as slow as traversing the
> sample first and the whole data afterwards.
> *Proposition*
> InferSchema is essentially a fold operator. This means a Spark accumulator
> can easily be built on top of it in order to calculate a schema alongside any
> RDD computation. In the above use-case, this has two main advantages:
> * the schema is inferred over the entire data, and therefore contains every
> field that occurs, no matter how rare it is.
> * the computational overhead is negligible, since inference happens while the
> data is being written to an external store rather than requiring a separate
> evaluation of the RDD for the sole purpose of schema inference.
> * after writing the schema to an external store, we can load the JSON data
> in a Dataset without ever paying the inference cost again (just the
> conversion from JSON to Row). We keep the advantages and flexibility of JSON
> while also benefiting from the powerful features and optimizations available
> on Datasets or Parquet itself.
> With such a feature, users can treat their JSON (or other semi-structured)
> data as structured data whenever they want, even if the actual schema varies
> every ten minutes, as long as they record the schema of each portion of the
> data.
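To illustrate why inference folds cleanly, here is a minimal plain-Python sketch of the "infer each record, then merge" semantics described above. The helper names (infer_type, merge) and the type-widening rules are simplified assumptions for illustration, not Spark's actual InferSchema code; the point is only that merge is commutative and associative, which is exactly what an accumulator requires.

```python
import json

def infer_type(value):
    """Infer a minimal schema node for a single JSON value."""
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if value is None:
        return "null"
    if isinstance(value, dict):
        return {k: infer_type(v) for k, v in value.items()}
    if isinstance(value, list):
        element = "null"
        for item in value:
            element = merge(element, infer_type(item))
        return ("array", element)
    raise TypeError(f"unsupported JSON value: {value!r}")

def merge(a, b):
    """Commutative, associative merge of two schema nodes: the fold step."""
    if a == b:
        return a
    if a == "null":
        return b
    if b == "null":
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        # a field absent on one side merges as "null", i.e. it stays nullable
        return {k: merge(a.get(k, "null"), b.get(k, "null"))
                for k in set(a) | set(b)}
    if isinstance(a, tuple) and isinstance(b, tuple):
        return ("array", merge(a[1], b[1]))
    if isinstance(a, str) and isinstance(b, str):
        if {a, b} == {"long", "double"}:
            return "double"  # widen mixed numerics
    return "string"  # incompatible types fall back to string

# Fold the per-record schemata. On Spark, this step could live inside an
# accumulator updated while another action (e.g. writing the data) runs.
records = ['{"a": 1, "b": {"c": true}}', '{"a": 2.5, "d": "x"}']
schema = "null"
for line in records:
    schema = merge(schema, infer_type(json.loads(line)))
```

Because merge is order-insensitive, partitions can infer partial schemata independently and the driver can combine them in any order, which is what makes the accumulator formulation sound.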
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]