[ 
https://issues.apache.org/jira/browse/SPARK-23494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Courtinot updated SPARK-23494:
------------------------------------
    Description: 
I'm proposing that InferSchema's internals (infer the schema of each record, 
merge two schemata, and canonicalize the result) be exposed to the outside.

*Use-case*

My team continuously produces large amounts of JSON data. The schema is and 
must be very dynamic: fields can appear and disappear from one day to the next, 
most fields are nullable, some fields occur only rarely, etc.

In another job, we download this data, sample it, and infer the schema using 
Dataset.schema(). From there, we output the data in Parquet for later querying. 
This approach has proved problematic:
 * rare fields can be absent from a sample, and therefore absent from the 
schema. This results in exceptions when trying to query those fields. We have 
had to implement cumbersome fixes for this involving a manually curated set of 
required fields.
 * this is expensive. Going through a sample of the data to infer the schema is 
still a very costly operation for us. Caching the JSON RDD to disk (it doesn't 
fit in memory) proved to be at least as slow as traversing the sample first, 
and the whole data next.

*Proposition*

InferSchema is essentially a fold operator. This means a Spark accumulator can 
easily be built on top of it in order to calculate a schema alongside an RDD 
calculation. In the above use-case, it has three main advantages:
 * the schema is inferred on the entire data, therefore contains all possible 
fields
 * the computational overhead is negligible since it happens at the same time 
as writing the data to an external store rather than by evaluating the RDD for 
the sole purpose of schema inference.
 * after writing the schema manifest to an external store, we can load the JSON 
data in a Dataset without ever paying the inference cost again (just the 
conversion from JSON to Row).
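
To illustrate why a fold-shaped inference lends itself to an accumulator, here is a minimal, self-contained sketch of the three steps named above (infer per record, merge two schemata, canonicalize). It is a toy model, not Spark's implementation: SchemaFold, FieldType, inferRecord, mergeType, merge, canonicalize and inferAll are all hypothetical names, and records are modelled as plain Scala maps rather than parsed JSON.

```scala
// Hypothetical, simplified model of schema inference as a fold.
// None of these names are Spark APIs; they only show the shape of the operation.
object SchemaFold {
  sealed trait FieldType
  case object NullT extends FieldType
  case object LongT extends FieldType
  case object DoubleT extends FieldType
  case object StringT extends FieldType

  // A flat schema: field name -> inferred type (nested structures are left out).
  type Schema = Map[String, FieldType]

  // Step 1: infer the schema of a single record.
  def inferRecord(record: Map[String, Any]): Schema =
    record.map { case (name, value) =>
      val t: FieldType = value match {
        case null                 => NullT
        case _: Int | _: Long     => LongT
        case _: Float | _: Double => DoubleT
        case _                    => StringT
      }
      name -> t
    }

  // Widen two types that were observed for the same field.
  def mergeType(a: FieldType, b: FieldType): FieldType = (a, b) match {
    case (x, y) if x == y                    => x
    case (NullT, other)                      => other
    case (other, NullT)                      => other
    case (LongT, DoubleT) | (DoubleT, LongT) => DoubleT
    case _                                   => StringT
  }

  // Step 2: merge two schemata (union of fields, widened types on collisions).
  def merge(a: Schema, b: Schema): Schema =
    b.foldLeft(a) { case (acc, (name, t)) =>
      acc.updated(name, acc.get(name).map(mergeType(_, t)).getOrElse(t))
    }

  // Step 3: canonicalize; in this toy model, fields only ever seen as null become strings.
  def canonicalize(s: Schema): Schema =
    s.map { case (name, t) => name -> (if (t == NullT) StringT else t) }

  // The whole inference is a fold of `merge` over the per-record schemata.
  def inferAll(records: Iterator[Map[String, Any]]): Schema =
    canonicalize(records.map(inferRecord).foldLeft(Map.empty: Schema)(merge))
}
```

In this sketch merge is associative and commutative, so partial schemata computed on different partitions can be combined in any order. That is the property an accumulator would rely on.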

With such a feature, users can treat their JSON (or any other) data as 
structured data whenever they want, even though the actual schema may vary 
every ten minutes, as long as they record the schema of each portion of data.
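
Recording the schema of each portion of data and reusing it at load time could look roughly like the sketch below. It assumes Spark SQL is on the classpath and that the manifest is a file on the local filesystem. SchemaManifest and its methods are hypothetical helper names. The Spark calls it relies on are StructType.json, DataType.fromJson and DataFrameReader.schema.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DataType, StructType}

// Hypothetical helper: stores a schema as a JSON manifest and reuses it on read.
object SchemaManifest {
  // Serialize the schema to Spark's JSON representation and write it to a local file.
  def save(schema: StructType, manifestPath: String): Unit = {
    Files.write(Paths.get(manifestPath), schema.json.getBytes(StandardCharsets.UTF_8))
    ()
  }

  // Read the manifest back into a StructType.
  def load(manifestPath: String): StructType = {
    val json = new String(Files.readAllBytes(Paths.get(manifestPath)), StandardCharsets.UTF_8)
    DataType.fromJson(json).asInstanceOf[StructType]
  }

  // Load one portion of JSON data with its recorded schema, so no inference pass runs.
  def readWithManifest(spark: SparkSession, dataPath: String, manifestPath: String): DataFrame =
    spark.read.schema(load(manifestPath)).json(dataPath)
}
```

Because the schema is supplied explicitly, the reader only pays for the conversion from JSON to Row.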

  was:
I'm proposing that InferSchema's internals (infer the schema of each record, 
merge two schemata, and canonicalize the result) to be exposed to the outside.

*Use-case*

My team continuously produces large amounts of JSON data. The schema is and 
must be very dynamic: fields can appear and go from one day to another, most 
fields are nullable, some fields have small frequency etc.

We then consume this data, sample it, infer the schema using Dataset.schema(). 
From there, we output the data in Parquet for later querying. This approach has 
proved problematic:
 *  rare fields can be absent from a sample, and therefore absent from the 
schema. This results on exceptions when trying to query those fields. We have 
had to implement cumbersome fixes for this involving a manually curated set of 
required fields.
 * this is expensive. Going through a sample of the data to infer the schema is 
still a very costly operation for us. Caching the JSON RDD to disk (doesn't fit 
in memory) revealed at least as slow as traversing the sample first, and the 
whole data next.

*Proposition*

InferSchema is essentially a fold operator. This means a Spark accumulator can 
easily be built on top of it in order to calculate a schema alongside an RDD 
calculation. In the above use-case, it has two main advantages:
 * the schema is inferred on the entire data, therefore contains all possible 
fields
 * the computational overhead is negligible since it happens at the same time 
as writing the data to an external store rather than by evaluating the RDD for 
the sole purpose of schema inference.
 * after writing the manifest to an external store, we can load the JSON data 
in a Dataset without ever paying the infer cost again (just the conversion from 
JSON to Row).

With such feature, users can decide to use their JSON (or whatever else) data 
as structured data whenever they want to even though the actual schema may vary 
every ten minutes as long as they record the schema of each portion of data.


> Expose InferSchema's functionalities to the outside
> ---------------------------------------------------
>
>                 Key: SPARK-23494
>                 URL: https://issues.apache.org/jira/browse/SPARK-23494
>             Project: Spark
>          Issue Type: Improvement
>          Components: Spark Core, SQL
>    Affects Versions: 2.2.1
>            Reporter: David Courtinot
>            Priority: Major



