[jira] [Updated] (SPARK-15693) Write schema definition out for file-based data sources to avoid schema inference

Michael Armbrust (JIRA) Thu, 01 Jun 2017 16:03:07 -0700

     [ 
https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Michael Armbrust updated SPARK-15693:
-------------------------------------
    Target Version/s: 2.3.0  (was: 2.2.0)

> Write schema definition out for file-based data sources to avoid schema 
> inference
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-15693
>                 URL: https://issues.apache.org/jira/browse/SPARK-15693
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>
> Spark supports reading a variety of data formats, many of which don't have 
> self-describing schemas. For these file formats, Spark can often infer the 
> schema by going through all the data. However, schema inference is expensive 
> and does not always infer the intended schema (for example, with JSON data 
> Spark always infers integer types as long, rather than int).
> It would be great if Spark could write the schema definition out for 
> file-based formats so that, when reading the data in, the schema can be 
> "inferred" directly by reading the schema definition file without going 
> through full schema inference. If the file does not exist, then the good old 
> schema inference should be performed.
> This ticket certainly merits a design doc that should discuss the spec for 
> the schema definition, as well as all the corner cases that this feature 
> needs to handle (e.g. schema merging, schema evolution, partitioning). It 
> would be great if the schema definition used a human-readable format (e.g. 
> JSON).
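
The read path proposed in the description (use a schema definition file when it exists, otherwise fall back to a full inference scan) can be sketched in a few lines. This is only an illustration under stated assumptions, not Spark code and not a proposed spec: it uses the Python standard library, the sidecar name `_schema.json` and the helpers `infer_schema`, `write_schema`, and `load_schema` are hypothetical, and the "schema" is just a flat mapping of field name to type name.

```python
import json
import os
import tempfile

# Hypothetical sidecar file name; the ticket does not define one.
SCHEMA_FILE = "_schema.json"


def infer_schema(data_path):
    """Full scan over JSON-lines data: map each field to a type name."""
    schema = {}
    with open(data_path) as f:
        for line in f:
            if not line.strip():
                continue
            for key, value in json.loads(line).items():
                seen = type(value).__name__
                if key in schema and schema[key] != seen:
                    seen = "str"  # conflicting types: fall back to string
                schema[key] = seen
    return schema


def write_schema(data_dir, schema):
    """Persist the schema next to the data in a human-readable format."""
    with open(os.path.join(data_dir, SCHEMA_FILE), "w") as f:
        json.dump(schema, f, indent=2, sort_keys=True)


def load_schema(data_dir, data_file):
    """Prefer the schema definition file; otherwise run full inference."""
    schema_path = os.path.join(data_dir, SCHEMA_FILE)
    if os.path.exists(schema_path):
        with open(schema_path) as f:
            return json.load(f), "definition"
    return infer_schema(os.path.join(data_dir, data_file)), "inferred"


with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "part-0.json"), "w") as f:
        f.write('{"id": 1, "name": "a"}\n{"id": 2, "name": "b"}\n')
    first = load_schema(d, "part-0.json")   # no sidecar yet: scans the data
    write_schema(d, first[0])
    second = load_schema(d, "part-0.json")  # sidecar present: no data scan
```

The sketch deliberately ignores the corner cases the description calls out (schema merging, schema evolution, partitioning); a real design would have to decide how the definition file behaves in each of them.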



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

