[ https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Armbrust updated SPARK-15693:
-------------------------------------
Target Version/s: 2.3.0 (was: 2.2.0)
> Write schema definition out for file-based data sources to avoid schema
> inference
> ---------------------------------------------------------------------------------
>
> Key: SPARK-15693
> URL: https://issues.apache.org/jira/browse/SPARK-15693
> Project: Spark
> Issue Type: New Feature
> Components: SQL
> Reporter: Reynold Xin
>
> Spark supports reading a variety of data formats, many of which don't have
> self-describing schemas. For these file formats, Spark can often infer the
> schema by going through all the data. However, schema inference is expensive
> and does not always infer the intended schema (for example, with JSON data
> Spark always infers integer types as long, rather than int).
> It would be great if Spark could write the schema definition out for file-based
> formats, so that when reading the data in, the schema can be "inferred" directly
> by reading the schema definition file without going through full schema
> inference. If the file does not exist, then the good old schema inference
> should be performed.
> This ticket certainly merits a design doc that should discuss the spec for
> the schema definition, as well as all the corner cases that this feature needs
> to handle (e.g. schema merging, schema evolution, partitioning). It would be
> great if the schema definition used a human-readable format (e.g. JSON).
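
The behaviour requested in the description can be sketched in plain Python, without Spark, to make the intended read path concrete: use the schema definition file when it exists, and fall back to scanning the data when it does not. This is only an illustration under stated assumptions. The file names `_schema.json` and `part-00000.json`, the helper functions, and the flat field-to-type JSON layout are all hypothetical. None of them comes from the ticket or from any Spark API, and the actual spec is left to the design doc.

```python
import json
import os
import tempfile

SCHEMA_FILE = "_schema.json"   # hypothetical sidecar file name
DATA_FILE = "part-00000.json"  # hypothetical data file name


def infer_schema(records):
    """Fallback: scan every record and guess a type name per field."""
    # Integers map to "long", mirroring the JSON behaviour the ticket describes.
    type_names = {bool: "boolean", int: "long", float: "double", str: "string"}
    schema = {}
    for record in records:
        for field, value in record.items():
            schema.setdefault(field, type_names.get(type(value), "string"))
    return schema


def write_dataset(directory, records, schema=None):
    """Write JSON-lines data, plus the schema definition if one is given."""
    with open(os.path.join(directory, DATA_FILE), "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    if schema is not None:
        with open(os.path.join(directory, SCHEMA_FILE), "w") as f:
            json.dump(schema, f, indent=2)


def load_schema(directory):
    """Return (schema, source): the definition file if present, else inference."""
    schema_path = os.path.join(directory, SCHEMA_FILE)
    if os.path.exists(schema_path):
        with open(schema_path) as f:
            return json.load(f), "definition file"
    # No definition file: read all the data and infer, as before.
    with open(os.path.join(directory, DATA_FILE)) as f:
        records = [json.loads(line) for line in f if line.strip()]
    return infer_schema(records), "inference"


records = [{"id": 1, "name": "a"}, {"id": 2, "name": "b"}]

with tempfile.TemporaryDirectory() as with_schema, \
        tempfile.TemporaryDirectory() as without_schema:
    write_dataset(with_schema, records, schema={"id": "int", "name": "string"})
    write_dataset(without_schema, records)
    declared = load_schema(with_schema)
    inferred = load_schema(without_schema)

print(declared)
print(inferred)
```

In this sketch the directory with a definition file keeps `id` as `int` and never reads the data, while the directory without one scans every record and ends up with `long`. Schema merging, schema evolution, and partitioning are deliberately not covered.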