[ 
https://issues.apache.org/jira/browse/SPARK-15693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16383787#comment-16383787
 ] 

Jayesh Thakrar commented on SPARK-15693:
----------------------------------------

Hello [~rxin] - just to clarify the requirements / intent for this story:

1) Every time data is exported from Spark, it would be nice to have a 
non-data / non-impacting file written alongside it that contains the schema of 
the exported data (preferably as indented, newline-formatted JSON). 
Non-impacting = a file with a name like, say, _SCHEMA or .__SCHEMA that data 
readers would skip over.
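
A minimal sketch of what 1) could look like if done by hand from user code 
today (not the proposed built-in behavior) - the _SCHEMA file name, the output 
path, and the example DataFrame are assumptions for illustration only:

import org.apache.spark.sql.SparkSession
import org.apache.hadoop.fs.{FileSystem, Path}
import java.nio.charset.StandardCharsets

val spark = SparkSession.builder()
  .appName("schema-sidecar-sketch")
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (2, "b")).toDF("id", "name")   // example data
val outputPath = "/tmp/events"                        // hypothetical location

df.write.mode("overwrite").json(outputPath)

// Write the schema as indented JSON into a sidecar file next to the data.
// Names starting with "_" or "." are already skipped by Spark's file readers,
// so the sidecar file does not affect reads of the data itself.
val fs  = FileSystem.get(spark.sparkContext.hadoopConfiguration)
val out = fs.create(new Path(outputPath, "_SCHEMA"))
out.write(df.schema.prettyJson.getBytes(StandardCharsets.UTF_8))
out.close()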

2) And when that data (or any data) is read by Spark, Spark would look for that 
schema file and, if found, skip examining all the input files to determine the 
schema, instead treating the schema file as the schema for all the input data.
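
A corresponding sketch of 2), continuing from the snippet above: if the 
sidecar file exists, parse it with DataType.fromJson and pass it to 
spark.read.schema(...), otherwise fall back to ordinary inference (again, the 
_SCHEMA name and path are only assumptions):

import org.apache.spark.sql.types.{DataType, StructType}

val schemaFile = new Path(outputPath, "_SCHEMA")

// Use the sidecar schema when present; fall back to inference otherwise.
val events =
  if (fs.exists(schemaFile)) {
    val in     = fs.open(schemaFile)
    val json   = try scala.io.Source.fromInputStream(in, "UTF-8").mkString finally in.close()
    val schema = DataType.fromJson(json).asInstanceOf[StructType]
    spark.read.schema(schema).json(outputPath)   // no inference pass over the data
  } else {
    spark.read.json(outputPath)                  // full schema inference
  }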

I am guessing this ticket is to fulfill 1) only.

> Write schema definition out for file-based data sources to avoid schema 
> inference
> ---------------------------------------------------------------------------------
>
>                 Key: SPARK-15693
>                 URL: https://issues.apache.org/jira/browse/SPARK-15693
>             Project: Spark
>          Issue Type: New Feature
>          Components: SQL
>            Reporter: Reynold Xin
>            Priority: Major
>
> Spark supports reading a variety of data formats, many of which don't have a 
> self-describing schema. For these file formats, Spark can often infer the 
> schema by going through all the data. However, schema inference is expensive 
> and does not always infer the intended schema (for example, with JSON data 
> Spark always infers integer types as long, rather than int).
> It would be great if Spark could write the schema definition out for file-based 
> formats, and when reading the data in, the schema could be "inferred" directly by 
> reading the schema definition file without going through full schema 
> inference. If the file does not exist, then the good old schema inference 
> should be performed.
> This ticket certainly merits a design doc that should discuss the spec for 
> the schema definition, as well as all the corner cases that this feature needs to 
> handle (e.g. schema merging, schema evolution, partitioning). It would be 
> great if the schema definition used a human-readable format (e.g. JSON).


