[
https://issues.apache.org/jira/browse/SPARK-14274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Apache Spark reassigned SPARK-14274:
------------------------------------
Assignee: Cheng Lian (was: Apache Spark)
> Add FileFormat.prepareRead to collect necessary global information
> ------------------------------------------------------------------
>
> Key: SPARK-14274
> URL: https://issues.apache.org/jira/browse/SPARK-14274
> Project: Spark
> Issue Type: Sub-task
> Components: SQL
> Affects Versions: 2.0.0
> Reporter: Cheng Lian
> Assignee: Cheng Lian
> Fix For: 2.0.0
>
>
> One problem with our newly introduced {{FileFormat.buildReader()}} method is
> that it only sees pieces of input files. However, data sources like
> CSV and LibSVM require some sort of global information:
> - CSV: the content of the header line if the {{header}} option is set to true, so
> that we can filter out header lines within each input file. This counts
> as global information because the header may
> appear in the middle of a file, after blocks of comments and empty lines,
> although this is a rare and contrived corner case.
> - LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset
> to infer the total number of features needed to construct the resulting
> {{LabeledPoint}} instances.
> Unfortunately, with our current API, this kind of global information can't be
> gathered.
> The solution proposed here is to add a {{prepareRead}} method, which accepts
> the same arguments as {{inferSchema}} but returns a {{ReadContext}}, which
> contains an {{Option\[StructType\]}} for the inferred schema and a
> {{Map\[String, Any\]}} for any gathered global information. This
> {{ReadContext}} is then passed to {{buildReader()}}. By default,
> {{prepareRead}} simply calls {{inferSchema}} (the inferred schema
> itself can be considered a sort of global information).
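
A minimal sketch of how the proposal above might fit together, using the LibSVM case. Everything here is an assumption made for illustration: {{StructType}} and {{StructField}} are simplified stand-ins so the code runs without Spark, the {{FileFormat}} trait is cut down (the real methods take more arguments), input "files" are plain sequences of lines, and {{LabeledRow}}, {{LibSVMLikeFormat}} and the field name {{globalInfo}} are hypothetical. The option key follows the spelling used in the description.

```scala
// Simplified stand-ins for Spark SQL's schema types (assumption: not the real classes).
case class StructField(name: String, dataType: String)
case class StructType(fields: Seq[StructField])

// Hypothetical shape of the proposed ReadContext: inferred schema plus global information.
case class ReadContext(schema: Option[StructType], globalInfo: Map[String, Any])

// Hypothetical output row for the LibSVM-like example.
case class LabeledRow(label: Double, features: Vector[Double])

// Cut-down FileFormat; each "file" is modelled as a sequence of text lines.
trait FileFormat {
  def inferSchema(options: Map[String, String], files: Seq[Seq[String]]): Option[StructType]

  // Default behaviour described in the issue: only infer the schema.
  def prepareRead(options: Map[String, String], files: Seq[Seq[String]]): ReadContext =
    ReadContext(inferSchema(options, files), Map.empty)

  // The returned reader sees only one piece of the input at a time.
  def buildReader(context: ReadContext): Seq[String] => Seq[LabeledRow]
}

object LibSVMLikeFormat extends FileFormat {
  private val schema =
    StructType(Seq(StructField("label", "double"), StructField("features", "vector")))

  def inferSchema(options: Map[String, String], files: Seq[Seq[String]]): Option[StructType] =
    Some(schema)

  // Largest 1-based feature index on a line such as "1.0 1:0.5 3:2.0".
  private def maxIndex(line: String): Int =
    line.trim.split("\\s+").drop(1).map(_.takeWhile(_ != ':').toInt).foldLeft(0)(_ max _)

  // Scan the whole dataset once, but only when the option is not set.
  override def prepareRead(options: Map[String, String], files: Seq[Seq[String]]): ReadContext = {
    val numFeatures = options.get("numFeature").map(_.toInt)
      .getOrElse(files.flatten.map(maxIndex).foldLeft(0)(_ max _))
    ReadContext(inferSchema(options, files), Map("numFeature" -> numFeatures))
  }

  // Each piece is parsed using the feature count gathered up front.
  def buildReader(context: ReadContext): Seq[String] => Seq[LabeledRow] = {
    val n = context.globalInfo("numFeature").asInstanceOf[Int]
    lines => lines.map { line =>
      val tokens = line.trim.split("\\s+")
      val features = Array.fill(n)(0.0)
      tokens.drop(1).foreach { token =>
        val Array(index, value) = token.split(":")
        features(index.toInt - 1) = value.toDouble
      }
      LabeledRow(tokens.head.toDouble, features.toVector)
    }
  }
}
```

In this sketch, a caller would invoke {{prepareRead}} once over all input files and then hand the resulting {{ReadContext}} to {{buildReader}} for each piece, so a reader that sees only part of the data still produces vectors of the full width.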
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)