Cheng Lian created SPARK-14274:
----------------------------------
Summary: Replaces inferSchema with prepareRead to collect
necessary global information
Key: SPARK-14274
URL: https://issues.apache.org/jira/browse/SPARK-14274
Project: Spark
Issue Type: Sub-task
Components: SQL
Affects Versions: 2.0.0
Reporter: Cheng Lian
Assignee: Cheng Lian
One problem with our newly introduced {{FileFormat.buildReader()}} method is that
it only sees pieces of input files. However, data sources like CSV
and LibSVM require some global information:
- CSV: the content of the header line if the {{header}} option is set to true, so
that we can filter out header lines within each input file. This counts as
global information because the header may appear in the middle of a file,
after blocks of comments and empty lines, although this is a rare and
contrived corner case.
- LibSVM: when {{numFeature}} is not set, we need to scan the whole dataset to
infer the total number of features in order to construct the resulting
{{LabeledPoint}}s.
Unfortunately, with our current API, this kind of global information can't be
gathered.
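
To make the two cases above concrete, here is a minimal Scala sketch of the
kind of whole-dataset pass each one needs. It is illustrative only and is not
Spark code: {{GlobalInfoSketch}}, {{maxFeatureIndex}} and {{findHeader}} are
hypothetical names, the LibSVM parsing assumes well-formed
"label index:value ..." lines, and {{#}} is assumed as the comment marker.

```scala
// Hypothetical helpers, not part of Spark. Each computes a value that a
// reader looking at a single slice of the input cannot derive on its own.
object GlobalInfoSketch {

  // LibSVM lines look like "<label> <index>:<value> <index>:<value> ...".
  // With 1-based indices, the feature count is the largest index that
  // appears anywhere in the dataset.
  def maxFeatureIndex(lines: Iterator[String]): Int =
    lines
      .map(line => line.trim)
      .filter(line => line.nonEmpty)
      .flatMap(line => line.split("\\s+").iterator.drop(1)) // drop the label
      .map(token => token.takeWhile(c => c != ':').toInt)   // keep the index
      .foldLeft(0)((a, b) => math.max(a, b))

  // CSV: treat the first line that is neither empty nor a comment as the
  // header. The '#' default is an assumption made for this sketch.
  def findHeader(lines: Iterator[String], comment: Char = '#'): Option[String] =
    lines.find(l => l.trim.nonEmpty && !l.startsWith(comment.toString))
}
```

A slice holding only {{1 1:0.5 3:1.0}} would report 3 features while another
slice holding {{0 2:0.1 7:2.0}} would report 7, so neither slice alone gives
the right answer for the dataset.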
The solution proposed here is to add a {{prepareRead}} method that accepts
the same arguments as {{inferSchema}} but returns a {{ReadContext}}. The
{{ReadContext}} contains an {{Option[StructType]}} for the inferred schema and
a {{Map[String, Any]}} for any gathered global information, and is then
passed to {{buildReader()}}. By default, {{prepareRead}} simply calls
{{inferSchema}} (the inferred schema itself can be considered a kind of
global information).
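
A rough Scala sketch of how the proposed pieces might fit together follows.
It is a sketch under stated assumptions, not the final API: the method
signatures are heavily simplified (the real {{inferSchema}} takes more than an
options map and a list of files), {{StructType}} is a stand-in so the example
runs without Spark, files are modelled as in-memory lines, and the
{{dataSchema}} and {{globalInfo}} field names, {{ToyCsvFormat}} and the
{{"header"}} map key are all hypothetical.

```scala
// Stand-in for org.apache.spark.sql.types.StructType (assumption).
final case class StructType(fieldNames: Seq[String])

// The two pieces of state named in the proposal; field names are assumed.
final case class ReadContext(
    dataSchema: Option[StructType],
    globalInfo: Map[String, Any])

trait FileFormat {
  // Simplified signature: each "file" is just its lines held in memory.
  def inferSchema(options: Map[String, String],
                  files: Seq[Seq[String]]): Option[StructType]

  // Default behaviour described in the proposal: only infer the schema.
  def prepareRead(options: Map[String, String],
                  files: Seq[Seq[String]]): ReadContext =
    ReadContext(inferSchema(options, files), Map.empty)

  // The per-slice reader receives the context gathered up front.
  def buildReader(context: ReadContext): Iterator[String] => Iterator[String]
}

// Toy CSV-like format that records the header line as global information.
object ToyCsvFormat extends FileFormat {
  private def firstDataLine(files: Seq[Seq[String]]): Option[String] =
    files.flatten.find(l => l.trim.nonEmpty && !l.startsWith("#"))

  def inferSchema(options: Map[String, String],
                  files: Seq[Seq[String]]): Option[StructType] =
    firstDataLine(files).map(h => StructType(h.split(",").toList))

  override def prepareRead(options: Map[String, String],
                           files: Seq[Seq[String]]): ReadContext = {
    val info: Map[String, Any] = firstDataLine(files) match {
      case Some(h) if options.get("header").contains("true") => Map("header" -> h)
      case _ => Map.empty
    }
    ReadContext(inferSchema(options, files), info)
  }

  def buildReader(context: ReadContext): Iterator[String] => Iterator[String] = {
    val headerLine = context.globalInfo.get("header")
    // Drop every line equal to the recorded header, in whichever slice it shows up.
    slice => slice.filter(line => !headerLine.contains(line))
  }
}
```

With two in-memory files that both start with {{id,name}}, calling
{{prepareRead}} once and then applying the function returned by
{{buildReader}} to each file separately drops the header from both, even
though each reader invocation only sees its own slice.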