Neal Richardson created ARROW-5747:
--------------------------------------
Summary: [C++] Better column name and header support in CSV reader
Key: ARROW-5747
URL: https://issues.apache.org/jira/browse/ARROW-5747
Project: Apache Arrow
Issue Type: Improvement
Reporter: Neal Richardson
While working on ARROW-5500, I found a number of issues around the CSV parse
option {{header_rows}}:
* If header_rows is 0, [the reader
errors|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L150]
* It's not possible to supply your own column names, as [this
TODO|https://github.com/apache/arrow/blob/8b0318a11bba2aa2cf39bff245ff916a3283d372/cpp/src/arrow/csv/reader.cc#L149]
notes. ARROW-4912 allows renaming columns after reading, which _might_ be
enough as long as header_rows == 0 doesn't error, but then you can't naturally
specify column types in the convert options, because those take a map of column
name to type.
* If header_rows is > 1, every cell gets turned into a column name, so if
header_rows == 2, you get twice as many column names as columns. This
doesn't error, but it leads to unexpected results.
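To make the last point concrete, here is a minimal sketch of the behavior as described. It is not the actual reader code: {{Row}} and {{FlattenHeaderCells}} are hypothetical names, and the function only imitates "every cell in the header rows becomes a column name".

```cpp
#include <cstddef>
#include <string>
#include <vector>

// One parsed CSV row, already split into cells (hypothetical helper type).
using Row = std::vector<std::string>;

// Hypothetical stand-in for the behavior described above: every cell in the
// first `header_rows` rows is appended to the list of column names.
std::vector<std::string> FlattenHeaderCells(const std::vector<Row>& rows,
                                            std::size_t header_rows) {
  std::vector<std::string> names;
  for (std::size_t i = 0; i < header_rows && i < rows.size(); ++i) {
    names.insert(names.end(), rows[i].begin(), rows[i].end());
  }
  return names;
}
```

With a two-column file and header_rows == 2, this returns four names for two columns, which is the mismatch described in the last bullet.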
IMO a better interface would be to have a {{skip_rows}} argument to let you
ignore a large header, and a {{column_names}} argument that, if provided, gives
the column names. If not provided, the first row after {{skip_rows}} is taken
to be the column names. I don't think there's value in trying to be clever
about multirow headers and converting those to column names; if there's
meaningful information in a tall header, let the user parse it themselves.
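A rough sketch of the proposed semantics, to make the intent clearer. Everything here is hypothetical: {{HeaderOptions}}, {{HeaderResult}} and {{ResolveHeader}} are illustrative names rather than existing Arrow APIs, the rows are assumed to be already split into cells, and the code assumes C++17 for {{std::optional}}.

```cpp
#include <cstddef>
#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// One parsed CSV row, already split into cells (hypothetical helper type).
using Row = std::vector<std::string>;

// Hypothetical options mirroring the proposal; not an existing Arrow struct.
struct HeaderOptions {
  std::size_t skip_rows = 0;                             // rows to ignore at the top
  std::optional<std::vector<std::string>> column_names;  // user-supplied names, if any
};

struct HeaderResult {
  std::vector<std::string> column_names;
  std::size_t first_data_row;  // index of the first row holding data
};

HeaderResult ResolveHeader(const std::vector<Row>& rows,
                           const HeaderOptions& options) {
  if (options.column_names.has_value()) {
    // Names were supplied: data starts right after the skipped rows.
    return {*options.column_names, options.skip_rows};
  }
  if (options.skip_rows >= rows.size()) {
    throw std::invalid_argument("no row left to read column names from");
  }
  // Otherwise the first row after the skipped ones provides the names.
  return {rows[options.skip_rows], options.skip_rows + 1};
}
```

Under this sketch, a file with no header row is handled by passing {{column_names}} with {{skip_rows}} left at 0, and a tall header is handled by skipping all of its rows and supplying the names explicitly.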