bkietz commented on a change in pull request #9725:
URL: https://github.com/apache/arrow/pull/9725#discussion_r597084271
##########
File path: cpp/src/arrow/dataset/file_csv.h
##########
@@ -38,6 +38,13 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
public:
/// Options affecting the parsing of CSV files
csv::ParseOptions parse_options = csv::ParseOptions::Defaults();
+ /// Number of header rows to skip (see arrow::csv::ReadOptions::skip_rows)
+ int32_t skip_rows = 0;
+ /// Column names for the target table (see arrow::csv::ReadOptions::column_names)
+ std::vector<std::string> column_names;
Review comment:
I'm still -0 on including the column_names option. It makes sense for a
single-file reader, but I don't think datasets need to provide two approaches
for renaming columns.
By contrast, skip_rows and autogenerate are necessary here to accommodate
the cases where files
- have non-CSV front matter, like a license comment, which we shouldn't
attempt to parse (skip_rows)
- have data in their first row, which we shouldn't interpret as column names
(autogenerate)
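To make the two cases concrete, here is a minimal, self-contained sketch. It does not use Arrow; `SketchReadOptions`, `SketchTable`, `SplitLine`, and `ReadCsvSketch` are hypothetical names made up for illustration. It assumes "autogenerate" refers to `arrow::csv::ReadOptions::autogenerate_column_names`, and it mimics the `f0`, `f1`, ... naming that option produces. Quoting and escaping are ignored.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for the two options under discussion (not Arrow's API).
struct SketchReadOptions {
  int32_t skip_rows = 0;                   // leading lines to drop unparsed
  bool autogenerate_column_names = false;  // treat the first kept row as data
};

struct SketchTable {
  std::vector<std::string> column_names;
  std::vector<std::vector<std::string>> rows;
};

// Split one line on commas. No quoting or escaping, to keep the sketch short.
std::vector<std::string> SplitLine(const std::string& line) {
  std::vector<std::string> fields;
  std::stringstream ss(line);
  std::string field;
  while (std::getline(ss, field, ',')) fields.push_back(field);
  return fields;
}

SketchTable ReadCsvSketch(const std::string& text,
                          const SketchReadOptions& options) {
  assert(options.skip_rows >= 0);
  SketchTable table;
  std::stringstream lines(text);
  std::string line;
  int32_t skipped = 0;
  bool have_names = false;
  while (std::getline(lines, line)) {
    // Case 1: front matter (e.g. a license comment) is dropped before parsing.
    if (skipped < options.skip_rows) {
      ++skipped;
      continue;
    }
    std::vector<std::string> fields = SplitLine(line);
    if (!have_names) {
      have_names = true;
      if (options.autogenerate_column_names) {
        // Case 2: the first kept row is data, so names are synthesized
        // and the row itself is still appended below.
        for (std::size_t i = 0; i < fields.size(); ++i) {
          table.column_names.push_back("f" + std::to_string(i));
        }
      } else {
        // Default: the first kept row is consumed as the header.
        table.column_names = fields;
        continue;
      }
    }
    table.rows.push_back(fields);
  }
  return table;
}
```

With `skip_rows = 2` and `autogenerate_column_names = true`, a file that starts with two comment lines followed directly by data yields synthesized names and keeps every data row. Neither outcome can be achieved by renaming columns after the fact, which is why these two options differ from `column_names`.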
@jorisvandenbossche @nealrichardson what do you think?