[GitHub] [arrow] bkietz commented on a change in pull request #9725: ARROW-8631: [C++][Python][Dataset] Add ReadOptions to CsvFileFormat, expose options to python

GitBox Thu, 18 Mar 2021 10:22:54 -0700


bkietz commented on a change in pull request #9725:
URL: https://github.com/apache/arrow/pull/9725#discussion_r597084271




##########
File path: cpp/src/arrow/dataset/file_csv.h
##########
@@ -38,6 +38,13 @@ class ARROW_DS_EXPORT CsvFileFormat : public FileFormat {
  public:
   /// Options affecting the parsing of CSV files
   csv::ParseOptions parse_options = csv::ParseOptions::Defaults();
+  /// Number of header rows to skip (see arrow::csv::ReadOptions::skip_rows)
+  int32_t skip_rows = 0;
+  /// Column names for the target table (see arrow::csv::ReadOptions::column_names)
+  std::vector<std::string> column_names;

Review comment:
       I'm still -0 on including this option. It makes sense for a single-file reader, but I don't think datasets need to provide two approaches for renaming columns.
   
   By contrast, skip_rows and autogenerate are necessary here to accommodate the cases where files
   - have non-CSV front matter, like a license comment, which we shouldn't attempt to parse
   - have data in their first row which we shouldn't interpret as column names
   
   respectively.
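   
   To make the two cases concrete, here is a minimal standard-library sketch of the intended semantics. It is not Arrow's reader: `ParsedCsv`, `SplitLine`, and `ReadCsv` are hypothetical names used only for illustration, the splitting is naive (no quoting or escaping), and the `f0`, `f1`, ... names mirror the pattern documented for arrow::csv::ReadOptions::autogenerate_column_names.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical stand-in for a parsed table; not an Arrow type.
struct ParsedCsv {
  std::vector<std::string> column_names;
  std::vector<std::vector<std::string>> rows;
};

// Naive comma split: no quoting or escaping, illustration only.
std::vector<std::string> SplitLine(const std::string& line) {
  std::vector<std::string> fields;
  std::stringstream ss(line);
  std::string field;
  while (std::getline(ss, field, ',')) fields.push_back(field);
  return fields;
}

// Sketch of the two options discussed above (assumed semantics):
// - skip_rows: discard that many leading lines without parsing them
// - autogenerate_column_names: treat the first remaining row as data and
//   name the columns "f0", "f1", ...
ParsedCsv ReadCsv(const std::string& text, int skip_rows,
                  bool autogenerate_column_names) {
  ParsedCsv out;
  std::stringstream ss(text);
  std::string line;

  // Drop front matter (e.g. a license comment) unparsed.
  for (int i = 0; i < skip_rows && std::getline(ss, line); ++i) {
  }

  bool have_names = false;
  while (std::getline(ss, line)) {
    std::vector<std::string> fields = SplitLine(line);
    if (!have_names) {
      have_names = true;
      if (!autogenerate_column_names) {
        // The first remaining row is the header, not data.
        out.column_names = fields;
        continue;
      }
      // No header row in the file: synthesize names and keep the row as data.
      for (std::size_t i = 0; i < fields.size(); ++i) {
        out.column_names.push_back("f" + std::to_string(i));
      }
    }
    out.rows.push_back(fields);
  }
  return out;
}
```

   With `skip_rows = 1`, a file that starts with a license line still yields its real header; with `autogenerate_column_names = true`, a headerless file keeps its first row as data.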
   
   @jorisvandenbossche @nealrichardson what do you think?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
