westonpace commented on a change in pull request #10270:
URL: https://github.com/apache/arrow/pull/10270#discussion_r628569404



##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -48,18 +48,28 @@ using internal::SerialExecutor;
 using RecordBatchGenerator = 
std::function<Future<std::shared_ptr<RecordBatch>>()>;
 
 Result<std::unordered_set<std::string>> GetColumnNames(
-    const csv::ParseOptions& parse_options, util::string_view first_block,
-    MemoryPool* pool) {
+    const csv::ParseOptions& parse_options, const csv::ReadOptions& 
read_options,

Review comment:
       Minor nit: I think everywhere else `read_options` comes before 
`parse_options`.

##########
File path: cpp/src/arrow/dataset/file_csv.cc
##########
@@ -104,6 +115,13 @@ static inline Result<csv::ConvertOptions> 
GetConvertOptions(
     // Properly set conversion types
     convert_options.column_types[field->name()] = field->type();
   }
+  if (convert_options.include_columns.empty()) {
+    // We know which columns we want - so if none were specified, override the 
default
+    // behavior of reading all columns by reading a missing fake null column
+    convert_options.include_missing_columns = true;
+    convert_options.include_columns.push_back("fabricated");

Review comment:
       It's probably rare but could this cause an issue if the CSV happened to 
have a column named `fabricated`?  Although I suppose it'll be addressed by the 
newline scanning fast implementation.

##########
File path: cpp/src/arrow/dataset/file_csv_test.cc
##########
@@ -182,6 +183,18 @@ bar)");
     }
     ASSERT_EQ(rows, 5);
   }
+  {
+    SetSchema({field("custom_header", utf8())});
+    defaults->read_options.skip_rows = 0;

Review comment:
       0 is the default for `skip_rows`.  If you specify `column_names` then 
the CSV reader will automatically assume there is no header so you shouldn't 
have to manually set it to 0.  Maybe add a test case where `skip_rows` is 
non-zero to ensure you are passing it down correctly?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Reply via email to