westonpace commented on issue #35756:
URL: https://github.com/apache/arrow/issues/35756#issuecomment-1585085230

   I had an R environment setup today and so I debugged this for a bit.  
@thisisnic , I had, offline, pointed you at 
https://github.com/apache/arrow/blob/8b2ab4d8200fdd414b85d531f1cef4b58a3ce351/cpp/src/arrow/csv/reader.cc#L602
 as the point where we handle autogenerated column names.
   
   However, I had forgotten that we also (rather embarassingly :) have a 
duplicated copy of this logic in the datasets module here: 
https://github.com/apache/arrow/blob/8b2ab4d8200fdd414b85d531f1cef4b58a3ce351/cpp/src/arrow/dataset/file_csv.cc#L179
   
   The logic in the dataset module is slightly different than the logic in the 
reader module.  It is (omitting some stuff):
   
   ```
     int32_t max_num_rows = read_options.skip_rows + 1;
     csv::BlockParser parser(pool, parse_options, /*num_cols=*/-1, 
/*first_row=*/1,
                             max_num_rows);
   
     RETURN_NOT_OK(parser.Parse(std::string_view{first_block}, &parsed_size));
   
     if (read_options.autogenerate_column_names) {
       column_names.reserve(parser.num_cols());
   ```
   
   So we give the skipped rows to the parser (this is different than the 
reader.cc logic where we skip the rows outside the parser and then only give 
the first non-skipped row to the parser).
   
   When we are not auto-generating column names then I think we kind of get 
away with it because we call `parser.VisitLastRow` which only really depends on 
the contents of the last row.
   
   On the other hand, if the user is asking to autogenerate the column names we 
use `parser.num_cols()`.  This calculation is based on the first row the parser 
sees!
   
   I'm attaching a very clumsy sketch of a fix (which I verified works in OPs 
reprex) that just copy/pastes code from the reader so we can handle skipping in 
the exact same way.  This sketch is also missing unit tests.
   
   I think a good long-term fix would be to eliminate this duplicate path 
entirely.  This could be done by adding an "inspect" method (or PeekMetadata or 
GetColumnNames or something) to the CSV reader.  Then the datasets API could 
use that instead of re-inventing the wheel.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to