TatianaJin commented on issue #38676:
URL: https://github.com/apache/arrow/issues/38676#issuecomment-2309601610

   > Ok, so two problems here.
   > 
   > First: "cannot infer number of columns". This is because the file has no newline at all. If you add a newline at the end, the error disappears. I wonder if such files exist in the wild, but would be good to add support for them.
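
As a stopgap while the reader does not handle this case, the observation quoted above suggests making sure the CSV text ends with a newline before it reaches the reader. A minimal sketch, assuming the data is available as a `std::string`; `EnsureTrailingNewline` is a hypothetical helper, not part of Arrow:

```cpp
#include <string>

// Hypothetical helper (not an Arrow API): returns the CSV text with a
// trailing '\n' appended when the last row is unterminated.
std::string EnsureTrailingNewline(std::string csv_text) {
  if (!csv_text.empty() && csv_text.back() != '\n') {
    csv_text.push_back('\n');
  }
  return csv_text;
}
```

For the input `"a,b\n0,1"` this returns `"a,b\n0,1\n"`. It only avoids the symptom on the caller's side and does not address the reader itself.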
   
   This issue still exists in `apache-arrow-15.0.0` (I built Arrow from source using this tag). The following code should reproduce the bug.
   
   ```cpp
   #include <fstream>
   #include <iostream>
   #include <memory>
   #include <string>
   
   #include <arrow/csv/api.h>
   #include <arrow/filesystem/localfs.h>
   
   int main() {
     auto csv_file = "CSVReaderTest.csv";
     {  // generate test file
       std::ofstream ostream(csv_file);
       std::string data = "a,b\n0,1";
       // no new line at the end
       ostream.write(data.data(), data.size());
       ostream.close();
     }
   
     // options
     auto read_options = arrow::csv::ReadOptions::Defaults();
     // Skip the header row: the file has column names, but we want to
     // generate column names by index.
     read_options.skip_rows = 1;
     read_options.autogenerate_column_names = true;
     auto parse_options = arrow::csv::ParseOptions::Defaults();
     auto convert_options = arrow::csv::ConvertOptions::Defaults();
   
     auto arrow_fs = std::make_shared<::arrow::fs::LocalFileSystem>();
     auto random_access_file = arrow_fs->OpenInputFile(csv_file).ValueOrDie();
   
     // die on this statement
     auto record_batch_reader =
         arrow::csv::StreamingReader::Make(arrow::io::default_io_context(),
                                           random_access_file, read_options,
                                           parse_options, convert_options)
             .ValueOrDie();
   
     std::cout << record_batch_reader->ToTable().ValueOrDie()->ToString()
               << std::endl;
     return 0;
   }
   ```
   The output looks like this:
   
![image](https://github.com/user-attachments/assets/29631646-c5e5-4dfc-a783-86c71ebd0364)
   
   I think the problem might be here in `ProcessHeader` (I tried to read through the code, but I am still new to it):
   
https://github.com/apache/arrow/blob/51e9f70f94cd09a0a08196afdd2f4fc644666b5e/cpp/src/arrow/csv/reader.cc#L609
   
   The block is actually the final one in this case, but it is parsed by calling `Parse`, which implies that `is_final` is false. The only data row has no trailing newline, so it is treated as incomplete and discarded, and we get the error `cannot infer number of columns`.
   
   
https://github.com/apache/arrow/blob/a61f4af724cd06c3a9b4abd20491345997e532c0/cpp/src/arrow/csv/parser.cc#L403
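
To illustrate the suspected mechanism, here is a toy model of a chunked row parser. It is not Arrow's implementation; `CountParsedRows` and its parameters are made up for this sketch. It only shows how an unterminated last row can disappear when the block is not flagged as final:

```cpp
#include <cstddef>
#include <string_view>

// Toy model, not Arrow's parser: counts the rows a chunked parser would emit
// for one block. A trailing row with no '\n' is only emitted when is_final is
// true; otherwise it is held back in the expectation that more data follows.
std::size_t CountParsedRows(std::string_view block, bool is_final) {
  std::size_t rows = 0;
  std::size_t row_start = 0;
  for (std::size_t i = 0; i < block.size(); ++i) {
    if (block[i] == '\n') {
      ++rows;
      row_start = i + 1;
    }
  }
  // Bytes after the last newline form an unterminated row.
  if (is_final && row_start < block.size()) {
    ++rows;
  }
  return rows;
}
```

In this model, after the header is skipped the remaining block is `"0,1"`: with `is_final == false` no rows are produced, so there is nothing to infer the column count from, while with `is_final == true` the row is kept.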
   
   @jorisvandenbossche Please help look into this. Thanks!

