arnscott opened a new issue, #14568:
URL: https://github.com/apache/arrow/issues/14568
I have a piece of C++ software that reads and writes CSV/Parquet files, and it
works very well for the most part. I can parse files up to a few GB without
trouble, but once I try to parse a file that is around 37 GB, some columns of
the CSV file are not read in completely, even though the reader raises no
exception. I can read the same file fine with the Python bindings, but I cannot
figure out why two string columns come back incomplete in C++.
Here is the code:
```
arrow::SetCpuThreadPoolCapacity(threads);
auto memory_pool = arrow::default_memory_pool();

// Open the tab-delimited CSV file.
std::shared_ptr<arrow::io::ReadableFile> infile;
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(library_path));

arrow::csv::ParseOptions parse_options = arrow::csv::ParseOptions::Defaults();
parse_options.delimiter = '\t';
arrow::csv::ReadOptions read_options = arrow::csv::ReadOptions::Defaults();
read_options.use_threads = true;

std::cout << "Parsing table." << std::endl;
arrow::io::IOContext io_context = arrow::io::IOContext(memory_pool);
ARROW_ASSIGN_OR_RAISE(auto csv_reader, arrow::csv::TableReader::Make(
    io_context,
    infile,
    read_options,
    parse_options,
    arrow::csv::ConvertOptions::Defaults()));

// Read the whole file into an arrow::Table (each column is a ChunkedArray).
ARROW_ASSIGN_OR_RAISE(auto csv_table, csv_reader->Read());
std::vector<std::string> column_names = csv_table->ColumnNames();

// Concatenate the chunks of every column into a single RecordBatch.
std::cout << "Combining chunks." << std::endl;
ARROW_ASSIGN_OR_RAISE(auto combined_chunks,
                      csv_table->CombineChunksToBatch(memory_pool));
std::cout << combined_chunks->num_rows() << std::endl;

std::cout << "Setting arrays." << std::endl;
auto precursor_mzs = std::static_pointer_cast<arrow::DoubleArray>(
    combined_chunks->GetColumnByName("PrecursorMz"));
auto peptide_sequences = std::static_pointer_cast<arrow::StringArray>(
    combined_chunks->GetColumnByName("PeptideSequence"));
```
The number of rows in the table is 224314905, but the length of
peptide_sequences is only 148056458, while the PrecursorMz column has the
correct length.
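To narrow down where the rows go missing, here is a small diagnostic sketch I can add (continuing from the snippet above; the column names are the ones from my file). It compares the column length in the Table against the array length in the combined RecordBatch:
```
// Length of the PeptideSequence column straight out of the CSV reader
// (still a ChunkedArray inside the Table).
auto peptide_chunked = csv_table->GetColumnByName("PeptideSequence");
std::cout << "table rows: " << csv_table->num_rows()
          << ", chunked length: " << peptide_chunked->length()
          << ", num chunks: " << peptide_chunked->num_chunks() << std::endl;

// Length of the same column after CombineChunksToBatch (a single Array).
auto peptide_array = combined_chunks->GetColumnByName("PeptideSequence");
std::cout << "batch rows: " << combined_chunks->num_rows()
          << ", combined length: " << peptide_array->length() << std::endl;
```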
Is this something to do with the memory pool? I don't see this problem when
the data set only has a few million to 10 million rows.
I thought the chunks in each ChunkedArray would be concatenated by
CombineChunksToBatch, but that does not seem to be happening with my
implementation here.
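One guess, in case it is relevant: as far as I understand, a plain utf8 StringArray uses 32-bit offsets, so a single array can only hold about 2 GiB of character data. If that limit is what the combined PeptideSequence column is running into, one thing I could try is forcing the string columns to large_utf8 at read time. A rough sketch, assuming the column names from my file (the second string column is elided):
```
// Read the long string column(s) as large_utf8 (64-bit offsets) so that
// concatenating them into a single array is not capped at ~2 GiB of data.
arrow::csv::ConvertOptions convert_options = arrow::csv::ConvertOptions::Defaults();
convert_options.column_types["PeptideSequence"] = arrow::large_utf8();
// ... same for the other string column ...

ARROW_ASSIGN_OR_RAISE(auto csv_reader, arrow::csv::TableReader::Make(
    io_context, infile, read_options, parse_options, convert_options));
```
With that change the cast at the end would need to target arrow::LargeStringArray rather than arrow::StringArray.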
I am converting this table into row data to be processed by other methods,
which is why I am creating those arrays at the end; the sketch below shows
roughly the kind of row-wise access I am after.
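For reference, a minimal sketch of that row-wise access, iterating over the ChunkedArray chunks directly instead of combining into one RecordBatch (this assumes the two columns share the same chunk layout, which should be the case for a table coming straight out of the CSV reader):
```
auto mz_chunked = csv_table->GetColumnByName("PrecursorMz");
auto seq_chunked = csv_table->GetColumnByName("PeptideSequence");
for (int c = 0; c < seq_chunked->num_chunks(); ++c) {
  auto mz_chunk = std::static_pointer_cast<arrow::DoubleArray>(mz_chunked->chunk(c));
  auto seq_chunk = std::static_pointer_cast<arrow::StringArray>(seq_chunked->chunk(c));
  for (int64_t i = 0; i < seq_chunk->length(); ++i) {
    double precursor_mz = mz_chunk->Value(i);
    std::string peptide_sequence = seq_chunk->GetString(i);
    // ... hand (precursor_mz, peptide_sequence) to the row-based processing ...
  }
}
```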
Any helpful pointers as to what is going on would be much appreciated!