arnscott opened a new issue, #14568:
URL: https://github.com/apache/arrow/issues/14568
I have a piece of C++ software that reads and writes CSV/Parquet files, and it
works very well for the most part. I can parse files up to a few GB without
trouble, but once I try to parse a file that is around 37 GB, some columns of
the CSV file are not read in completely, even though the reader raises no
exception. I can read the same file fine with the Python bindings, but I cannot
figure out why two string columns come back incomplete in C++.
Here is the code:
```
arrow::SetCpuThreadPoolCapacity(threads);
auto memory_pool = arrow::default_memory_pool();

// Open the tab-delimited CSV file.
std::shared_ptr<arrow::io::ReadableFile> infile;
ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(library_path));

arrow::csv::ParseOptions parse_options = arrow::csv::ParseOptions::Defaults();
parse_options.delimiter = '\t';
arrow::csv::ReadOptions read_options = arrow::csv::ReadOptions::Defaults();
read_options.use_threads = true;

std::cout << "Parsing table." << std::endl;
arrow::io::IOContext io_context = arrow::io::IOContext(memory_pool);
ARROW_ASSIGN_OR_RAISE(auto csv_reader, arrow::csv::TableReader::Make(
    io_context,
    infile,
    read_options,
    parse_options,
    arrow::csv::ConvertOptions::Defaults()));

// Read the whole file into an arrow::Table (each column is a ChunkedArray).
ARROW_ASSIGN_OR_RAISE(auto csv_table, csv_reader->Read());
std::vector<std::string> column_names = csv_table->ColumnNames();

// Concatenate the chunks of every column into a single RecordBatch.
std::cout << "Combining chunks." << std::endl;
ARROW_ASSIGN_OR_RAISE(auto combined_chunks,
                      csv_table->CombineChunksToBatch(memory_pool));
std::cout << combined_chunks->num_rows() << std::endl;

std::cout << "Setting arrays." << std::endl;
auto precursor_mzs = std::static_pointer_cast<arrow::DoubleArray>(
    combined_chunks->GetColumnByName("PrecursorMz"));
auto peptide_sequences = std::static_pointer_cast<arrow::StringArray>(
    combined_chunks->GetColumnByName("PeptideSequence"));
```
The number of rows in the table is 224314905, but the length of
peptide_sequences is only 148056458, while the PrecursorMz column has the
correct length.
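To narrow down where the rows go missing, here is a small diagnostic sketch I can add (continuing from the snippet above; the column names are the ones from my file). It compares the column length in the Table against the array length in the combined RecordBatch:
```
// Length of the PeptideSequence column straight out of the CSV reader
// (still a ChunkedArray inside the Table).
auto peptide_chunked = csv_table->GetColumnByName("PeptideSequence");
std::cout << "table rows: " << csv_table->num_rows()
          << ", chunked length: " << peptide_chunked->length()
          << ", num chunks: " << peptide_chunked->num_chunks() << std::endl;

// Length of the same column after CombineChunksToBatch (a single Array).
auto peptide_array = combined_chunks->GetColumnByName("PeptideSequence");
std::cout << "batch rows: " << combined_chunks->num_rows()
          << ", combined length: " << peptide_array->length() << std::endl;
```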
Is this something to do with the memory pool? I don't see this problem when
the data set only has a few million to 10 million rows.
I thought the chunks in each ChunkedArray would be concatenated by
CombineChunksToBatch, but that does not seem to be happening with my
implementation here.
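One guess, in case it is relevant: as far as I understand, a plain utf8 StringArray uses 32-bit offsets, so a single array can only hold about 2 GiB of character data. If that limit is what the combined PeptideSequence column is running into, one thing I could try is forcing the string columns to large_utf8 at read time. A rough sketch, assuming the column names from my file (the second string column is elided):
```
// Read the long string column(s) as large_utf8 (64-bit offsets) so that
// concatenating them into a single array is not capped at ~2 GiB of data.
arrow::csv::ConvertOptions convert_options = arrow::csv::ConvertOptions::Defaults();
convert_options.column_types["PeptideSequence"] = arrow::large_utf8();
// ... same for the other string column ...

ARROW_ASSIGN_OR_RAISE(auto csv_reader, arrow::csv::TableReader::Make(
    io_context, infile, read_options, parse_options, convert_options));
```
With that change the cast at the end would need to target arrow::LargeStringArray rather than arrow::StringArray.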
I am converting this table into row data to be processed by other methods,
which is why I am creating those arrays at the end; the sketch below shows
roughly the kind of row-wise access I am after.
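For reference, a minimal sketch of that row-wise access, iterating over the ChunkedArray chunks directly instead of combining into one RecordBatch (this assumes the two columns share the same chunk layout, which should be the case for a table coming straight out of the CSV reader):
```
auto mz_chunked = csv_table->GetColumnByName("PrecursorMz");
auto seq_chunked = csv_table->GetColumnByName("PeptideSequence");
for (int c = 0; c < seq_chunked->num_chunks(); ++c) {
  auto mz_chunk = std::static_pointer_cast<arrow::DoubleArray>(mz_chunked->chunk(c));
  auto seq_chunk = std::static_pointer_cast<arrow::StringArray>(seq_chunked->chunk(c));
  for (int64_t i = 0; i < seq_chunk->length(); ++i) {
    double precursor_mz = mz_chunk->Value(i);
    std::string peptide_sequence = seq_chunk->GetString(i);
    // ... hand (precursor_mz, peptide_sequence) to the row-based processing ...
  }
}
```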
Any helpful pointers as to what is going on would be much appreciated!