C++ parquet::TypedColumnReader::ReadBatchSpaced() replacement?

Adam Hooper Tue, 20 Jul 2021 11:00:54 -0700

Hi list,

Updating some code to Arrow 4.0, I noticed
https://issues.apache.org/jira/browse/PARQUET-1899 deprecated
parquet::TypedColumnReader<T>::ReadBatchSpaced().


I use this function in a parquet-to-csv converter. It reads batches of
1,000 values at a time, allowing nulls. Calling ReadBatchSpaced() in a
loop is faster than reading an entire record batch. It is also more
RAM-friendly: the program uses only a few megabytes, regardless of
Parquet file size. I've spawned hundreds of concurrent parquet-to-csv
processes, streaming to slow clients via Python+ASGI, with response times
in the milliseconds. I documented my findings in a code comment:
https://github.com/CJWorkbench/parquet-to-arrow/blob/70253c7fdf0fc778e51f50b992c98b16e8864723/src/parquet-to-text-stream.cc#L73

As I understand it, the function is deprecated because it has bugs
concerning nested values. These bugs didn't affect me because I don't use
nested values.

Does the C++ parquet reader support reading a batch of values and their
validity bitmap?
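
To make the question concrete, here is a minimal sketch of the behaviour
I mean, written against plain std::vector rather than the Parquet API. It
assumes a flat (non-nested) optional column, where, as I understand it,
ReadBatch() yields one definition level per row plus the non-null values
packed densely. SpacedBatch and SpaceValues are made-up names for
illustration, not library symbols.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical container for one batch; not part of the Parquet API.
struct SpacedBatch {
  std::vector<int32_t> values;    // one slot per row; null slots hold 0
  std::vector<uint8_t> validity;  // LSB-first bitmap: bit i set => row i is non-null
  int64_t null_count = 0;
};

// Expand densely packed non-null values into one slot per row.
// Assumption: for a flat optional column, a row is non-null exactly when
// its definition level equals max_def_level.
SpacedBatch SpaceValues(const std::vector<int16_t>& def_levels,
                        const std::vector<int32_t>& dense_values,
                        int16_t max_def_level) {
  SpacedBatch out;
  out.values.assign(def_levels.size(), 0);
  out.validity.assign((def_levels.size() + 7) / 8, 0);

  std::size_t next_dense = 0;  // index of the next unread dense value
  for (std::size_t i = 0; i < def_levels.size(); ++i) {
    if (def_levels[i] == max_def_level) {
      // There must be one dense value for every non-null row.
      assert(next_dense < dense_values.size());
      out.values[i] = dense_values[next_dense++];
      out.validity[i / 8] |= static_cast<uint8_t>(1u << (i % 8));
    } else {
      ++out.null_count;
    }
  }
  return out;
}
```

Given definition levels {1, 0, 1, 1, 0}, dense values {10, 20, 30}, and a
max definition level of 1, rows 0, 2, and 3 receive the three values and
rows 1 and 4 stay null. Nested columns would also need repetition levels,
which this sketch deliberately ignores.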

Enjoy life,
Adam

-- 
Adam Hooper
+1-514-882-9694
http://adamhooper.com
