Hi list,

Updating some code to Arrow 4.0, I noticed https://issues.apache.org/jira/browse/PARQUET-1899 deprecated parquet::TypedColumnReader<T>::ReadBatchSpaced().
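For anyone unfamiliar with the "spaced" layout, here is a minimal sketch of the result I mean: one value slot per row, plus a validity bitmap. It does not call Arrow at all. SpacedBatch and SpaceOut are names I made up for illustration, and it assumes a flat OPTIONAL int32 column (maximum definition level 1), where the definition levels and the dense non-null values are what ReadBatch() fills in. It is not meant as the recommended replacement, only as a picture of the output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// One slot per row, plus an Arrow-style validity bitmap (LSB-first).
// Hypothetical type, not part of Arrow or parquet-cpp.
struct SpacedBatch {
  std::vector<int32_t> values;    // null rows hold 0
  std::vector<uint8_t> validity;  // bit i set => row i is non-null
  std::size_t null_count = 0;
};

// Hypothetical helper, not part of Arrow or parquet-cpp.
// def_levels:   one definition level per row.
// dense_values: only the non-null values, in row order.
// Assumes a flat OPTIONAL column, so a row is non-null exactly when its
// definition level equals max_def_level (1 by default).
SpacedBatch SpaceOut(const std::vector<int16_t>& def_levels,
                     const std::vector<int32_t>& dense_values,
                     int16_t max_def_level = 1) {
  SpacedBatch out;
  out.values.assign(def_levels.size(), 0);
  out.validity.assign((def_levels.size() + 7) / 8, 0);

  std::size_t next = 0;  // index of the next unused dense value
  for (std::size_t row = 0; row < def_levels.size(); ++row) {
    if (def_levels[row] == max_def_level) {
      assert(next < dense_values.size());
      out.values[row] = dense_values[next++];
      out.validity[row / 8] |= static_cast<uint8_t>(1u << (row % 8));
    } else {
      ++out.null_count;
    }
  }
  // Every dense value must have been placed in some row.
  assert(next == dense_values.size());
  return out;
}
```

For example, definition levels {1, 0, 1, 1, 0} with dense values {10, 20, 30} give five slots with rows 1 and 4 null, and a one-byte bitmap with bits 0, 2 and 3 set.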
I use this function in a parquet-to-csv converter. It reads batches of 1,000 values at a time, allowing nulls. Calling ReadBatchSpaced() in a loop is faster than reading an entire record batch. It's also more RAM-friendly: the program uses only a few megabytes, regardless of Parquet file size. I've spawned hundreds of concurrent parquet-to-csv processes, streaming to slow clients via Python+ASGI, with response times in the milliseconds.

I documented my findings in a code comment: https://github.com/CJWorkbench/parquet-to-arrow/blob/70253c7fdf0fc778e51f50b992c98b16e8864723/src/parquet-to-text-stream.cc#L73

As I understand it, the function is deprecated because it has bugs concerning nested values. These bugs didn't affect me because I don't use nested values.

Does the C++ parquet reader still support reading a batch of values together with their validity bitmap?

Enjoy life,
Adam

--
Adam Hooper
+1-514-882-9694
http://adamhooper.com