lesterfan opened a new pull request, #46103:
URL: https://github.com/apache/arrow/pull/46103

   ### Rationale for this change
   This PR implements direct reads of Parquet RLE (Run Length Encoded) data 
into Arrow REE (Run End Encoded) in-memory representation as described in 
https://github.com/apache/arrow/issues/32339, using the existing 
`read_dictionary` API as inspiration for interface-level changes. Like 
`read_dictionary`, this feature is only supported for columns with a Parquet 
physical type of `BYTE_ARRAY`, such as string or binary types.
   
   Let me know if you want me to break this change up; it wasn't immediately 
clear to me how to do this. Additionally, I included some other currently open 
PRs as commits in this one as this feature depends on those bug fixes (see the 
first bullet below for more details). I plan to rebase my branch off main as 
they get reviewed/merged.
   
   Regarding performance, I am anecdotally observing an order-of-magnitude 
(i.e. ~10x) speedup when reading columns that contain many repeated values, and 
a slight performance degradation for columns that contain only unique values, 
when this feature is enabled. This aligns with my expectations.
   
   ### What changes are included in this PR?
   Concretely, the following changes are included in this PR:
   1. Cherry-pick https://github.com/apache/arrow/pull/45533 and 
https://github.com/apache/arrow/pull/45535, which implement bug fixes needed by 
the changes below (as mentioned above, I plan to rebase my branch off main as 
they get reviewed/merged).
   1. Implement `RleDecoder::GetNextRun` and `RleDecoder::GetNextRunSpaced` 
which directly read the next run from the Parquet RLE encoded data into memory 
without expansion.
       1. I spent some time trying to add unit test coverage here similar to 
the tests in `rle_encoding_test.cc`, but ran into 
https://github.com/apache/arrow/issues/46094. Note that the implementations of 
`RleDecoder::GetNextRun` and `RleDecoder::GetNextRunSpaced` are heavily based 
on the existing `RleDecoder::Get` and `RleDecoder::GetBatch`.
   1. Implement the `ByteArrayReeRecordReader` which inherits from a new 
`ReeRecordReader` abstract interface. 
   1. Add a `read_ree` option to `ArrowReaderProperties`, similar to 
`read_dictionary`, which delegates individual record handling to the 
`ByteArrayReeRecordReader` implemented above when a column is configured to be 
read directly into Arrow REE representation. As with `read_dictionary`, this is 
only supported for columns with a Parquet physical type of `BYTE_ARRAY`, such 
as string or binary types.
   1. Implement a new `DecodeArrow` interface for `PlainByteArrayDecoder`s to 
write to `ReeAccumulator`s. This is called from the `ByteArrayReeRecordReader` 
and uses the above `RleDecoder::GetNextRun` and `RleDecoder::GetNextRunSpaced` 
to directly read Parquet RLE dictionary indices into an Arrow 
`RunEndEncodedBuilder` through a `ReeBuilderHelper` utility class.
   1. Implement tests in C++ for the above changes similar to those testing 
`read_dictionary`.
   1. Implement pyarrow stubs for the above changes and add pytests similar to 
those testing `read_dictionary`.
   
   ### Are these changes tested?
   Yes, through included C++ unit tests and pytests.
   
   ### Are there any user-facing changes?
   Yes; this PR implements a new user-facing feature of the `FileReader`.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
