arthurpassos commented on code in PR #35825:
URL: https://github.com/apache/arrow/pull/35825#discussion_r1230113848


##########
cpp/src/parquet/arrow/reader.cc:
##########
@@ -462,7 +463,8 @@ class LeafReader : public ColumnReaderImpl {
         input_(std::move(input)),
         descr_(input_->descr()) {
     record_reader_ = RecordReader::Make(

Review Comment:
   I have tried to connect a few dots and read a good portion of the code; here's my current understanding of the architecture. I will probably dig more into it tomorrow.
   
   `FileReader` is, as far as I understand, the broader Arrow-facing interface. It allows the user to read entire columns or tables, and also to obtain lower-level readers.
   
   `FileReaderImpl` implements that functionality. One of its methods is `GetFieldReader`, which, in my naive assumption, is used by all of its other APIs to obtain the typed reader. The typed reader resolves either to a `LeafReader` or to a special struct reader that contains `LeafReader`s; that decision is based on whether the parquet/arrow type is special or not.
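   To make sure I have the dispatch right, here is a hypothetical, heavily simplified sketch of what I think `GetFieldReader` does: a factory that returns a leaf reader for flat columns and a struct reader (owning leaf readers) for nested ones. The class and function names here are illustrative stand-ins, not the actual Arrow classes.

   ```cpp
   #include <cassert>
   #include <memory>
   #include <vector>

   // Illustrative stand-ins for the real reader hierarchy.
   struct ColumnReaderImpl {
     virtual ~ColumnReaderImpl() = default;
     virtual bool IsLeaf() const = 0;
   };

   struct LeafReader : ColumnReaderImpl {
     bool IsLeaf() const override { return true; }
   };

   struct StructReader : ColumnReaderImpl {
     std::vector<std::unique_ptr<ColumnReaderImpl>> children;
     bool IsLeaf() const override { return false; }
   };

   // Hypothetical GetFieldReader-like factory: dispatch on whether the
   // field is nested (has children) or flat.
   std::unique_ptr<ColumnReaderImpl> GetFieldReader(int num_children) {
     if (num_children == 0) return std::make_unique<LeafReader>();
     auto reader = std::make_unique<StructReader>();
     for (int i = 0; i < num_children; ++i)
       reader->children.push_back(std::make_unique<LeafReader>());
     return reader;
   }

   int main() {
     assert(GetFieldReader(0)->IsLeaf());
     auto nested = GetFieldReader(2);
     assert(!nested->IsLeaf());
     assert(static_cast<StructReader*>(nested.get())->children.size() == 2);
     return 0;
   }
   ```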
   
   This field reader is probably responsible for parsing Parquet-specific structures; I will get into it later. `FieldReader` is an implementation of `ColumnReaderImpl`.
   
   `ColumnReaderImpl` implements `ColumnReader`, which according to the docs is just a stream reader; it probably holds state and reads n bytes at a time. The `ByteArrayRecordReader` can be either a dictionary reader or a non-dictionary reader; let me focus on the non-dictionary reader for now. It of course inherits from `TypedRecordReader`, a class that implements methods like `ReadBatch`. `ReadBatch` depends on a `TypedDecoder` to read the values, which is resolved via `MakeTypedDecoder`; in this case, `PlainByteArrayDecoder`. Finally, `PlainByteArrayDecoder` is the class that reads the values with the help of `ArrowBinaryHelperBase`. `PlainByteArrayDecoder` explicitly uses 32-bit integers, which I assume is because this won't change between the large / non-large variants, since Parquet itself supports only 32-bit string lengths.
   
   I believe `ArrowBinaryHelperBase` is used to store those values in an Arrow format, which means at that point we are no longer dealing with Parquet. It then proceeds to use the 64-bit / large variants according to traits, aside from the dictionary accumulator.
   
   If it uses the dense binary read path of `PlainByteArrayDecoderBase::DecodeArrow` (whatever that means), it'll use a `LargeBinaryBuilder` with 64-bit offsets. I have sanity-checked it and it looks ok. If it uses the dictionary read path of `PlainByteArrayDecoderBase::DecodeArrow`, it'll use the `Dictionary32Builder`; in that case, the available interface is of course 32-bit. That also seems to be ok because of the previous discussion: Parquet does not allow a single string that big, and we are not expecting a dictionary to contain that many distinct strings.
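   A back-of-the-envelope illustration of why the dense path wants 64-bit offsets at all (illustrative arithmetic, not Arrow code): a binary array stores one end-offset per value, so with `int32_t` offsets the *total* data size is capped at ~2 GiB, even though every individual Parquet value fits comfortably in 32 bits.

   ```cpp
   #include <cassert>
   #include <cstdint>
   #include <vector>

   // Returns true if the running end-offsets for these value lengths all
   // fit in int32_t, i.e. a 32-bit-offset binary array could hold the data.
   bool FitsInt32Offsets(const std::vector<int64_t>& value_lengths) {
     int64_t total = 0;
     for (int64_t len : value_lengths) {
       total += len;
       if (total > INT32_MAX) return false;  // offset would overflow int32_t
     }
     return true;
   }

   int main() {
     // Each 1 GiB value fits a 32-bit length on its own, but the running
     // offset overflows once the total passes 2^31 - 1 bytes; that is when
     // the LargeBinary (64-bit offset) layout becomes necessary.
     const int64_t one_gib = 1LL << 30;
     assert(FitsInt32Offsets({one_gib}));
     assert(!FitsInt32Offsets({one_gib, one_gib, one_gib}));
     return 0;
   }
   ```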



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
