mapleFU opened a new issue, #41104: URL: https://github.com/apache/arrow/issues/41104
### Describe the enhancement requested

A previous issue ( https://github.com/apache/arrow/pull/35825 ) showed that reading large binary data directly as a dictionary is not supported. When writing to Parquet, we do not allow a single `ByteArray` to exceed 2GB, so every individual binary value is smaller than 2GB. The Parquet binary reader is split into two styles of API, shown below:

```c++
class BinaryRecordReader : virtual public RecordReader {
 public:
  virtual std::vector<std::shared_ptr<::arrow::Array>> GetBuilderChunks() = 0;
};

/// \brief Read records directly to dictionary-encoded Arrow form (int32
/// indices). Only valid for BYTE_ARRAY columns
class DictionaryRecordReader : virtual public RecordReader {
 public:
  virtual std::shared_ptr<::arrow::ChunkedArray> GetResult() = 0;
};
```

Neither of these APIs supports reading `LargeBinary` directly. However, the first API can split the data into multiple chunks: when a `BinaryBuilder` reaches 2GB, it rotates and switches to a new builder. The API below can then cast the resulting chunks to segments of large binary:

```c++
Status TransferColumnData(RecordReader* reader, const std::shared_ptr<Field>& value_field,
                          const ColumnDescriptor* descr, MemoryPool* pool,
                          std::shared_ptr<ChunkedArray>* out)
```

For `Dictionary`, although the API returns a `std::shared_ptr<::arrow::ChunkedArray>`, only one dictionary builder is ever used. I think we can apply the same chunk-rotation approach there.

Pros: we can read more than 2GB of data into a dictionary column.
Cons: dictionary values might be repeated across the dictionaries of different chunks. The user may need to call "Concat" on the result (see the sketch at the end of this issue).

### Component(s)

C++, Parquet
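For illustration, here is a minimal consumer-side sketch of the "Concat" step mentioned above. `ConcatDictionaryChunks` is a hypothetical helper, not an existing Arrow/Parquet API, and it assumes the `arrow::Concatenate` in the Arrow version at hand can unify dictionary chunks whose dictionaries differ:

```c++
// Hypothetical sketch: merge the ChunkedArray the dictionary reader would
// return into a single dictionary array, unifying the per-chunk dictionaries.
#include <arrow/api.h>
#include <arrow/array/concatenate.h>

arrow::Result<std::shared_ptr<arrow::Array>> ConcatDictionaryChunks(
    const std::shared_ptr<arrow::ChunkedArray>& chunked,
    arrow::MemoryPool* pool = arrow::default_memory_pool()) {
  // Concatenation can only succeed if the unified dictionary still fits the
  // result type's limits (e.g. the merged binary dictionary must stay below
  // 2GB for int32 offsets); otherwise the caller keeps the chunked form.
  return arrow::Concatenate(chunked->chunks(), pool);
}
```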
