mapleFU opened a new issue, #41104: URL: https://github.com/apache/arrow/issues/41104
### Describe the enhancement requested

A previous issue ( https://github.com/apache/arrow/pull/35825 ) showed that reading large binary data directly as a dictionary is not supported. When writing to Parquet, we do not allow a single `ByteArray` to exceed 2GB, so every individual binary value is smaller than 2GB. The Parquet binary reader is split into two styles of API, shown below:

```c++
class BinaryRecordReader : virtual public RecordReader {
 public:
  virtual std::vector<std::shared_ptr<::arrow::Array>> GetBuilderChunks() = 0;
};

/// \brief Read records directly to dictionary-encoded Arrow form (int32
/// indices). Only valid for BYTE_ARRAY columns
class DictionaryRecordReader : virtual public RecordReader {
 public:
  virtual std::shared_ptr<::arrow::ChunkedArray> GetResult() = 0;
};
```

Neither of these APIs supports reading `LargeBinary` directly. However, the first API can split the data into multiple chunks: when a `BinaryBuilder` reaches 2GB, it rotates and switches to a new builder. The API below can then cast the resulting chunks to segments of large binary:

```c++
Status TransferColumnData(RecordReader* reader, const std::shared_ptr<Field>& value_field,
                          const ColumnDescriptor* descr, MemoryPool* pool,
                          std::shared_ptr<ChunkedArray>* out)
```

For `Dictionary`, although the API returns a `std::shared_ptr<::arrow::ChunkedArray>`, only one dictionary builder is ever used. I think we can apply the same chunk-rotation approach there.

Pros: we can read more than 2GB of data into a dictionary column.
Cons: dictionary values might be repeated across the dictionaries of different chunks. The user may need to call "Concat" on the result (see the sketch at the end of this issue).

### Component(s)

C++, Parquet
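For illustration, here is a minimal consumer-side sketch of the "Concat" step mentioned above. `ConcatDictionaryChunks` is a hypothetical helper, not an existing Arrow/Parquet API, and it assumes the `arrow::Concatenate` in the Arrow version at hand can unify dictionary chunks whose dictionaries differ:

```c++
// Hypothetical sketch: merge the ChunkedArray the dictionary reader would
// return into a single dictionary array, unifying the per-chunk dictionaries.
#include <arrow/api.h>
#include <arrow/array/concatenate.h>

arrow::Result<std::shared_ptr<arrow::Array>> ConcatDictionaryChunks(
    const std::shared_ptr<arrow::ChunkedArray>& chunked,
    arrow::MemoryPool* pool = arrow::default_memory_pool()) {
  // Concatenation can only succeed if the unified dictionary still fits the
  // result type's limits (e.g. the merged binary dictionary must stay below
  // 2GB for int32 offsets); otherwise the caller keeps the chunked form.
  return arrow::Concatenate(chunked->chunks(), pool);
}
```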
