arthurpassos commented on issue #32723: URL: https://github.com/apache/arrow/issues/32723#issuecomment-1567327113
> > 1. Yes, I think so.
> >
> > 2. I think [ArrowReaderProperties](https://github.com/apache/arrow/blob/130f9e981aa98c25de5f5bfe55185db270cec313/cpp/src/parquet/properties.h#L778) is probably where this belongs. For per-column settings you can probably find inspiration from ParquetProperties (global might be fine for an initial implementation).
> >
> > 3. IIRC it's not really a memory limit as much as it is a limitation of the underlying address space of the Binary/String arrays, which allow for at most 2GB of data in a row group. I don't recall the code well enough to know if there are other edge cases that you might encounter, but I think this would solve most issues.

Cool, thanks. I have updated the draft PR with some refactorings, but it's no longer working. I suspect it's related to the dictionary encoding/decoding classes: they seem to be hard-coded to `int`, which might not work for the LARGE* variants. Do you know if it's necessary to have a 64-bit version of the dictionaries? The encoding/decoding code is huge and somewhat complex, so it would be great if I could skip changing it. It would mean tons of changes, and I am afraid of introducing bugs.

https://github.com/arthurpassos/arrow/blob/main/cpp/src/parquet/encoding.h
https://github.com/arthurpassos/arrow/blob/main/cpp/src/parquet/encoding.cc
