tustvold edited a comment on issue #1111: URL: https://github.com/apache/arrow-rs/issues/1111#issuecomment-1003153596
So I'm not sure there is an easy way to fix this...

`ArrowArrayReader` flattens all the pages from all the column chunks into iterators and then feeds these to `CompositeValueDecoder`, which decodes the levels and values independently. This makes it a non-trivial change to decode the levels and corresponding values from a given page in lock-step, which I believe is necessary in order to decode the correct number of values.

Rather than spending time re-working `ArrowArrayReader` in order to fix this bug, I'm **personally** going to focus on getting #1082 and the PRs it builds on polished up. This provides an alternative implementation for reading byte arrays that builds on the existing `ColumnReaderImpl` and `RecordReader` logic and so, much like `PrimitiveArrayReader`, does not run into this bug. My hope is that, by being both faster and duplicating less code, it will make sense to swap out `ArrowArrayReader` and therefore fix this bug for anything not using `ArrowArrayReader` explicitly.

If someone else wishes to work on fixing `ArrowArrayReader` that would be brilliant, but I'm going to focus my efforts elsewhere.

FYI @yordan-pavlov @alamb

Edit: In the short term, switching back to `ComplexObjectArrayReader` does fix the bug, but it represents a non-trivial performance regression (up to 6x), so I'm somewhat loath to suggest it.
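
To illustrate what decoding "in lock-step" means, here is a minimal toy sketch. It is not the parquet crate's code: `Page`, `decode_page` and `decode_lock_step` are hypothetical names, values are plain `i32`, and repetition levels are ignored. The only point it makes is that the number of values read from a page is derived from that same page's definition levels.

```rust
/// Hypothetical, simplified page: one definition level per slot,
/// plus only the non-null values (nulls are not stored).
pub struct Page {
    pub def_levels: Vec<i16>,
    pub values: Vec<i32>,
}

/// Decode a single page in lock-step: a value is consumed only when the
/// level for the current slot says the slot is non-null.
fn decode_page(
    page: &Page,
    max_def_level: i16,
    out: &mut Vec<Option<i32>>,
) -> Result<(), String> {
    let mut values = page.values.iter().copied();
    for &level in &page.def_levels {
        if level == max_def_level {
            // Non-null slot: take the next value from this page.
            let v = values
                .next()
                .ok_or_else(|| "page has fewer values than non-null levels".to_string())?;
            out.push(Some(v));
        } else {
            // Null slot: no value was stored for it.
            out.push(None);
        }
    }
    // Every stored value must have been matched by a non-null level.
    if values.next().is_some() {
        return Err("page has more values than non-null levels".to_string());
    }
    Ok(())
}

/// Decode pages one at a time, so levels and values never drift apart
/// across a page boundary.
pub fn decode_lock_step(
    pages: &[Page],
    max_def_level: i16,
) -> Result<Vec<Option<i32>>, String> {
    let mut out = Vec::new();
    for page in pages {
        decode_page(page, max_def_level, &mut out)?;
    }
    Ok(out)
}
```

Because each page is checked on its own, a mismatch between levels and values is reported at the page where it occurs instead of silently shifting values into later pages.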
