XiangpengHao commented on PR #6921:
URL: https://github.com/apache/arrow-rs/pull/6921#issuecomment-2695587076
> Hi @XiangpengHao , I know this PR is still a work in progress.
>
> But we encountered extremely slow filter-pushdown queries while doing a POC with datafusion / arrow-rs for parquet reads. [In some cases, e.g. querying for status = 200 or status = 400 over 100 million rows of application logs, it is 8x slower than FilterExec.]
>
> So I took your changes, applied them to arrow-rs 52.1.0, and did a round of testing with datafusion 45.0.
>
> ```rust
> if self.decoders.contains_key(&encoding) {
>     return Err(general_err!("Column cannot have more than one dictionary"));
> }
> ```
>
> All batch reads seem to end at this line in the column decoder (src/column/reader/decoder.rs) [the parquet schema has 5 fields and all are dict encoded] and hence fail with this error.
>
> So I just want to check: are there other planned PRs / changes pending for this PR before it gets merged?

Thank you for taking a look at this @bharath-techie. I plan to spend more time on this in the next few days.
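For context, the guard the reporter hit can be sketched in isolation. This is a simplified, illustrative stand-in, not the actual arrow-rs implementation: the `ColumnDecoder` struct, its `String`-valued decoder map, and the two-variant `Encoding` enum below are all hypothetical simplifications of the real types in src/column/reader/decoder.rs.

```rust
use std::collections::HashMap;

// Hypothetical simplification of parquet's Encoding enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Encoding {
    Plain,
    RleDictionary,
}

// Hypothetical simplification of the column decoder state: one decoder
// (here just a String placeholder) per encoding seen so far.
pub struct ColumnDecoder {
    decoders: HashMap<Encoding, String>,
}

impl ColumnDecoder {
    pub fn new() -> Self {
        Self { decoders: HashMap::new() }
    }

    // Mirrors the quoted guard: registering a dictionary for an encoding
    // that already has one is rejected, which is the error path every
    // batch read of the dict-encoded columns ran into.
    pub fn set_dict(&mut self, encoding: Encoding, dict: String) -> Result<(), String> {
        if self.decoders.contains_key(&encoding) {
            return Err("Column cannot have more than one dictionary".to_string());
        }
        self.decoders.insert(encoding, dict);
        Ok(())
    }
}

fn main() {
    let mut decoder = ColumnDecoder::new();
    // The first dictionary page for the column registers fine...
    assert!(decoder.set_dict(Encoding::RleDictionary, "dict-page-1".to_string()).is_ok());
    // ...but a second registration for the same encoding is rejected.
    assert!(decoder.set_dict(Encoding::RleDictionary, "dict-page-2".to_string()).is_err());
}
```

With five dict-encoded fields, any path that re-registers a column's dictionary (e.g. across reads) trips this check, which matches the failure mode described above.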
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]