XiangpengHao commented on PR #6921:
URL: https://github.com/apache/arrow-rs/pull/6921#issuecomment-2695587076
> Hi @XiangpengHao , I know this PR is still a work in progress.
>
> But we encountered extremely slow filter-pushdown queries while doing a POC with datafusion / arrow-rs for parquet reads. [In some cases, e.g. querying for status = 200 or status = 400 over 100 million rows of application logs, it is 8x slower than FilterExec.]
>
> So I took your changes, applied them to arrow-rs 52.1.0, and did a round of testing with datafusion 45.0.
>
> ```rust
> if self.decoders.contains_key(&encoding) {
>     return Err(general_err!("Column cannot have more than one dictionary"));
> }
> ```
>
> All batch reads seem to end at this line in the column decoder (src/column/reader/decoder.rs) [the parquet schema has 5 fields and all are dict encoded] and hence fail with this error.
>
> So I just want to check: are there other planned PRs / changes pending for this PR before it gets merged?

Thank you for taking a look at this @bharath-techie. I plan to spend more time on this in the next few days.
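For context, the guard the reporter hit can be sketched in isolation. This is a simplified, illustrative stand-in, not the actual arrow-rs implementation: the `ColumnDecoder` struct, its `String`-valued decoder map, and the two-variant `Encoding` enum below are all hypothetical simplifications of the real types in src/column/reader/decoder.rs.

```rust
use std::collections::HashMap;

// Hypothetical simplification of parquet's Encoding enum.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum Encoding {
    Plain,
    RleDictionary,
}

// Hypothetical simplification of the column decoder state: one decoder
// (here just a String placeholder) per encoding seen so far.
pub struct ColumnDecoder {
    decoders: HashMap<Encoding, String>,
}

impl ColumnDecoder {
    pub fn new() -> Self {
        Self { decoders: HashMap::new() }
    }

    // Mirrors the quoted guard: registering a dictionary for an encoding
    // that already has one is rejected, which is the error path every
    // batch read of the dict-encoded columns ran into.
    pub fn set_dict(&mut self, encoding: Encoding, dict: String) -> Result<(), String> {
        if self.decoders.contains_key(&encoding) {
            return Err("Column cannot have more than one dictionary".to_string());
        }
        self.decoders.insert(encoding, dict);
        Ok(())
    }
}

fn main() {
    let mut decoder = ColumnDecoder::new();
    // The first dictionary page for the column registers fine...
    assert!(decoder.set_dict(Encoding::RleDictionary, "dict-page-1".to_string()).is_ok());
    // ...but a second registration for the same encoding is rejected.
    assert!(decoder.set_dict(Encoding::RleDictionary, "dict-page-2".to_string()).is_err());
}
```

With five dict-encoded fields, any path that re-registers a column's dictionary (e.g. across reads) trips this check, which matches the failure mode described above.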
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]