Re: [I] Performance traps with arrow/parquet? [arrow-rs]

via GitHub Sat, 09 Mar 2024 12:18:20 -0800


tustvold commented on issue #5490:
URL: https://github.com/apache/arrow-rs/issues/5490#issuecomment-1986966644


   Parquet is a block-oriented data format, where the lowest addressable unit 
may contain hundreds or even millions of rows. Reading a small selection of 
rows is therefore very expensive: even when reading just 100 rows, the reader 
may have to decode far more than this in order to find the rows of interest.
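
   To make the cost concrete, here is a minimal sketch of a toy model. It is 
not how the parquet crate accounts for work: it assumes fixed-size pages and 
that a page is decoded in full whenever any selected row falls inside it. The 
function name `rows_decoded` and the page size are hypothetical, chosen only 
for illustration.

```rust
use std::collections::BTreeSet;

/// Toy model (an assumption, not the parquet crate's behaviour): rows live in
/// fixed-size pages, and a page is decoded in full if it contains at least
/// one selected row. Returns the total number of rows decoded.
fn rows_decoded(selected: &[usize], rows_per_page: usize) -> usize {
    // Collect the distinct page indices touched by the selection.
    let pages: BTreeSet<usize> = selected.iter().map(|row| row / rows_per_page).collect();
    pages.len() * rows_per_page
}

/// 100 consecutive rows starting at row 0.
fn consecutive_rows() -> Vec<usize> {
    (0..100).collect()
}

/// 100 rows spread out so that each one lands in a different page.
fn scattered_rows(rows_per_page: usize) -> Vec<usize> {
    (0..100).map(|i| i * rows_per_page).collect()
}
```

   With an assumed 10,000 rows per page, `rows_decoded(&consecutive_rows(), 
10_000)` is 10,000, while `rows_decoded(&scattered_rows(10_000), 10_000)` is 
1,000,000: the same 100 rows of interest, but a hundred times more decoding.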
   
   Some thoughts:
   
   * You should enable reading the page index, which will help push down the 
RowSelection to the page level - 
https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index
   * I would recommend using the object_store integration, which is better 
able to perform vectorised reads
   * I recommend against pushing down filters that don't result in consecutive 
runs of matching rows; instead, apply such filters after the fact
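
   As a rough illustration of the last point, the sketch below collapses a 
per-row filter mask into alternating skip/select runs and then applies a 
simple check on the average length of the selected runs. The `Run` enum only 
loosely mirrors the idea of a row selection, and `should_push_down` and its 
threshold are hypothetical, not part of the parquet crate.

```rust
/// A run of rows to skip or to select (illustrative type, not a crate API).
#[derive(Debug, PartialEq, Clone, Copy)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// Collapse a per-row boolean mask into alternating skip/select runs.
fn mask_to_runs(mask: &[bool]) -> Vec<Run> {
    let mut runs: Vec<Run> = Vec::new();
    for &keep in mask {
        // Extend the last run if it has the same kind as the current row.
        let extended = match runs.last_mut() {
            Some(Run::Select(n)) if keep => {
                *n += 1;
                true
            }
            Some(Run::Skip(n)) if !keep => {
                *n += 1;
                true
            }
            _ => false,
        };
        if !extended {
            runs.push(if keep { Run::Select(1) } else { Run::Skip(1) });
        }
    }
    runs
}

/// Hypothetical heuristic: push the selection down only if the selected runs
/// average at least `min_avg_run` rows; otherwise filter after reading.
fn should_push_down(runs: &[Run], min_avg_run: usize) -> bool {
    let (count, total) = runs.iter().fold((0usize, 0usize), |(c, t), run| match run {
        Run::Select(n) => (c + 1, t + n),
        Run::Skip(_) => (c, t),
    });
    count > 0 && total >= min_avg_run * count
}
```

   A mask that alternates between matching and non-matching rows yields many 
one-row runs and fails the check, which is the case where filtering after the 
read is likely the better choice. A sensible threshold depends on the data and 
would need to be measured.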
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
