tustvold commented on issue #5490: URL: https://github.com/apache/arrow-rs/issues/5490#issuecomment-1986966644
Parquet is a block-oriented data format, where the lowest addressable unit may contain hundreds or even millions of rows. Reading a small selection of rows is therefore very expensive: even when reading just 100 rows, the reader may have to decode far more than that in order to find the rows of interest.

Some thoughts:

* You should enable reading the page index, which will help push down the RowSelection to the page level - https://docs.rs/parquet/latest/parquet/arrow/arrow_reader/struct.ArrowReaderOptions.html#method.with_page_index
* I would recommend using the object_store integration, which is better able to perform vectorised reads
* I recommend against pushing down filters that don't result in consecutive runs of matching rows; apply those filters after the fact instead
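
To illustrate why consecutive runs of matching rows matter, here is a toy model in Rust. It is not the parquet crate's API: `rows_decoded` is a hypothetical helper, and it assumes a page must be decoded in full whenever it contains at least one selected row, which simplifies what a real reader does.

```rust
/// Toy model (an assumption, not the parquet crate's behaviour): a page is
/// decoded in full if it contains at least one selected row.
///
/// `page_rows` holds the row count of each page, in file order.
/// `selected` holds zero-based row indices, sorted ascending.
/// Returns the total number of rows that would have to be decoded.
fn rows_decoded(page_rows: &[usize], selected: &[usize]) -> usize {
    let mut decoded = 0;
    let mut page_start = 0;
    let mut next = 0; // index of the next unconsumed entry in `selected`

    for &rows in page_rows {
        let page_end = page_start + rows;

        // Consume every selected row that falls before the end of this page.
        let mut hit = false;
        while next < selected.len() && selected[next] < page_end {
            hit = true;
            next += 1;
        }

        // Any hit forces the whole page to be decoded in this model.
        if hit {
            decoded += rows;
        }
        page_start = page_end;
    }
    decoded
}
```

Under this model, with ten pages of 1,000 rows each, a run of 100 consecutive rows at the start of the file touches a single page, so 1,000 rows are decoded. Selecting 100 rows spaced 100 apart touches every page, so all 10,000 rows are decoded to return the same number of results.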
