Ted-Jiang commented on issue #2433: URL: https://github.com/apache/arrow-rs/issues/2433#issuecomment-1213912489
Makes sense, @tustvold. After reading the code, I found that there are two implementations of `skip_next_page`:

1. sync_reader: only supports skipping based on the `page_index`, as you mentioned.
2. async_reader: only supports skipping by parsing every data page header in the column chunk.

> In previous versions of the format, Statistics are stored for ColumnChunks in ColumnMetaData and for individual pages inside DataPageHeader structs. When reading pages, a reader had to process the page header to determine whether the page could be skipped based on the statistics. This means the reader had to access all pages in a column, thus likely reading most of the column data from disk.

(copied from the Parquet docs)

I think the page index is intended to avoid reading every page header. So I think we should support both methods in both the async and sync readers, with reading all page headers as the fallback 🤔
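
To make the two skipping strategies concrete, here is a rough sketch of a reader that uses a page index when one is present and otherwise falls back to decoding each page header. Everything in it is hypothetical and only for illustration: `ToyPageReader`, `PageLocation`, `encode_pages`, and the 4-byte length prefix standing in for a real Thrift page header are assumptions, not the parquet crate's API or the actual file layout.

```rust
// Hypothetical toy model; NOT the parquet crate's types or on-disk format.

/// Size of the toy "page header": a 4-byte little-endian body length.
const HEADER_LEN: usize = 4;

/// Where one page lives in the chunk, as an offset index would record it.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PageLocation {
    offset: usize,          // byte offset of the page header
    compressed_size: usize, // header bytes + body bytes
}

struct ToyPageReader<'a> {
    data: &'a [u8],
    page_index: Option<Vec<PageLocation>>,
    next_page: usize,      // ordinal of the next page to read or skip
    pos: usize,            // byte cursor into `data`
    headers_parsed: usize, // how many headers we had to decode
}

impl<'a> ToyPageReader<'a> {
    fn new(data: &'a [u8], page_index: Option<Vec<PageLocation>>) -> Self {
        ToyPageReader { data, page_index, next_page: 0, pos: 0, headers_parsed: 0 }
    }

    /// Skips one page. Returns `false` when no page is left to skip.
    fn skip_next_page(&mut self) -> bool {
        if let Some(index) = &self.page_index {
            // Path 1: the page index gives the location, so no header is decoded.
            match index.get(self.next_page) {
                Some(loc) => self.pos = loc.offset + loc.compressed_size,
                None => return false,
            }
        } else {
            // Path 2 (fallback): decode the header to learn the body length.
            let header = match self.data.get(self.pos..self.pos + HEADER_LEN) {
                Some(h) => h,
                None => return false,
            };
            let body_len =
                u32::from_le_bytes([header[0], header[1], header[2], header[3]]) as usize;
            let end = self.pos + HEADER_LEN + body_len;
            if end > self.data.len() {
                return false; // truncated page
            }
            self.headers_parsed += 1;
            self.pos = end;
        }
        self.next_page += 1;
        true
    }
}

/// Builds a toy column chunk and the matching page index from page bodies.
fn encode_pages(bodies: &[&[u8]]) -> (Vec<u8>, Vec<PageLocation>) {
    let mut data = Vec::new();
    let mut index = Vec::new();
    for body in bodies {
        index.push(PageLocation {
            offset: data.len(),
            compressed_size: HEADER_LEN + body.len(),
        });
        data.extend_from_slice(&(body.len() as u32).to_le_bytes());
        data.extend_from_slice(body);
    }
    (data, index)
}
```

Both paths leave the cursor at the same byte position, but only the fallback has to touch the header bytes, which is the cost the page index is meant to avoid.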
