[GitHub] [arrow-rs] Ted-Jiang commented on issue #2433: Add GenericColumnReader::skip_records Missing OffsetIndex Fallback

GitBox Sat, 13 Aug 2022 00:28:42 -0700


Ted-Jiang commented on issue #2433:
URL: https://github.com/apache/arrow-rs/issues/2433#issuecomment-1213912489


   Makes sense, @tustvold.
   After reading the code, I found that
   there are two kinds of `skip_next_page` implementations:
   1. sync_reader: only supports skipping via the `page_index`, as you mentioned.
   2. async_reader: only supports skipping by parsing all data page headers 
in the column chunk.
   
   > in previous versions of the format, Statistics are stored for ColumnChunks 
in ColumnMetaData and for individual pages inside DataPageHeader structs. When 
reading pages, a reader had to process the page header to determine whether the 
page could be skipped based on the statistics. This means the reader had to 
access all pages in a column, thus likely reading most of the column data from 
disk. (copied from the Parquet docs)
   
   I think the page_index is intended to avoid reading every page header.
   So I think we should support both methods in both the async and sync readers 
(reading all page headers as the fallback) 🤔 
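
   A rough sketch of the idea, only to illustrate the fallback. All types and
   names here (`ChunkReader`, `PageLocation`, `PageHeader`, `SkipStrategy`) are
   made up for the example and are not the actual arrow-rs API. It assumes that
   an offset index entry gives the page start and its total size (header
   included), and that without an index each page header has to be decoded to
   learn where the next page begins.

```rust
/// Hypothetical stand-in for one offset index entry.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PageLocation {
    offset: u64,     // byte position where the page (header included) starts
    total_size: u64, // header length + compressed page size (assumption)
}

/// Hypothetical stand-in for a decoded data page header.
#[derive(Debug, Clone, Copy, PartialEq)]
struct PageHeader {
    header_len: u64,
    compressed_size: u64,
}

/// Which path was taken to skip a page.
#[derive(Debug, PartialEq)]
enum SkipStrategy {
    OffsetIndex,
    HeaderScan,
}

/// Hypothetical reader over one column chunk.
struct ChunkReader<'a> {
    headers: &'a [PageHeader],               // what decoding each header would yield
    offset_index: Option<&'a [PageLocation]>, // None when the file has no page index
    pos: u64,                                // byte position of the next unread page
    next_page: usize,
    headers_decoded: usize,                  // counts the header decodes we paid for
}

impl<'a> ChunkReader<'a> {
    fn new(
        headers: &'a [PageHeader],
        offset_index: Option<&'a [PageLocation]>,
        chunk_start: u64,
    ) -> Self {
        Self { headers, offset_index, pos: chunk_start, next_page: 0, headers_decoded: 0 }
    }

    /// Skip the next page. Prefer the offset index; fall back to decoding
    /// the page header when no index is available.
    fn skip_next_page(&mut self) -> Option<SkipStrategy> {
        if self.next_page >= self.headers.len() {
            return None; // no pages left in this column chunk
        }
        let strategy = match self.offset_index {
            Some(index) => {
                // Jump straight past the page without touching its header.
                let loc = index.get(self.next_page)?;
                self.pos = loc.offset + loc.total_size;
                SkipStrategy::OffsetIndex
            }
            None => {
                // Fallback: decode the header to learn the page size.
                let h = self.headers[self.next_page];
                self.headers_decoded += 1;
                self.pos += h.header_len + h.compressed_size;
                SkipStrategy::HeaderScan
            }
        };
        self.next_page += 1;
        Some(strategy)
    }
}
```

   Both paths end up at the same byte position; the difference is that the
   fallback has to decode one header per skipped page, which is the cost the
   page index is meant to avoid.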


