emkornfield commented on PR #242:
URL: https://github.com/apache/parquet-format/pull/242#issuecomment-2116279057

   > @pitrou In conjunction with this change, if we want improved random access 
for row groups and columns I think this would also be a good time to upgrade 
the OffsetIndex / ColumnIndex in two key ways:
   > 
   > 1. Have OffsetIndex be stored in a random access way rather than using a 
list so that an individual page chunk can be loaded without needing to read the 
entire OffsetIndex array.
   > 2. Have OffsetIndex explicitly include the dictionary page in addition to 
any data pages so that column data can be directly loaded from the OffsetIndex 
without needing to get all offsets from the metadata.
   > 
   > I think this would make the ColumnIndex a lot more powerful as it could 
then be used for projection pushdown in a much faster way without the large 
overhead it has now.
   
   These are reasonable suggestions, but I think they can be handled as a 
follow-up once we align on design principles here.  In general, for 
dictionaries (and other "auxiliary" metadata), we should probably consider 
more holistically how pages can be linked effectively.
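
   To make suggestion 1 concrete, here is a rough sketch of what a 
random-access layout could look like. It is not a proposed format: the 
fixed-width encoding, the `ENTRY` layout, the helper names, and the convention 
of storing the dictionary page as entry 0 with a first row index of -1 are all 
assumptions made for illustration. Only the three fields (offset, compressed 
page size, first row index) mirror the existing PageLocation struct.

```python
import io
import struct

# Hypothetical fixed-width entry: offset (int64), compressed_page_size (int32),
# first_row_index (int64). Little-endian, no padding, so every entry is 20 bytes.
ENTRY = struct.Struct("<qiq")


def write_index(entries):
    """Serialize page locations back to back as fixed-width records."""
    return b"".join(ENTRY.pack(*e) for e in entries)


def read_entry(f, index_start, i):
    """Read only the i-th entry; no other part of the index is touched."""
    f.seek(index_start + i * ENTRY.size)
    return ENTRY.unpack(f.read(ENTRY.size))


# Assumed convention: entry 0 is the dictionary page (first_row_index = -1),
# followed by the data pages.
pages = [(4, 100, -1), (104, 2000, 0), (2104, 1800, 500)]
buf = io.BytesIO(b"PAR1" + write_index(pages))

# Fetch the location of the second data page directly.
print(read_entry(buf, 4, 2))
```

   Because every entry has the same size, a reader can seek straight to entry 
i, which a variable-length encoded list does not allow without decoding the 
entries before it.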


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

