emkornfield commented on PR #242: URL: https://github.com/apache/parquet-format/pull/242#issuecomment-2116279057
> @pitrou In conjunction with this change, if we want improved random access for row groups and columns I think this would also be a good time to upgrade the OffsetIndex / ColumnIndex in two key ways:
>
> 1. Have OffsetIndex be stored in a random access way rather than using a list so that an individual page chunk can be loaded without needing to read the entire OffsetIndex array.
> 2. Have OffsetIndex explicitly include the dictionary page in addition to any data pages so that column data can be directly loaded from the OffsetIndex without needing to get all offsets from the metadata.
>
> I think this would make the ColumnIndex a lot more powerful as it could then be used for projection pushdown in a much faster way without the large overhead it has now.

I think these are reasonable suggestions, but I think they can be handled as a follow-up once we align on design principles here. In general for dictionaries (and other "auxiliary") metadata we should maybe consider this more holistically, on how pages can be linked effectively.
