[ https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760723#comment-17760723 ]
ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700406992

Since we all seem to be in agreement now, it's probably good to list the options available and then make a decision on which to use. My (probably incomplete) list would be:

1. Simply add `SizeStatistics` to `ColumnIndex`. This is the simplest solution, keeps the new data together, and mirrors what is being added to `ColumnMetaData`. The downside is extra storage and work for clients that may not use this new information.
2. Add `RepetitionDefinitionLevelHistogram` to `ColumnIndex` and `unencoded_variable_width_stored_bytes` to `OffsetIndex` (either by adding it as an optional field in the `PageLocation`, or as an optional `list<i64>` in `OffsetIndex`). This is the next simplest to implement, and has modest savings over option 1. This suffers the same drawback that clients are forced to read this extra information.
3. Add a size/location pair to `ColumnMetaData` and a new struct containing `list<SizeStatistics>`, mirroring how `OffsetIndex` is written. This allows clients that have no need for this information to ignore it, and allows clients that don't need the full column/offset indexes access to just the size information, but adds complexity and requires reading a third structure for those clients that will use all three.

I think 3 is maybe the most flexible, but since I'd almost always be using all three structures anyway, I'd likely vote for 1 or 2. If forced to pick, I'd probably take 1 right now since I already have it implemented :) I do have the cycles to try out 2 and 3 and can report back if that would be helpful.
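
For readers comparing the three options, here is a rough sketch of where the new data would live in each case. It uses Python dataclasses as a stand-in for the Thrift definitions and is not taken from the PR. The struct names `SizeStatistics`, `ColumnIndex`, `OffsetIndex`, and `ColumnMetaData` and the field `unencoded_variable_width_stored_bytes` come from the comment above; every other name (`SizeIndex`, `size_index_offset`, `size_index_length`, the histogram fields, `structures_read`) is hypothetical, and the read-count helper is a deliberate simplification of the trade-off described.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SizeStatistics:
    # Illustrative fields only; the real Thrift struct may differ.
    unencoded_variable_width_stored_bytes: Optional[int] = None
    repetition_level_histogram: List[int] = field(default_factory=list)
    definition_level_histogram: List[int] = field(default_factory=list)


# Option 1: per-page SizeStatistics stored directly in ColumnIndex.
@dataclass
class ColumnIndexOpt1:
    size_statistics: Optional[List[SizeStatistics]] = None


# Option 2: histograms in ColumnIndex, byte counts in OffsetIndex
# (shown here as the optional list<i64> variant).
@dataclass
class ColumnIndexOpt2:
    level_histograms: Optional[List[List[int]]] = None  # hypothetical shape


@dataclass
class OffsetIndexOpt2:
    unencoded_variable_width_stored_bytes: Optional[List[int]] = None


# Option 3: a separate structure, located via a size/location pair
# in ColumnMetaData. Both names below are hypothetical.
@dataclass
class SizeIndex:
    size_statistics: List[SizeStatistics]


@dataclass
class ColumnMetaDataOpt3:
    size_index_offset: Optional[int] = None
    size_index_length: Optional[int] = None


def structures_read(option: int, wants_sizes: bool, wants_indexes: bool) -> int:
    """Count the index structures a client must read (simplified model)."""
    count = 2 if wants_indexes else 0  # ColumnIndex + OffsetIndex
    if wants_sizes:
        if option == 1:
            count = max(count, 1)  # sizes ride along in ColumnIndex
        elif option == 2:
            count = 2  # sizes are split across both existing indexes
        else:
            count += 1  # sizes live in their own third structure
    return count


if __name__ == "__main__":
    for opt in (1, 2, 3):
        print(opt, structures_read(opt, True, True), structures_read(opt, True, False))
```

Under this simplified model, option 3 costs a third read for clients that want everything, but lets a client that only wants size information read a single structure, which is the flexibility argument made in the comment.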
> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)