[ 
https://issues.apache.org/jira/browse/PARQUET-2261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17760723#comment-17760723
 ] 

ASF GitHub Bot commented on PARQUET-2261:
-----------------------------------------

etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1700406992

   Since we all seem to be in agreement now, it's probably good to list the 
options available and then make a decision on which to use. My (probably 
incomplete) list would be:
   
   1. Simply add `SizeStatistics` to `ColumnIndex`. This is the simplest 
solution, keeps the new data together, and mirrors what is being added to 
`ColumnMetaData`. The downside is extra storage and work for clients that may 
not use this new information.
   2. Add `RepetitionDefinitionLevelHistogram` to `ColumnIndex` and 
`unencoded_variable_width_stored_bytes` to `OffsetIndex` (either as an 
optional field in `PageLocation`, or as an optional `list<i64>` in 
`OffsetIndex`). This is the next simplest to implement, and has modest savings 
over option 1. It has the same drawback as option 1: clients are forced to read 
the extra information whether or not they use it.
   3. Add a size/location pair to `ColumnMetaData` and a new struct containing 
`list<SizeStatistics>`, mirroring how `OffsetIndex` is written. Clients that 
have no need for this information can ignore it, and clients that don't need 
the full column/offset indexes can read just the size information. The cost is 
added complexity, plus a third structure to read for those clients that will 
use all three.
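   
   To make the trade-off concrete, here is a rough Python sketch of which 
structures a client would have to read under each option. It is only a model 
of the reasoning above, not the Thrift definition: the `Option` enum, the 
function names, and the name `SizeStatisticsIndex` for the new struct in 
option 3 are all made up for illustration.

```python
from enum import Enum
from typing import Set


class Option(Enum):
    IN_COLUMN_INDEX = 1       # option 1: SizeStatistics inside ColumnIndex
    SPLIT_ACROSS_INDEXES = 2  # option 2: histograms in ColumnIndex, bytes in OffsetIndex
    SEPARATE_STRUCT = 3       # option 3: separate struct located via ColumnMetaData


def structures_to_read(option: Option, want_sizes: bool,
                       want_column_index: bool,
                       want_offset_index: bool) -> Set[str]:
    """Return the set of index structures a client has to fetch and decode."""
    needed: Set[str] = set()
    if want_column_index:
        needed.add("ColumnIndex")
    if want_offset_index:
        needed.add("OffsetIndex")
    if want_sizes:
        if option is Option.IN_COLUMN_INDEX:
            needed.add("ColumnIndex")
        elif option is Option.SPLIT_ACROSS_INDEXES:
            # Size data is split, so both existing indexes are needed.
            needed.update({"ColumnIndex", "OffsetIndex"})
        else:
            # "SizeStatisticsIndex" is a hypothetical name for the new struct.
            needed.add("SizeStatisticsIndex")
    return needed


def carries_unwanted_size_data(option: Option, want_sizes: bool,
                               want_column_index: bool,
                               want_offset_index: bool) -> bool:
    """True when a client that does not want size data must still decode it."""
    if want_sizes:
        return False
    if option is Option.IN_COLUMN_INDEX:
        return want_column_index
    if option is Option.SPLIT_ACROSS_INDEXES:
        return want_column_index or want_offset_index
    return False  # option 3 keeps the size data out of the existing indexes
```

   For example, `structures_to_read(Option.SEPARATE_STRUCT, True, True, True)` 
returns all three structures, while a size-only client under option 3 reads 
just the new struct.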
   
   I think option 3 is probably the most flexible, but since I'd almost always 
be using all three structures anyway, I'd likely vote for 1 or 2. If forced to 
pick, I'd probably take 1 right now since I already have it implemented :) I do 
have the cycles to try out 2 and 3 and can report back if that would be helpful.
   




> [Format] Add statistics that reflect decoded size to metadata
> -------------------------------------------------------------
>
>                 Key: PARQUET-2261
>                 URL: https://issues.apache.org/jira/browse/PARQUET-2261
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-format
>            Reporter: Micah Kornfield
>            Assignee: Micah Kornfield
>            Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)
