[GitHub] [parquet-format] etseidl commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Thu, 31 Aug 2023 11:21:39 -0700


etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1701548513


   I've implemented option 2 now. As expected, the size impact is somewhat 
smaller because there is less nesting in the thrift output. Here are some 
comparison numbers (apologies, it seems my earlier option 1 implementation left 
out some bytes somewhere, so the sizes of the indexes have increased somewhat 
since then). Option 3 _should_ have the same size impact as option 1, but with 
those extra bytes moved to a new structure.
   ```
   nested 1
                    column  offset   delta
     no stats         3085     883
     full stats 1     4713     883    +1628
     full stats 2     4317    1139    +1488
   
   nested 2
                    column  offset   delta
     no stats        22704    6802
     full stats 1    38755    6802   +16051
     full stats 2    35340    9207   +15041
   
   flat 1
                    column  offset   delta
     no stats      1730740  854682
     full stats 1  2555824  854682  +825084
     full stats 2  2286518 1000333  +701429
   
   flat 2
                    column  offset   delta
     no stats      2322339 1027144
     full stats 1  3335139 1027144 +1012800
     full stats 2  2955139 1267144  +872800
   ```
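
For anyone checking the arithmetic, here is a minimal sketch that recomputes the `delta` column from the tables above. It assumes `column` and `offset` are the column index and offset index sizes in bytes, and that `delta` is the combined growth of both over the "no stats" baseline. The names `SIZES`, `delta`, and `option_2_savings` are illustrative only and are not part of the PR.

```python
# Sizes copied from the tables above, as (column, offset) per configuration.
# Assumption: both values are index sizes in bytes.
SIZES = {
    "nested 1": {
        "no stats": (3085, 883),
        "full stats 1": (4713, 883),
        "full stats 2": (4317, 1139),
    },
    "nested 2": {
        "no stats": (22704, 6802),
        "full stats 1": (38755, 6802),
        "full stats 2": (35340, 9207),
    },
    "flat 1": {
        "no stats": (1730740, 854682),
        "full stats 1": (2555824, 854682),
        "full stats 2": (2286518, 1000333),
    },
    "flat 2": {
        "no stats": (2322339, 1027144),
        "full stats 1": (3335139, 1027144),
        "full stats 2": (2955139, 1267144),
    },
}


def delta(file_name, config):
    """Combined growth of both indexes relative to the no-stats baseline."""
    base_col, base_off = SIZES[file_name]["no stats"]
    col, off = SIZES[file_name][config]
    return (col - base_col) + (off - base_off)


def option_2_savings(file_name):
    """Bytes saved by option 2 compared with option 1 for one file."""
    return delta(file_name, "full stats 1") - delta(file_name, "full stats 2")


if __name__ == "__main__":
    for name in SIZES:
        d1 = delta(name, "full stats 1")
        d2 = delta(name, "full stats 2")
        saved = option_2_savings(name)
        print(f"{name}: option 1 +{d1}, option 2 +{d2}, "
              f"option 2 saves {saved} ({100 * saved / d1:.1f}%)")
```

Under that reading, option 2 shifts some of the growth from the column index into the offset index, and the combined total comes out lower for all four files.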
   
   I also did a quick test using @mapleFU's suggestion to only write the 
histograms if `max_level > 1`. As you'd expect, for the files with a flat 
schema no histogram data was written at all. For the nested files the histogram 
size was reduced, but not by much (only 300 bytes for "nested 2").
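
A rough sketch of the check described above, assuming the `max_level > 1` test is applied separately to the definition and repetition levels of each column. `ColumnLevels` and `histograms_to_write` are hypothetical names for illustration, not the actual writer code.

```python
from dataclasses import dataclass


@dataclass
class ColumnLevels:
    """Hypothetical holder for a leaf column's maximum levels."""
    name: str
    max_def_level: int
    max_rep_level: int


def histograms_to_write(col):
    """Return which level histograms would be written for this column.

    Assumption: a histogram is only written when the corresponding
    maximum level is greater than 1.
    """
    written = []
    if col.max_def_level > 1:
        written.append("definition")
    if col.max_rep_level > 1:
        written.append("repetition")
    return written


if __name__ == "__main__":
    columns = [
        ColumnLevels("flat_optional", max_def_level=1, max_rep_level=0),
        ColumnLevels("optional_list", max_def_level=3, max_rep_level=1),
        ColumnLevels("list_of_lists", max_def_level=5, max_rep_level=2),
    ]
    for col in columns:
        print(col.name, histograms_to_write(col))
```

With a flat schema the maximum definition level is at most 1 and the maximum repetition level is 0, which matches the observation that no histogram data was written for the flat files.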
   @emkornfield @wgtmac @mapleFU @gszadovszky 


