etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1701548513
I've implemented option 2 now. As expected, the size impact is somewhat smaller
because there is less nesting in the Thrift output. Here are some comparison numbers
(apologies, it seems my earlier option 1 implementation left out some bytes
somewhere, so the sizes of the indexes have increased somewhat). Option 3
_should_ have the same size impact as option 1, but with those extra bytes
moved to a new structure.
```
nested 1
                 column    offset      delta
no stats           3085       883
full stats 1       4713       883      +1628
full stats 2       4317      1139      +1488

nested 2
                 column    offset      delta
no stats          22704      6802
full stats 1      38755      6802     +16051
full stats 2      35340      9207     +15041

flat 1
                 column    offset      delta
no stats        1730740    854682
full stats 1    2555824    854682    +825084
full stats 2    2286518   1000333    +701429

flat 2
                 column    offset      delta
no stats        2322339   1027144
full stats 1    3335139   1027144   +1012800
full stats 2    2955139   1267144    +872800
```
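
For anyone checking the arithmetic, here is a minimal sketch of how the delta column can be reproduced, assuming "column" and "offset" are the two index sizes in bytes and that delta is the combined growth of both relative to the "no stats" row. The `SIZES` dictionary and the `delta` helper are hypothetical names used only for illustration. The numbers are copied from the table above.

```python
# Sizes in bytes copied from the table: (column, offset) for each file and variant.
# SIZES and delta() are illustrative names, not part of any Parquet tooling.
SIZES = {
    "nested 1": {
        "no stats": (3085, 883),
        "full stats 1": (4713, 883),
        "full stats 2": (4317, 1139),
    },
    "nested 2": {
        "no stats": (22704, 6802),
        "full stats 1": (38755, 6802),
        "full stats 2": (35340, 9207),
    },
    "flat 1": {
        "no stats": (1730740, 854682),
        "full stats 1": (2555824, 854682),
        "full stats 2": (2286518, 1000333),
    },
    "flat 2": {
        "no stats": (2322339, 1027144),
        "full stats 1": (3335139, 1027144),
        "full stats 2": (2955139, 1267144),
    },
}


def delta(file_name, variant):
    """Combined growth of both indexes relative to the 'no stats' baseline."""
    base_column, base_offset = SIZES[file_name]["no stats"]
    column, offset = SIZES[file_name][variant]
    return (column - base_column) + (offset - base_offset)


if __name__ == "__main__":
    for file_name, variants in SIZES.items():
        for variant in variants:
            if variant != "no stats":
                print(f"{file_name:10s} {variant:14s} {delta(file_name, variant):+d}")
```

Under that assumption, the "full stats 2" rows grow both indexes while the "full stats 1" rows leave the offset size unchanged, which is why the delta has to sum both columns.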
I also did a quick test using @mapleFU's suggestion to only write the
histograms if `max_level > 1`. As you'd expect, for the files with a flat
schema no histogram data was written at all. For the nested files the histogram
size was reduced, but not by much (only 300 bytes for "nested 2").
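
A rough sketch of the rule I tested, assuming it is applied to a single level histogram at a time. The function name `histogram_to_write` and the list-of-counts representation are hypothetical and are not taken from the PR or from any Parquet implementation.

```python
from typing import List, Optional


def histogram_to_write(max_level: int, histogram: List[int]) -> Optional[List[int]]:
    """Return the histogram only when max_level > 1; otherwise skip it.

    Hypothetical helper: a return value of None stands for "write nothing".
    """
    if max_level > 1:
        return histogram
    return None


if __name__ == "__main__":
    # max_level of 0 or 1: the histogram is skipped.
    print(histogram_to_write(0, [100]))
    print(histogram_to_write(1, [3, 97]))
    # max_level of 2 or more: the histogram is still written.
    print(histogram_to_write(2, [1, 2, 97]))
```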
@emkornfield @wgtmac @mapleFU @gszadovszky