etseidl commented on PR #197: URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1699773196
Hi all, just wanted to share some preliminary results with the new statistics. I implemented this PR using both the `RepetitionDefinitionLevelHistogram` and the full `SizeStatistics` struct in the `ColumnIndex`. I used four files I use frequently for testing: two large files with a flat schema and varying mixes of integer and string data, and two smaller files that are deeply nested. The table below shows the impact on the size of the `ColumnIndex`, as well as the impact on total file size, for each of the test files.

```
-----------------------------------------------------------------------------
|          |                 |         column index size (bytes)         |
| file     | file size (MiB) | no size stats | histograms | full size stats |
-----------------------------------------------------------------------------
| flat 1   |          1883.1 |       1730740 |    2229005 |         2498311 |
-----------------------------------------------------------------------------
| flat 2   |          1695.4 |       2322339 |    2884139 |         3265139 |
-----------------------------------------------------------------------------
| nested 1 |            12.1 |          3085 |       4287 |            4683 |
-----------------------------------------------------------------------------
| nested 2 |           282.2 |         22704 |      34852 |           38267 |
-----------------------------------------------------------------------------
```

For the files with a flat schema, the histograms resulted in a 24-29% increase in the index size. Adding in the unencoded size bumped that to a 41-44% increase. The relatively large impact of the added size info is due to a) the lack of a repetition level histogram, and b) a small definition level histogram (2 bins), so the histograms themselves contribute little for these files.

For the nested files, the histograms added roughly 39-54% to the `ColumnIndex` size, now that the repetition level histograms are populated and the max definition level is as high as 9. For these files, the addition of the size info had a less dramatic effect, with the full stats adding 52-69% to the index.
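For anyone who wants to check the percentages against the table above, here is a minimal sketch that recomputes them from the raw byte counts. The helper names (`pct_increase`, `summarize`) are made up for illustration, and it assumes the file size column is in binary MiB (2^20 bytes) and approximates the file size before the size statistics were added.

```python
# Figures copied from the table above: (file size in MiB, ColumnIndex bytes
# with no size stats, with histograms only, with full SizeStatistics).
RESULTS = {
    "flat 1": (1883.1, 1730740, 2229005, 2498311),
    "flat 2": (1695.4, 2322339, 2884139, 3265139),
    "nested 1": (12.1, 3085, 4287, 4683),
    "nested 2": (282.2, 22704, 34852, 38267),
}

MIB = 1024 * 1024  # assumption: the file size column uses binary mebibytes


def pct_increase(base, new):
    """Percentage growth of `new` relative to `base`."""
    return 100.0 * (new - base) / base


def summarize(results):
    """Return per-file growth figures, keyed by file name."""
    summary = {}
    for name, (size_mib, base, hist, full) in results.items():
        summary[name] = {
            # ColumnIndex growth from the level histograms alone.
            "hist_pct": pct_increase(base, hist),
            # ColumnIndex growth from the full size statistics.
            "full_pct": pct_increase(base, full),
            # Added index bytes as a share of the whole file.
            "file_pct": 100.0 * (full - base) / (size_mib * MIB),
        }
    return summary


if __name__ == "__main__":
    for name, row in summarize(RESULTS).items():
        print(f"{name:9s} histograms +{row['hist_pct']:.1f}%  "
              f"full stats +{row['full_pct']:.1f}%  "
              f"file size +{row['file_pct']:.4f}%")
```

The last column relates the extra index bytes to the total file size, which is why a large relative jump in the `ColumnIndex` can still be a tiny change overall.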
The overall impact on file size was negligible, however, with the largest increase being an additional 0.053%.

So the good news here is no dramatic increase in file sizes, but the bad news is a pretty significant hit to `ColumnIndex` sizes. If the latter is a concern, perhaps it is a better idea to move the per-page size statistics to their own structure, separate from the page indexes. Then the page histogram data could be skipped altogether if the filtering predicate doesn't include any `null` logic.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
