[GitHub] [parquet-format] etseidl commented on pull request #197: PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering

via GitHub Wed, 30 Aug 2023 13:14:00 -0700


etseidl commented on PR #197:
URL: https://github.com/apache/parquet-format/pull/197#issuecomment-1699773196


   Hi all, just wanted to share some preliminary results with the new 
statistics. I implemented this PR two ways: with only the 
`RepetitionDefinitionLevelHistogram`, and with the full `SizeStatistics` struct, 
in the `ColumnIndex`. I used four files that I frequently use for testing: two 
large files with a flat schema and varying mixes of integer and string data, and 
two smaller files that are deeply nested. The table below shows the impact on 
the size of the `ColumnIndex` for each of the test files, alongside each file's 
total size.
   
   ```
   ------------------------------------------------------------------------------
   |          |                 |            column index size (bytes)          |
   | file     | file size (MiB) | no size stats |  histograms | full size stats |
   ------------------------------------------------------------------------------
   | flat 1   |     1883.1      |   1730740     |   2229005   |     2498311     |
   ------------------------------------------------------------------------------
   | flat 2   |     1695.4      |   2322339     |   2884139   |     3265139     |
   ------------------------------------------------------------------------------
   | nested 1 |       12.1      |      3085     |      4287   |        4683     |
   ------------------------------------------------------------------------------
   | nested 2 |      282.2      |     22704     |     34852   |       38267     |
   ------------------------------------------------------------------------------
   ```
   For the files with a flat schema, the histograms resulted in a 24-29% 
increase in the index size. Adding the unencoded size bumped that to a 
41-44% increase. The added size info has a relatively large impact here because 
the histograms themselves are small: a) there is no repetition level histogram, 
and b) the definition level histogram has only 2 bins. For the nested files, the 
histograms added roughly 39-54% to the `ColumnIndex` size, since the repetition 
level histograms are now populated and the max definition level is as high as 9. 
For these files, the addition of the size info had a less dramatic effect, with 
the full stats adding 52-69% to the index.
   
   The overall impact on file size was negligible, however, with the largest 
increase being an additional 0.053%.
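
For anyone who wants to re-derive the percentages, here is a minimal sketch that recomputes them from the table. The names `RESULTS`, `pct_increase` and `summarize` are made up for illustration. The file-size figure is only an estimate: it assumes the listed file size is the baseline and that the `ColumnIndex` growth is the only change to the file.

```python
# Sizes copied from the table: (file size in MiB, ColumnIndex bytes with
# no size stats, with histograms only, with full size stats).
RESULTS = {
    "flat 1": (1883.1, 1730740, 2229005, 2498311),
    "flat 2": (1695.4, 2322339, 2884139, 3265139),
    "nested 1": (12.1, 3085, 4287, 4683),
    "nested 2": (282.2, 22704, 34852, 38267),
}

MIB = 1024 * 1024  # bytes per MiB


def pct_increase(base, new):
    """Relative growth of `new` over `base`, in percent."""
    return 100.0 * (new - base) / base


def summarize(results=RESULTS):
    """Per-file growth of the index and estimated growth of the whole file."""
    rows = {}
    for name, (file_mib, base, hist, full) in results.items():
        rows[name] = {
            "hist_pct": pct_increase(base, hist),
            "full_pct": pct_increase(base, full),
            # Assumption: the file grows only by the extra index bytes.
            "file_pct": 100.0 * (full - base) / (file_mib * MIB),
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize().items():
        print(
            f"{name:9s} hist +{row['hist_pct']:.1f}%  "
            f"full +{row['full_pct']:.1f}%  file +{row['file_pct']:.4f}%"
        )
```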
   
   So the good news here is that there is no dramatic increase in file sizes, 
but the bad news is a pretty significant hit to `ColumnIndex` sizes. If the 
latter is a concern, perhaps it would be better to move the per-page size 
statistics to their own structure, separate from the page indexes. Then the page 
histogram data could be skipped altogether if the filtering predicate doesn't 
include any `null` logic.
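
To make the idea concrete, here is a rough sketch of how a reader might skip a separately stored size-statistics structure. Everything in it is hypothetical: `PageIndexReader`, `needs_level_histograms`, the operator names, and the two loader callables are stand-ins for illustration, not part of this PR or of any existing Parquet implementation.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical operator names for predicates that need per-page null info.
NULL_OPS = {"is_null", "is_not_null"}


def needs_level_histograms(predicate_ops):
    """Return True if any operator in the predicate involves null logic."""
    return any(op in NULL_OPS for op in predicate_ops)


@dataclass
class PageIndexReader:
    # Callables standing in for the I/O that would fetch each structure.
    load_column_index: Callable[[], dict]
    load_size_stats: Callable[[], dict]

    def read(self, predicate_ops):
        """Always read the ColumnIndex; read size stats only when needed."""
        index = {"column_index": self.load_column_index()}
        if needs_level_histograms(predicate_ops):
            index["size_stats"] = self.load_size_stats()
        return index
```

With a layout like this, a predicate such as `x > 5` would only pay for the smaller `ColumnIndex`, while a predicate containing `IS NULL` would trigger the extra read.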
   


