Re: [PR] PARQUET-34: implement Size() filter for repeated columns [parquet-java]

via GitHub Fri, 20 Dec 2024 12:04:00 -0800


emkornfield commented on PR #3098:
URL: https://github.com/apache/parquet-java/pull/3098#issuecomment-2557642415


   I can try to look in more detail, but stats can certainly be used here. I 
imagine they are most useful for repeated fields that mostly have 0 or 1 
elements, when the filter is trying to exclude lists with more than 0 or 1 
elements. E.g. if all fields have 0 observed rep_levels of 1, then one knows 
for sure that all lists are of length 0 or 1 (whether there are any lists of 
length 0 or of length 1 can be determined by inspecting the def level 
histogram). For larger cardinality lists the filtering power diminishes 
significantly (it's hard to distinguish, based on histograms alone, between 
many very small lists and one very large one).
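
   To make the reasoning above concrete, here is a minimal sketch in plain 
Java. It is not the parquet-java API: the class and method names are 
hypothetical, the histograms are passed as plain long[] arrays indexed by 
level, and it assumes a top-level list column with max repetition level 1. 
The elementDefLevel parameter (the smallest def level at which an element 
slot exists) is also an assumption that depends on the schema.

```java
// Hypothetical helper for illustration only; not part of parquet-java.
// Assumes a top-level repeated column whose max repetition level is 1.
final class RepLevelSizeSketch {
    private RepLevelSizeSketch() {}

    /**
     * Upper bound on the length of any single list.
     * repLevelHistogram[0] counts values that start a new record,
     * repLevelHistogram[1] counts values that continue an existing list.
     * In the worst case every continuation belongs to the same list.
     */
    static long maxPossibleListSize(long[] repLevelHistogram) {
        return 1 + repLevelHistogram[1];
    }

    /** True if no list can have more than k elements, so size(col) > k matches nothing. */
    static boolean canDropSizeGreaterThan(long[] repLevelHistogram, long k) {
        return maxPossibleListSize(repLevelHistogram) <= k;
    }

    /**
     * True if at least one record has a null or empty list.
     * Def levels below elementDefLevel mean no element slot was written.
     */
    static boolean hasNullOrEmptyList(long[] defLevelHistogram, int elementDefLevel) {
        long withoutElement = 0;
        for (int level = 0; level < elementDefLevel && level < defLevelHistogram.length; level++) {
            withoutElement += defLevelHistogram[level];
        }
        return withoutElement > 0;
    }
}
```

   With a rep level histogram of {100, 0} the bound is 1, so a size() > 1 
filter can skip the data. With {100, 5} the bound is already 6, even though 
the extra values might be spread over five lists of length 2, which is why 
the filtering power drops off quickly for longer lists.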


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


