[
https://issues.apache.org/jira/browse/PARQUET-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259224#comment-16259224
]
Gabor Szadovszky commented on PARQUET-1025:
-------------------------------------------
To implement the new statistics, we have to support the different comparison
logic required by the specs. Currently, all the primitives are comparable.
There are two possible ways to extend the API:
# Implement separate comparators to be used both internally by parquet-mr and
by the API users.
#* pros:
#** Backwards compatible (however, the statistics and filtering would be
based on a different ordering than before)
#** Keeps the primitive and logical types more loosely coupled in parquet-mr
#* cons:
#** It can be quite confusing to the API user that the provided comparators
have to be used instead of the compareTo methods of the provided primitive types
#** More parts of the API have to be modified to ensure that the proper
comparators are used and that these comparators are accessible
#** Client developers would also have to modify their code wherever primitive
comparison is used
# Extend the actual primitive implementations for the logical types so the
comparable objects would do the proper comparison by default.
#* pros:
#** Backwards compatible (however, the different Binary implementations would
have different orderings than before, but the current one is incorrect anyway)
#** The API is cleaner, as the API users can rely on the comparable primitive
types
#** The other parts of the API (e.g. filtering, statistics) can be kept
unmodified; we modify only the parts where the primitives are created
#** The client code can remain unmodified, as it can still rely on the
comparable primitive types
#* cons:
#** Proper comparison logic for UINT types will not be implemented (we cannot
override the natural ordering of the primitive Java types int and long)
#** The primitive and logical types would get more tightly coupled in parquet-mr
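
To make the trade-off concrete, here is a minimal Java sketch of what option 1 style comparators could look like. The class and constant names (OrderingSketch, UNSIGNED_INT32, UNSIGNED_BYTES) are hypothetical and are not part of the parquet-mr API or the linked PR. The byte-array comparator also assumes that the ordering problem for Binary is signed versus unsigned byte comparison, which is an assumption and is not spelled out in this comment.

```java
import java.util.Comparator;

public class OrderingSketch {
    // Hypothetical option 1 comparator: treats int values as UINT_32.
    // Not a parquet-mr class or constant.
    public static final Comparator<Integer> UNSIGNED_INT32 = Integer::compareUnsigned;

    // Hypothetical comparator for binary values: lexicographic order with
    // every byte interpreted as unsigned (0..255).
    public static final Comparator<byte[]> UNSIGNED_BYTES = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // On a common prefix, the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    };

    public static void main(String[] args) {
        Integer small = 1;
        Integer large = -1; // bit pattern 0xFFFFFFFF, i.e. 4294967295 as UINT_32

        // Natural ordering of Integer is signed and cannot be overridden,
        // which is the UINT limitation listed under option 2.
        System.out.println("signed:   1 > -1 is " + (small.compareTo(large) > 0));

        // A separate comparator can impose the unsigned ordering instead.
        System.out.println("unsigned: 1 < 0xFFFFFFFF is "
            + (UNSIGNED_INT32.compare(small, large) < 0));
    }
}
```

The sketch also shows the confusion listed under option 1: compareTo and the comparator disagree on the same pair of values, so callers have to know which of the two to use.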
Which one should we prefer? I’m also curious about the ideas and comments of the
API users (e.g. Hive, Spark, etc.).
The first option is more or less implemented. Check the linked PR for details.
I’m happy to implement the second option if it has more support.
> Support new min-max statistics in parquet-mr
> --------------------------------------------
>
> Key: PARQUET-1025
> URL: https://issues.apache.org/jira/browse/PARQUET-1025
> Project: Parquet
> Issue Type: New Feature
> Components: parquet-mr
> Reporter: Zoltan Ivanfi
> Assignee: Gabor Szadovszky
>
> Impala started using new min-max statistics that got specified as part of
> PARQUET-686. Support for these should be added to parquet-mr as well.