[jira] [Commented] (PARQUET-1025) Support new min-max statistics in parquet-mr

Gabor Szadovszky (JIRA) Mon, 20 Nov 2017 05:14:22 -0800

    [ 
https://issues.apache.org/jira/browse/PARQUET-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16259224#comment-16259224
 ]


Gabor Szadovszky commented on PARQUET-1025:
-------------------------------------------

To implement the new statistics, we have to support the different comparison 
logic that the spec requires. Currently, all the primitives are comparable. 
There are two possible options to extend the API:
# Implement separate comparators which are to be used by parquet-mr internally 
as well as by the API users.
#* pros:
#** Backwards compatible (however, the statistics and filtering would work 
based on a different ordering than before)
#** Keeps the primitive and logical types more loosely coupled in parquet-mr
#* cons:
#** It can be quite confusing to the API user that the provided comparators 
have to be used instead of the compareTo methods of the provided primitive types
#** More parts of the API have to be modified to ensure that the proper 
comparators are used and these comparators are accessible
#** The client developers also have to modify the parts of their code where 
primitive comparison is used
# Extend the actual primitive implementations for the logical types so the 
comparable objects would do the proper comparison by default.
#* pros:
#** Backwards compatible (however, the different Binary implementations would 
have different orderings than before, but the current one is incorrect anyway)
#** The API is cleaner, as the API users can rely on the comparable primitive 
types
#** The other parts of the API (e.g. filtering, statistics) can be kept 
unmodified; we modify only the parts where the primitives are created
#** The client code can remain unmodified, as it can still rely on the 
comparable primitive types
#* cons:
#** Proper comparison logic for UINT types will not be implemented (we cannot 
override the natural ordering of the primitive java types int and long)
#** The primitive and logical types would get more tightly coupled in parquet-mr
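
To make the difference between the orderings concrete, here is a minimal sketch 
of what separate comparators (the first option) could look like. The class and 
constant names are hypothetical and are not the parquet-mr API; the signed 
byte-wise comparator is included only for contrast and is an assumption about 
how a differing ordering might behave. The sketch also shows why the UINT case 
cannot be solved through natural ordering: Integer.compareTo is always signed.

```java
import java.util.Comparator;

// Illustrative only: hypothetical names, not part of parquet-mr.
final class OrderingSketch {
    private OrderingSketch() {}

    // Natural (signed) ordering, i.e. what Integer.compareTo provides.
    static final Comparator<Integer> SIGNED_INT32 = Comparator.naturalOrder();

    // Unsigned ordering for values stored in a Java int (e.g. a UINT_32 column).
    static final Comparator<Integer> UNSIGNED_INT32 = Integer::compareUnsigned;

    // Unsigned ordering for values stored in a Java long (e.g. a UINT_64 column).
    static final Comparator<Long> UNSIGNED_INT64 = Long::compareUnsigned;

    // Lexicographic comparison treating every byte as unsigned (0..255).
    static final Comparator<byte[]> UNSIGNED_LEX_BINARY = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        // All shared bytes are equal: the shorter array sorts first.
        return Integer.compare(a.length, b.length);
    };

    // Lexicographic comparison treating every byte as signed (-128..127),
    // shown for contrast with the unsigned variant above.
    static final Comparator<byte[]> SIGNED_LEX_BINARY = (a, b) -> {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Byte.compare(a[i], b[i]);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    };
}
```

With these, the int bit pattern 0xFFFFFFFF sorts below 1 under SIGNED_INT32 but 
above it under UNSIGNED_INT32, and the single byte 0x80 sorts below 0x7F under 
SIGNED_LEX_BINARY but above it under UNSIGNED_LEX_BINARY. Min/max statistics 
computed with the wrong comparator would therefore differ for such values.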

Which one shall we prefer? I’m also curious about the ideas/comments of the API 
users (e.g. Hive, Spark etc.).
The first option is more or less implemented. Check the linked PR for details. 
I’m happy to implement the second option if it has more support.

> Support new min-max statistics in parquet-mr
> --------------------------------------------
>
>                 Key: PARQUET-1025
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1025
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>            Reporter: Zoltan Ivanfi
>            Assignee: Gabor Szadovszky
>
> Impala started using new min-max statistics that got specified as part of 
> PARQUET-686. Support for these should be added to parquet-mr as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
