Tim Armstrong created PARQUET-839:
-------------------------------------

             Summary: Min-max should be computed based on logical type
                 Key: PARQUET-839
                 URL: https://issues.apache.org/jira/browse/PARQUET-839
             Project: Parquet
          Issue Type: Bug
          Components: parquet-format
    Affects Versions: format-2.3.1
            Reporter: Tim Armstrong


The min/max stats are currently underspecified: the spec does not state 
clearly what the expected ordering is.

There are some related issues, such as PARQUET-686, that fix specific problems, 
but there seems to be a general assumption that the min/max should be defined 
based on the primitive type instead of the logical type.

However, this makes the stats nearly useless for some logical types. E.g. 
consider a DECIMAL encoded into a (variable-length) BINARY. The min-max of the 
underlying binary type is based on the lexical order of the byte string, but 
that does not correspond to any reasonable ordering of the decimal values. E.g. 
256 (0x01 0x00) will be ordered between 1 (0x01) and 2 (0x02). This makes min-max 
filtering a lot less effective and would force query engines using parquet to 
implement workarounds to produce correct results (e.g. custom comparators).
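
A minimal Python sketch of the mismatch, assuming the DECIMAL's unscaled value 
is stored as a big-endian two's-complement integer in the fewest bytes possible. 
The helper name encode_unscaled and the sample values are hypothetical, chosen 
only for illustration; this is not code from any Parquet implementation.

```python
def encode_unscaled(value: int) -> bytes:
    """Encode an unscaled decimal as big-endian two's complement,
    using the smallest number of bytes that can hold it (assumed layout)."""
    length = 1
    while True:
        try:
            return value.to_bytes(length, "big", signed=True)
        except OverflowError:
            length += 1


values = [1, 2, 256, -1]

# Ordering by the primitive type: unsigned lexical order of the byte strings.
by_bytes = sorted(values, key=encode_unscaled)

# Ordering by the logical type: numeric order of the decimal values.
by_value = sorted(values)

print("byte order:   ", by_bytes)
print("numeric order:", by_value)
print("byte-based min/max:", by_bytes[0], by_bytes[-1])
print("numeric min/max:   ", by_value[0], by_value[-1])
```

With these values the byte-based statistics report min = 1 and max = -1. A 
reader that compares them numerically would skip this column chunk for a 
predicate such as x >= 3, even though the chunk contains 256.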



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
