[
https://issues.apache.org/jira/browse/PARQUET-281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574129#comment-14574129
]
Dong Chen commented on PARQUET-281:
-----------------------------------
Hi [~rdblue], as we discussed in HIVE-10254, here are some thoughts about adding
a comparator at the column level rather than in the Binary class. Could you take
a look when you have time? Thanks.
The customized comparator would be injected and used in 3 places:
* generating block statistics when writing
* filtering blocks with a predicate when reading
* filtering records with a predicate when reading
1. Writing
A {{Statistics}} instance holds the data and is compared & updated when writing a
record. It is initialized in {{ColumnWriter}} inside Parquet and is not exposed
to Hive.
To pass the comparator from Hive to Parquet, how about adding
params (like {{parquet.customized.comparator.type}} and {{p.c.c.class}}) to the
conf or WriteContext.extraMetaData? Then add a delegated comparator to
{{Statistics}}. {{Statistics}} could extract the param and instantiate the
comparator based on the data type.
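A rough sketch of the writing side, only to illustrate the idea. Everything here
is made up for illustration: {{SketchBinaryStatistics}} and
{{DecimalBytesComparator}} are hypothetical classes, a plain {{Map}} stands in
for the conf, the key is the {{.class}} param spelled out, and the byte-wise
fallback is only a stand-in for the existing comparison, not the actual
{{Binary.compareTo()}} code.

```java
import java.math.BigInteger;
import java.util.Comparator;
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for the binary Statistics class; not the real Parquet code. */
class SketchBinaryStatistics {
    // Hypothetical key, spelled out from the param name proposed above.
    static final String COMPARATOR_CLASS_KEY = "parquet.customized.comparator.class";

    private final Comparator<byte[]> comparator;
    private byte[] min;
    private byte[] max;

    SketchBinaryStatistics(Map<String, String> conf) {
        this.comparator = loadComparator(conf);
    }

    /** Instantiates the configured comparator, or falls back to byte-wise order. */
    @SuppressWarnings("unchecked")
    static Comparator<byte[]> loadComparator(Map<String, String> conf) {
        String className = conf.get(COMPARATOR_CLASS_KEY);
        if (className == null) {
            return SketchBinaryStatistics::compareBytes;
        }
        try {
            return (Comparator<byte[]>) Class.forName(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException("Cannot create comparator " + className, e);
        }
    }

    /** Unsigned lexicographic order, standing in for the default comparison. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Called once per written value; min/max follow the delegated comparator. */
    void updateStats(byte[] value) {
        if (min == null || comparator.compare(value, min) < 0) {
            min = value;
        }
        if (max == null || comparator.compare(value, max) > 0) {
            max = value;
        }
    }

    byte[] getMin() { return min; }

    byte[] getMax() { return max; }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(COMPARATOR_CLASS_KEY, DecimalBytesComparator.class.getName());
        SketchBinaryStatistics stats = new SketchBinaryStatistics(conf);
        for (long v : new long[] {256, 2, -5}) {
            stats.updateStats(BigInteger.valueOf(v).toByteArray());
        }
        System.out.println("min=" + new BigInteger(stats.getMin())
                + " max=" + new BigInteger(stats.getMax()));
    }
}

/** Hypothetical comparator Hive could supply: orders unscaled decimal bytes numerically. */
class DecimalBytesComparator implements Comparator<byte[]> {
    @Override
    public int compare(byte[] a, byte[] b) {
        // Assumes big-endian two's-complement unscaled values that share one scale.
        return new BigInteger(a).compareTo(new BigInteger(b));
    }
}
```

With the comparator configured, the values 256, 2 and -5 give min = -5 and
max = 256. Without it, the byte-wise fallback picks 256 as min and -5 as max,
which is the kind of wrong statistic described in the issue.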
2. Reading
Methods like {{FilterApi.binaryColumn}} are exposed, so we could pass the
comparator in from Hive. The {{Operators.Column}} class would then need an
attribute to store the comparator.
For filtering blocks, modify the {{visit}} methods in {{StatisticsFilter}} to
get the comparator through {{Column}} and use it if present.
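To make the block-filtering part concrete, here is a minimal sketch. The names
{{SketchStatisticsFilter}}, {{SketchColumn}}, {{canDropEq}} and {{canDropGt}}
are hypothetical and only mirror the shape of the {{visit}} methods; the
byte-wise fallback is an assumed stand-in for the existing comparison.

```java
import java.math.BigInteger;
import java.util.Comparator;

/** Simplified stand-in for StatisticsFilter; not the real Parquet code. */
final class SketchStatisticsFilter {

    /** Simplified stand-in for Operators.Column with an optional comparator. */
    static final class SketchColumn {
        final String path;
        final Comparator<byte[]> comparator; // null: fall back to byte-wise order

        SketchColumn(String path, Comparator<byte[]> comparator) {
            this.path = path;
            this.comparator = comparator;
        }
    }

    /** Unsigned lexicographic order, standing in for the default comparison. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Uses the column's comparator if one was supplied. */
    static Comparator<byte[]> comparatorOf(SketchColumn column) {
        if (column.comparator != null) {
            return column.comparator;
        }
        return SketchStatisticsFilter::compareBytes;
    }

    /** Like visit(Eq): the block can be dropped when value lies outside [min, max]. */
    static boolean canDropEq(SketchColumn column, byte[] value, byte[] min, byte[] max) {
        Comparator<byte[]> cmp = comparatorOf(column);
        return cmp.compare(value, min) < 0 || cmp.compare(value, max) > 0;
    }

    /** Like visit(Gt): the block can be dropped when max <= value. */
    static boolean canDropGt(SketchColumn column, byte[] value, byte[] min, byte[] max) {
        return comparatorOf(column).compare(max, value) <= 0;
    }

    /** Encodes a number the way an unscaled decimal value would be stored. */
    static byte[] bytes(long v) {
        return BigInteger.valueOf(v).toByteArray();
    }

    public static void main(String[] args) {
        Comparator<byte[]> decimalOrder =
                (a, b) -> new BigInteger(a).compareTo(new BigInteger(b));
        SketchColumn plain = new SketchColumn("price", null);
        SketchColumn custom = new SketchColumn("price", decimalOrder);
        // Block statistics in decimal terms: min = 2, max = 256; predicate: price == 100.
        System.out.println("plain drops block:  "
                + canDropEq(plain, bytes(100), bytes(2), bytes(256)));
        System.out.println("custom drops block: "
                + canDropEq(custom, bytes(100), bytes(2), bytes(256)));
    }
}
```

In this sketch the byte-wise comparison drops a block that may contain 100,
while the column-level comparator keeps it.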
For filtering records, modify the {{update}} methods in
{{IncrementallyUpdatedFilterPredicate.ValueInspector}} (the impl is actually in
{{IncrementallyUpdatedFilterPredicateGenerator}}) to get the comparator through
{{Column}} and use it if present.
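For the record-level part, a generated inspector for a "column < bound"
predicate might look roughly like this. {{SketchLtInspector}} is a hypothetical,
hand-written simplification of what the generator emits, working on raw byte
arrays instead of {{Binary}}, with an assumed byte-wise fallback.

```java
import java.math.BigInteger;
import java.util.Comparator;

/** Simplified stand-in for a generated ValueInspector for "column < bound". */
final class SketchLtInspector {
    private final byte[] bound;
    private final Comparator<byte[]> comparator;
    private boolean known;
    private boolean result;

    /** A null comparator means "no custom comparator on the column". */
    SketchLtInspector(byte[] bound, Comparator<byte[]> comparator) {
        this.bound = bound;
        if (comparator != null) {
            this.comparator = comparator;
        } else {
            this.comparator = SketchLtInspector::compareBytes;
        }
    }

    /** Unsigned lexicographic order, standing in for the default comparison. */
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    /** Called for each non-null value of the column in the current record. */
    void update(byte[] value) {
        result = comparator.compare(value, bound) < 0;
        known = true;
    }

    /** A null value never satisfies "column < bound". */
    void updateNull() {
        result = false;
        known = true;
    }

    /** Clears the state before the next record. */
    void reset() {
        known = false;
        result = false;
    }

    boolean isKnown() {
        return known;
    }

    boolean getResult() {
        if (!known) {
            throw new IllegalStateException("getResult() called before update()");
        }
        return result;
    }

    /** Encodes a number the way an unscaled decimal value would be stored. */
    static byte[] bytes(long v) {
        return BigInteger.valueOf(v).toByteArray();
    }

    public static void main(String[] args) {
        Comparator<byte[]> decimalOrder =
                (a, b) -> new BigInteger(a).compareTo(new BigInteger(b));
        SketchLtInspector custom = new SketchLtInspector(bytes(256), decimalOrder);
        custom.update(bytes(2));
        System.out.println("2 < 256 with custom comparator: " + custom.getResult());
    }
}
```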
How does this sound?
> Statistic and Filter need a mechanism to get customized comparator from high
> layer user
> ---------------------------------------------------------------------------------------
>
> Key: PARQUET-281
> URL: https://issues.apache.org/jira/browse/PARQUET-281
> Project: Parquet
> Issue Type: Improvement
> Reporter: Dong Chen
> Assignee: Dong Chen
>
> As discussed in HIVE-10254, we might need a customized comparator from a
> higher-layer user for generating statistics when writing and applying filters
> when reading.
> The problem is (using the Decimal type in Hive as an example):
> Decimal in Hive is mapped to Binary in Parquet. When using a predicate and
> statistics to filter values, comparing Binary values in Parquet does not
> reflect the correct ordering of the Decimal values in Hive. This type mapping
> causes 2 problems:
> 1. When writing a Decimal column, Binary.compareTo() is used to compute and set
> the column statistics (min, max). The generated statistics are not correct
> from a Decimal perspective.
> 2. When reading with a Predicate (also Filter), in which the expected Decimal
> value is converted to Binary type, Binary.compareTo() is used to compare the
> expected value with the column statistics. Both are compared from a Binary
> perspective, so the result is also not right.
> We could add an interface for a customized comparator, and a high-level user
> like Hive would provide the comparator to Parquet, since Hive knows how to
> decode the binary to Decimal and compare. Then Parquet could switch between
> the customized and the original comparison method.