[ 
https://issues.apache.org/jira/browse/DRILL-7132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16799339#comment-16799339
 ] 

Volodymyr Vysotskyi commented on DRILL-7132:
--------------------------------------------

[~rhou], the parquet metadata cache stores min/max values for varchar, 
decimal, interval, and some other types as base64-encoded strings, so they 
differ from the values displayed by parquet tools.

There is no need to store the values in the same format or encoding. The main 
requirement is that Drill handles these values from the parquet metadata 
cache files correctly, and it does.

As a side note, DRILL-4139 changed the parquet metadata cache to use base64 
encoding so that statistics for decimal and interval types are handled 
correctly.
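
For illustration, here is a minimal Python sketch (not Drill code) showing 
that the cached values quoted in this issue decode to the same strings that 
parquet tools prints. The helper name decode_cached_stat is hypothetical. The 
sketch assumes the cache uses standard base64 and that these particular 
values are ASCII; decimal statistics would generally decode to raw bytes 
rather than readable text.

```python
import base64


def decode_cached_stat(value: str) -> bytes:
    """Decode a base64 string as found in minValue/maxValue.

    Hypothetical helper for illustration; assumes standard base64.
    """
    return base64.b64decode(value, validate=True)


# Values copied from the metadata cache excerpts quoted in this issue.
varchar_min = decode_cached_stat("aW9lZ2pOSkt2bmtk")
interval_min = decode_cached_stat("UDE4NTgyRA==")

# These two sample values happen to be plain ASCII, so they print readably.
print(varchar_min.decode("ascii"))   # ioegjNJKvnkd
print(interval_min.decode("ascii"))  # P18582D
```

Both results match the "column values statistics" lines reported by parquet 
tools, which suggests the cache holds the same statistics in a different 
encoding.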

> Metadata cache does not have correct min/max values for varchar and interval 
> data types
> ---------------------------------------------------------------------------------------
>
>                 Key: DRILL-7132
>                 URL: https://issues.apache.org/jira/browse/DRILL-7132
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.14.0
>            Reporter: Robert Hou
>            Priority: Major
>             Fix For: 1.17.0
>
>         Attachments: 0_0_10.parquet
>
>
> The parquet metadata cache does not have correct min/max values for varchar 
> and interval data types.
> I have attached a parquet file.  Here is what parquet tools shows for varchar:
> [varchar_col] BINARY 14.6% of all space [PLAIN, BIT_PACKED] min: 67 max: 67 
> average: 67 total: 67 (raw data: 65 saving -3%)
>   values: min: 1 max: 1 average: 1 total: 1
>   uncompressed: min: 65 max: 65 average: 65 total: 65
>   column values statistics: min: ioegjNJKvnkd, max: ioegjNJKvnkd, num_nulls: 0
> Here is what the metadata cache file shows:
>         "name" : [ "varchar_col" ],
>         "minValue" : "aW9lZ2pOSkt2bmtk",
>         "maxValue" : "aW9lZ2pOSkt2bmtk",
>         "nulls" : 0
> Here is what parquet tools shows for interval:
> [interval_col] BINARY 11.3% of all space [PLAIN, BIT_PACKED] min: 52 max: 52 
> average: 52 total: 52 (raw data: 50 saving -4%)
>   values: min: 1 max: 1 average: 1 total: 1
>   uncompressed: min: 50 max: 50 average: 50 total: 50
>   column values statistics: min: P18582D, max: P18582D, num_nulls: 0
> Here is what the metadata cache file shows:
>         "name" : [ "interval_col" ],
>         "minValue" : "UDE4NTgyRA==",
>         "maxValue" : "UDE4NTgyRA==",
>         "nulls" : 0



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
