[jira] [Commented] (DRILL-4070) Metadata Caching : min/max values are null for varchar columns in auto partitioned data

Jason Altekruse (JIRA) Thu, 12 Nov 2015 10:22:54 -0800

    [ 
https://issues.apache.org/jira/browse/DRILL-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002576#comment-15002576
 ]


Jason Altekruse commented on DRILL-4070:
----------------------------------------

I have confirmed that we behave properly for newly written files. The filter 
currently used by parquet accepts a version number greater than 1.8. Our new 
"fork" is called 1.8.1-drill_r0; it is just a solid maven release version of 
the current 1.8.2-SNAPSHOT, the tip of parquet master. Files written with it 
are read appropriately and have their statistics respected.

I also confirmed the failure case. A set of auto-partitioned files created 
with Drill 1.2 carries a version number outside the range that the new parquet 
changes accept as valid, and this causes 1.3 to fail at pruning.

I don't think there are changes that should be made to Drill to solve this 
issue. Unfortunately, externally created files could have bad statistics 
because of the previous bug; if we made Drill behave differently, we might 
cause incorrect results over files created by other tools. I am checking 
whether there is a unique version number in the old files that we were 
writing, but it appears that they just contain "parquet-mr" with no version 
number. So unfortunately it doesn't look like we could modify parquet to 
special-case the old Drill files by looking for our older version string in 
particular. I think we need to work on a separate migration utility that 
rewrites the footers in the cases where we know the files were produced with 
Drill.
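
To illustrate the reasoning above, here is a minimal sketch of a version gate 
on the writer string. It is not the actual parquet code: the function names, 
the regular expression, the exact threshold, and the "(build abc)" suffix are 
all assumptions made for illustration. It only shows why a bare "parquet-mr" 
string cannot be told apart from other old writers.

```python
import re

# Hypothetical threshold: the comment only says "greater than 1.8",
# so the exact boundary used here is an assumption.
MIN_TRUSTED_VERSION = (1, 8, 0)

# Matches writer strings such as "parquet-mr version 1.8.1-drill_r0 (build abc)".
_CREATED_BY = re.compile(r"^parquet-mr version (\d+)\.(\d+)\.(\d+)")


def parse_version(created_by):
    """Return (major, minor, patch), or None when no version is present."""
    match = _CREATED_BY.match(created_by or "")
    if match is None:
        return None
    return tuple(int(part) for part in match.groups())


def statistics_trusted(created_by):
    """Trust min/max statistics only for a known, new-enough writer."""
    version = parse_version(created_by)
    return version is not None and version >= MIN_TRUSTED_VERSION
```

Under these assumptions, "parquet-mr version 1.8.1-drill_r0 (build abc)" is 
trusted, while a bare "parquet-mr" is rejected because there is nothing to 
parse. An old Drill file and a file from any other old tool look identical to 
this check, which is why rewriting the footers seems like the safer route.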

> Metadata Caching : min/max values are null for varchar columns in auto 
> partitioned data
> ---------------------------------------------------------------------------------------
>
>                 Key: DRILL-4070
>                 URL: https://issues.apache.org/jira/browse/DRILL-4070
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.3.0
>            Reporter: Rahul Challapalli
>            Priority: Critical
>         Attachments: cache.txt, fewtypes_varcharpartition.tar.tgz
>
>
> git.commit.id.abbrev=e78e286
> The metadata cache file created contains incorrect values for the min/max 
> fields of varchar columns. The data is also partitioned on the varchar column.
> {code}
> refresh table metadata fewtypes_varcharpartition;
> {code}
> As a result, partition pruning is not happening. This was working after 
> DRILL-3937 was fixed (d331330efd27dbb8922024c4a18c11e76a00016b).
> I attached the data set and the cache file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
