[jira] [Commented] (DRILL-4070) Metadata Caching : min/max values are null for varchar columns in auto partitioned data

Jason Altekruse (JIRA) Thu, 12 Nov 2015 11:49:41 -0800

    [ https://issues.apache.org/jira/browse/DRILL-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15002771#comment-15002771 ]


Jason Altekruse commented on DRILL-4070:
----------------------------------------

[~zelaine] The answer to your question is something along the lines of "maybe 
yes". It looks like what we were writing before was just the string 
"parquet-mr" as the version, lacking a number or any Drill-specific 
information. While I believe that files written by other tools should 
pretty consistently contain parquet-mr X.X.X, with a proper version number, 
this field had not been used as a means of changing behavior before this recent 
bug. You can see in the commit I linked above that new code was added for 
"VersionParser", along with the corresponding bits that disable statistics for 
this particular case when 1.8 cannot be found in the version number.

While it is possible that looking for the version "parquet-mr" exactly may 
provide the behavior we need to read the old Drill files, it seems a bit risky 
to make this the default behavior. I would be more inclined to advocate for 
this change if we had included the word "drill" in the version string 
previously, in which case the chances of collisions with other tools would be 
nearly non-existent.

The one thing I can say is that 1.2 did have the incorrect results issue that 
we would see by turning on this compatibility behavior. As Jacques mentioned on 
the vote thread, we could have a flag that turns this on. I would advocate that 
we make sure to have a loud warning that it may produce incorrect results with 
externally created files, and that users are strongly encouraged to use the 
migration tool to rewrite the metadata on their existing parquet files and stop 
using the switch as soon as possible.
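
To make the trade-off concrete, here is a rough sketch of the decision being 
discussed. It is not the actual parquet-mr or Drill code: the class name, the 
method name, the regex and the legacyDrillCompat flag are all hypothetical, and 
"1.8 cannot be found" is interpreted here as "version lower than 1.8", which is 
an assumption on my part.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch only; not the real VersionParser or Drill reader code.
final class StatsCompatSketch {
    // Assumed shape of a well-formed version string, e.g.
    // "parquet-mr version 1.8.1 (build abc)". Only major.minor is inspected.
    private static final Pattern VERSION =
        Pattern.compile("\\bversion\\s+(\\d{1,4})\\.(\\d{1,4})");

    /**
     * Decides whether min/max statistics should be used for a file.
     *
     * @param createdBy         version string read from the file footer (may be null)
     * @param legacyDrillCompat hypothetical opt-in switch for old Drill-written files
     */
    static boolean useStats(String createdBy, boolean legacyDrillCompat) {
        if (createdBy == null) {
            return false; // no version information: do not trust statistics
        }
        String s = createdBy.trim();
        if (s.equals("parquet-mr")) {
            // Bare string with no version number, as old Drill files had.
            // Other tools could in principle write the same string, so this
            // path is only taken when the user explicitly opts in.
            if (legacyDrillCompat) {
                System.err.println("WARNING: trusting statistics for files "
                    + "with version \"parquet-mr\"; results may be incorrect "
                    + "for externally created files.");
                return true;
            }
            return false;
        }
        Matcher m = VERSION.matcher(s);
        if (!m.find()) {
            return false; // unparseable version: be conservative
        }
        int major = Integer.parseInt(m.group(1));
        int minor = Integer.parseInt(m.group(2));
        return major > 1 || (major == 1 && minor >= 8);
    }
}
```

With the flag off, a bare "parquet-mr" string is treated like any other 
unversioned file and its statistics are ignored; with the flag on, the 
statistics are used and the warning is printed on every such file.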

> Metadata Caching : min/max values are null for varchar columns in auto 
> partitioned data
> ---------------------------------------------------------------------------------------
>
>                 Key: DRILL-4070
>                 URL: https://issues.apache.org/jira/browse/DRILL-4070
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.3.0
>            Reporter: Rahul Challapalli
>            Priority: Critical
>         Attachments: cache.txt, fewtypes_varcharpartition.tar.tgz
>
>
> git.commit.id.abbrev=e78e286
> The metadata cache file created contains incorrect values for min/max fields 
> for varchar columns. The data is also partitioned on the varchar column.
> {code}
> refresh table metadata fewtypes_varcharpartition;
> {code}
> As a result, partition pruning is not happening. This was working after 
> DRILL-3937 had been fixed (d331330efd27dbb8922024c4a18c11e76a00016b).
> I attached the data set and the cache file.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
