[ 
https://issues.apache.org/jira/browse/DRILL-4070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jacques Nadeau updated DRILL-4070:
----------------------------------
    Description: 
Drill uses the parquet-mr library to write Parquet files. The metadata 
signature that Drill produced in 1.2 and earlier versions of Drill is 
indistinguishable from older footers written by other tools (such as Pig and 
Hive). There was a known bug when those tools wrote metadata that caused the 
statistics to be incorrect. To correct this, the parquet-mr library adopted a 
behavior of ignoring statistics from the old form of the Parquet footer. 
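
As a rough illustration of the kind of check described above, the sketch below decides whether to trust footer statistics from the writer signature alone. It is not parquet-mr's actual code: the function name, the signature format, and the 1.8.0 threshold are assumptions made for this example.

```python
import re

# Assumed threshold: first writer release taken to produce correct statistics.
FIRST_TRUSTED_VERSION = (1, 8, 0)

# Assumed signature shape: "<application> version <major>.<minor>.<patch>..."
_SIGNATURE = re.compile(r"^(?P<app>\S+) version (?P<ver>\d+\.\d+\.\d+)")


def should_ignore_statistics(created_by):
    """Return True when the footer's min/max statistics should not be trusted."""
    if not created_by:
        return True  # no signature at all: assume the old footer form
    match = _SIGNATURE.match(created_by)
    if match is None:
        return True  # unparseable signature: stay on the safe side
    version = tuple(int(part) for part in match.group("ver").split("."))
    return version < FIRST_TRUSTED_VERSION


# Two files with the same old signature get the same answer, whether or not
# the statistics inside them happen to be correct.
for signature in ("parquet-mr version 1.6.0", "parquet-mr version 1.8.1", None):
    print(signature, "->", should_ignore_statistics(signature))
```

Because files from Drill 1.2 and earlier carry the same signature as files from the affected tools, a check of this shape cannot tell them apart and discards both sets of statistics.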

With 1.3, Drill upgraded to the latest version of parquet-mr and has now 
started ignoring these statistics as well. This ensures correct results but 
produces performance regressions (compared to Drill 1.1 and 1.2) when querying 
partitioned Parquet files generated in Drill 1.1 and 1.2. 
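
To make the performance impact concrete, here is a minimal sketch of min/max pruning over per-file statistics. The `FileStats` type, the `files_to_scan` helper, and the file names are hypothetical and do not reflect Drill's planner code. A file whose statistics were ignored is modeled with `None` bounds.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FileStats:
    path: str
    min_value: Optional[str]  # None models statistics the reader ignored
    max_value: Optional[str]


def files_to_scan(files: List[FileStats], wanted: str) -> List[str]:
    """Keep only the files whose [min, max] range may contain `wanted`."""
    selected = []
    for f in files:
        if f.min_value is None or f.max_value is None:
            selected.append(f.path)  # no usable statistics: cannot prune
        elif f.min_value <= wanted <= f.max_value:
            selected.append(f.path)
    return selected


# Data partitioned on a varchar column: each file holds a single value.
with_stats = [
    FileStats("a.parquet", "CA", "CA"),
    FileStats("b.parquet", "NY", "NY"),
    FileStats("c.parquet", "TX", "TX"),
]
# The same files once their statistics are ignored.
without_stats = [FileStats(f.path, None, None) for f in with_stats]

print(files_to_scan(with_stats, "NY"))
print(files_to_scan(without_stats, "NY"))
```

With usable statistics only one file is read; once the statistics are ignored every file must be scanned, which is the regression described above.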

  was:
git.commit.id.abbrev=e78e286

The metadata cache file that is created contains incorrect min/max values 
for varchar columns. The data is also partitioned on the varchar column.

{code}
refresh table metadata fewtypes_varcharpartition;
{code}

As a result, partition pruning is not happening. This was working after 
DRILL-3937 was fixed (d331330efd27dbb8922024c4a18c11e76a00016b).

I attached the data set and the cache file.




> Files written with versions of Drill before v1.3 record metadata that is 
> indistinguishable from bad metadata from other Parquet creators
> ----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: DRILL-4070
>                 URL: https://issues.apache.org/jira/browse/DRILL-4070
>             Project: Apache Drill
>          Issue Type: Bug
>          Components: Metadata
>    Affects Versions: 1.3.0
>            Reporter: Rahul Challapalli
>            Assignee: Parth Chandra
>            Priority: Blocker
>             Fix For: 1.3.0
>
>         Attachments: cache.txt, fewtypes_varcharpartition.tar.tgz
>
>
> Drill uses the parquet-mr library to write Parquet files. The metadata 
> signature that Drill produced in 1.2 and earlier versions of Drill is 
> indistinguishable from older footers written by other tools (such as Pig and 
> Hive). There was a known bug when those tools wrote metadata that caused the 
> statistics to be incorrect. To correct this, the parquet-mr library adopted a 
> behavior of ignoring statistics from the old form of the Parquet footer. 
> With 1.3, Drill upgraded to the latest version of parquet-mr and has now 
> started ignoring these statistics as well. This ensures correct results but 
> produces performance regressions (compared to Drill 1.1 and 1.2) when querying 
> partitioned Parquet files generated in Drill 1.1 and 1.2. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
