[ 
https://issues.apache.org/jira/browse/HIVE-21327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marta Kuczora updated HIVE-21327:
---------------------------------
    Description: 
The Parquet FilterPredicate is created and set in the configuration by the 
ParquetRecordReaderBase.setFilter method. This method is called from the 
ParquetRecordReaderWrapper constructor through the 
ParquetRecordReaderBase.getSplit method and expects a JobConf parameter, 
on which it sets the created filter predicate. In the ParquetRecordReaderWrapper 
constructor, several JobConf objects are used:

{noformat}
    jobConf = oldJobConf;
    final ParquetInputSplit split = getSplit(oldSplit, jobConf);

    TaskAttemptID taskAttemptID = TaskAttemptID.forName(jobConf.get(IOConstants.MAPRED_TASK_ID));
    if (taskAttemptID == null) {
      taskAttemptID = new TaskAttemptID();
    }

    // create a TaskInputOutputContext
    Configuration conf = jobConf;
    if (skipTimestampConversion ^ HiveConf.getBoolVar(
        conf, HiveConf.ConfVars.HIVE_PARQUET_TIMESTAMP_SKIP_CONVERSION)) {
      conf = new JobConf(oldJobConf);
      HiveConf.setBoolVar(conf,
        HiveConf.ConfVars.HIVE_PARQUET_TIMESTAMP_SKIP_CONVERSION, skipTimestampConversion);
    }

    final TaskAttemptContext taskContext = ContextUtil.newTaskAttemptContext(conf, taskAttemptID);
{noformat}
So there are three objects in play: jobConf, oldJobConf and conf. getSplit is 
called with jobConf, so the filter predicate is set on that object. Reading 
this code alone, jobConf and oldJobConf appear to be the same reference inside 
the if statement, so the newly created conf should also contain the filter 
predicate. However, inside getSplit the value of jobConf is replaced by the 
projectionPusher.pushProjectionsAndFilters call, so by the time the if 
statement runs, jobConf and oldJobConf are actually different references. The 
filter predicate is set on jobConf, but when the if condition is true, conf is 
created from oldJobConf and therefore does not contain the filter predicate.

For reference, this behavior was introduced in 
[HIVE-9873|https://issues.apache.org/jira/browse/HIVE-9873]. 
Since the only goal of the if statement is to update the 
HIVE_PARQUET_TIMESTAMP_SKIP_CONVERSION property in the configuration, conf 
should be created from jobConf, where the filter predicate is correctly set.
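
The aliasing problem described above can be reproduced in isolation. The following is a minimal sketch, not the Hive code: Conf is a hypothetical stand-in for JobConf, FILTER_KEY is a made-up property name, and getSplit returns the new configuration instead of reassigning a field. The copyFromOld flag switches between the current behavior and the proposed one.

```java
import java.util.HashMap;
import java.util.Map;

public class FilterConfSketch {

    /** Hypothetical stand-in for JobConf: a mutable property bag with a copy constructor. */
    static final class Conf {
        private final Map<String, String> props = new HashMap<>();

        Conf() {}

        Conf(Conf other) {
            props.putAll(other.props);
        }

        void set(String key, String value) {
            props.put(key, value);
        }

        String get(String key) {
            return props.get(key);
        }
    }

    // Made-up key for this sketch; not the property Parquet actually uses.
    static final String FILTER_KEY = "sketch.filter.predicate";
    static final String SKIP_KEY = "hive.parquet.timestamp.skip.conversion";

    /** Mimics getSplit: pushdown yields a new conf object, and the filter is set on it. */
    static Conf getSplit(Conf in) {
        Conf pushed = new Conf(in);
        pushed.set(FILTER_KEY, "eq(id, 1)");
        return pushed;
    }

    /**
     * Mimics the constructor logic. With copyFromOld = true the new conf is
     * copied from oldJobConf (reported behavior); with false it is copied
     * from jobConf (proposed behavior).
     */
    static Conf buildTaskConf(Conf oldJobConf, boolean skipTimestampConversion, boolean copyFromOld) {
        Conf jobConf = getSplit(oldJobConf); // jobConf no longer aliases oldJobConf
        Conf conf = jobConf;
        boolean current = Boolean.parseBoolean(conf.get(SKIP_KEY));
        if (skipTimestampConversion ^ current) {
            conf = new Conf(copyFromOld ? oldJobConf : jobConf);
            conf.set(SKIP_KEY, Boolean.toString(skipTimestampConversion));
        }
        return conf;
    }

    public static void main(String[] args) {
        Conf buggy = buildTaskConf(new Conf(), true, true);
        Conf fixed = buildTaskConf(new Conf(), true, false);
        System.out.println("copied from oldJobConf: " + buggy.get(FILTER_KEY));
        System.out.println("copied from jobConf:    " + fixed.get(FILTER_KEY));
    }
}
```

In this sketch the filter is lost only when the flag differs from the configured value and the copy is taken from oldJobConf. When the if branch is skipped, conf is jobConf itself and the filter is present.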

> Predicate is not pushed to Parquet if 
> hive.parquet.timestamp.skip.conversion=true
> ---------------------------------------------------------------------------------
>
>                 Key: HIVE-21327
>                 URL: https://issues.apache.org/jira/browse/HIVE-21327
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 4.0.0
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major



