William Slacum created HIVE-11545:
-------------------------------------

             Summary: HiveInputFormat#pushProjectionsAndFilters modifies JobConf, which is ignored
                 Key: HIVE-11545
                 URL: https://issues.apache.org/jira/browse/HIVE-11545
             Project: Hive
          Issue Type: Bug
          Components: Tez
    Affects Versions: 0.14.0
         Environment: HDP 2.2.4.2
            Reporter: William Slacum


I was debugging an issue where Tez was running a simple query of the form 
{{select count(*) from table where x = y}}. {{x}} is a partition column, stats 
and CBO were disabled, and the table was backed by RCFiles with 
{{ColumnarSerDe}}.

With MapReduce, when the {{MapOperator}} goes to create its {{ColumnarSerDe}} 
instances, {{ColumnProjectionUtils#isReadAllColumns}} returns {{false}}. As a 
result, the operation does not attempt to parse any of the data in the table 
and just counts it. With Tez, {{ColumnProjectionUtils#isReadAllColumns}} 
returns {{true}}, which causes the data to be deserialized. In combination 
with HIVE-11544, this was causing my queries using Tez to run over 100x slower 
than with MR.

I finally traced this through and found that the issue lies in 
{{HiveInputFormat#pushProjectionsAndFilters}}, which is called when the 
{{RecordReader}} is created. When the execution engine is MapReduce, the 
{{JobConf}} that {{HiveInputFormat}} sees, and subsequently modifies, is the 
same instance the {{MapOperator}} sees. So by the time the {{ColumnarSerDe}} 
instance is created, {{ColumnProjectionUtils#isReadAllColumns}} returns 
{{false}}.

Under Tez, {{HiveInputFormat}} sees a copy of the original {{JobConf}}, so its 
modifications are lost. The {{JobConf}}/{{Configuration}} that the 
{{MapOperator}} sees doesn't have {{hive.io.file.read.all.columns}} set, so 
{{ColumnProjectionUtils#isReadAllColumns}} defaults to {{true}}.
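
To make the difference concrete, here is a minimal sketch of the two cases. It is a toy model, not Hive or Hadoop code: a plain {{HashMap}} stands in for {{JobConf}}, and the class and method names ({{ConfCopyDemo}}, {{pushProjections}}, {{operatorViewWhenShared}}, {{operatorViewWhenCopied}}) are made up for illustration. Only the property name and the default-to-{{true}} behavior come from the description above.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy model of the shared-vs-copied configuration behavior; not Hive code. */
public class ConfCopyDemo {
    // Property name as given in the report.
    static final String READ_ALL_COLUMNS = "hive.io.file.read.all.columns";

    // Stand-in for the check described above: an unset key means "read all columns".
    static boolean isReadAllColumns(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(READ_ALL_COLUMNS, "true"));
    }

    // Hypothetical stand-in for the input format mutating the conf it was handed.
    static void pushProjections(Map<String, String> conf) {
        conf.put(READ_ALL_COLUMNS, "false");
    }

    // MapReduce-like case: input format and operator share one instance.
    static boolean operatorViewWhenShared() {
        Map<String, String> jobConf = new HashMap<>();
        pushProjections(jobConf);          // mutation lands on the shared instance
        return isReadAllColumns(jobConf);  // operator sees the change
    }

    // Tez-like case: the input format is handed a copy of the conf.
    static boolean operatorViewWhenCopied() {
        Map<String, String> jobConf = new HashMap<>();
        Map<String, String> copy = new HashMap<>(jobConf);
        pushProjections(copy);             // mutation lands on the copy only
        return isReadAllColumns(jobConf);  // operator still sees the default
    }

    public static void main(String[] args) {
        System.out.println("shared instance: readAllColumns=" + operatorViewWhenShared());
        System.out.println("copied instance: readAllColumns=" + operatorViewWhenCopied());
    }
}
```

Running {{main}} prints {{readAllColumns=false}} for the shared instance and {{readAllColumns=true}} for the copied one, which mirrors the MR and Tez behavior I observed.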

As for what the fix should be, I don't really know. I debated filing this as a 
Tez issue. I'm generally of the opinion that passing around mutable state 
leads to problems, but MR has been allowing that for a while now. Maybe the 
client can force {{hive.io.file.read.all.columns}} to {{false}}, but I don't 
know how that would work with multiple input splits of different types.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
