William Slacum created HIVE-11545:
-------------------------------------

             Summary: HiveInputFormat#pushProjectionsAndFilters modifies JobConf, which is ignored
                 Key: HIVE-11545
                 URL: https://issues.apache.org/jira/browse/HIVE-11545
             Project: Hive
          Issue Type: Bug
          Components: Tez
    Affects Versions: 0.14.0
         Environment: HDP 2.2.4.2
            Reporter: William Slacum

I was debugging an issue where Tez was running a simple query of the form {{select count(*) from table where x = y}}. {{x}} is a partition column, stats and cbo were disabled, and the table was backed by RCFiles with {{ColumnarSerDe}}.

With MapReduce, when the {{MapOperator}} goes to create its {{ColumnarSerDe}} instances, {{ColumnProjectUtils#isReadAllColumns}} returns {{false}}. As a result, the operation does not attempt to parse any of the data in the table and simply counts the rows.

With Tez, {{ColumnProjectUtils#isReadAllColumns}} returns {{true}}, which causes the data to be deserialized. In combination with HIVE-11544, this was causing my queries using Tez to run over 100x slower than with MR.

I finally traced this through and found that the issue lies in {{HiveInputFormat#pushProjectionsAndFilters}}, which is called when the {{RecordReader}} is created. Under MapReduce, the {{JobConf}} that {{HiveInputFormat}} sees, and subsequently modifies, is the same instance the {{MapOperator}} sees. By the time the {{ColumnarSerDe}} instance is created, {{ColumnProjectUtils#isReadAllColumns}} returns {{false}}.

Under Tez, {{HiveInputFormat}} sees a copy of the original {{JobConf}}, so its modifications are lost. The {{JobConf}}/{{Configuration}} that the {{MapOperator}} sees doesn't have {{hive.io.file.read.all.columns}} set, so {{ColumnProjectUtils#isReadAllColumns}} defaults to {{true}}.

As for what the fix should be, I don't really know. I debated filing this as a Tez issue. I'm generally of the opinion that passing around mutable state leads to problems, but MR has been allowing it for a while now. Maybe the client can force {{hive.io.file.read.all.columns}} to {{false}}, but I don't know how that would work with multiple input splits of different types.
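
To illustrate the difference between the two engines described above, here is a minimal sketch that stands in for the real classes. It does not use Hive, Hadoop, or Tez code: a plain {{Map}} plays the role of the {{JobConf}}, and {{ConfCopyDemo}}, {{pushProjections}}, {{sharedInstance}}, and {{copiedInstance}} are hypothetical names made up for this example. Only the property name {{hive.io.file.read.all.columns}} and its default of {{true}} are taken from the description.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the behavior described in this issue.
// A plain Map plays the role of the JobConf; no Hive or Hadoop classes are used.
class ConfCopyDemo {
    static final String READ_ALL_COLUMNS = "hive.io.file.read.all.columns";

    // Stands in for the input format pushing projections into the configuration.
    static void pushProjections(Map<String, String> conf) {
        conf.put(READ_ALL_COLUMNS, "false");
    }

    // Stands in for the check made when the SerDe is created:
    // an unset property is treated as true.
    static boolean isReadAllColumns(Map<String, String> conf) {
        return Boolean.parseBoolean(conf.getOrDefault(READ_ALL_COLUMNS, "true"));
    }

    // MapReduce-like case: both sides share one configuration instance,
    // so the operator observes the input format's modification.
    static boolean sharedInstance() {
        Map<String, String> jobConf = new HashMap<>();
        pushProjections(jobConf);
        return isReadAllColumns(jobConf);
    }

    // Tez-like case: the input format modifies a copy,
    // so the operator still sees the property as unset.
    static boolean copiedInstance() {
        Map<String, String> jobConf = new HashMap<>();
        Map<String, String> copy = new HashMap<>(jobConf);
        pushProjections(copy);
        return isReadAllColumns(jobConf);
    }

    public static void main(String[] args) {
        System.out.println("shared instance: readAllColumns=" + sharedInstance());
        System.out.println("copied instance: readAllColumns=" + copiedInstance());
    }
}
```

In the shared case the check sees the pushed value, while in the copied case it falls back to the default, which matches the symptom reported for Tez.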