[
https://issues.apache.org/jira/browse/HIVE-21709?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17130776#comment-17130776
]
David Mollitor commented on HIVE-21709:
---------------------------------------
Still interested in working on this?
Can you please create a PR against master?
> Count with expression does not work in Parquet
> ----------------------------------------------
>
> Key: HIVE-21709
> URL: https://issues.apache.org/jira/browse/HIVE-21709
> Project: Hive
> Issue Type: Bug
> Affects Versions: 2.3.2
> Reporter: Mainak Ghosh
> Priority: Major
> Labels: pull-request-available
> Time Spent: 20m
> Remaining Estimate: 0h
>
> For a Parquet file with a nested schema, count with an expression as the
> column name (for example, count(rtb_win.impression_id)) returns the wrong
> result when the query also filters on another column in the same struct.
> Here are the steps to reproduce:
> {code:java}
> CREATE TABLE `test_table`( `rtb_win` struct<`impression_id`:string,
> `pub_id`:string>) ROW FORMAT SERDE
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> INSERT INTO TABLE test_table SELECT named_struct('impression_id', 'cat',
> 'pub_id', '2');
> select count(rtb_win.impression_id) from test_table where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
> future versions. Consider using a different execution engine (i.e. spark,
> tez) or using Hive 1.X releases.
> +------+
> | _c0 |
> +------+
> | 0 |
> +------+
> select count(*) from test_table where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the
> future versions. Consider using a different execution engine (i.e. spark,
> tez) or using Hive 1.X releases.
> +------+
> | _c0 |
> +------+
> | 1 |
> +------+{code}
> As you can see, the first query returns the wrong result (0) while the second
> one returns the correct result (1).
> The issue is a column-order mismatch between the actual Parquet file
> (impression_id first and pub_id second) and the Hive prunedCols data structure
> (the reverse order). As a result, the filter compares against the wrong value
> and the count returns 0. I have been able to identify the cause of this mismatch.
> I would love to get the code reviewed and merged. Some of the code changes
> are changes to commits from Ferdinand Xu and Chao Sun.
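
The mismatch described above can be illustrated with a toy model. This is only a sketch and not Hive's actual code: the `count_where` helper and the two list names are hypothetical, and the only details taken from the report are the field names, the stored values, and the claim that the pruned column list is in the reverse order of the file.

```python
# Toy model of the reported symptom; NOT Hive's implementation.
# Assumption: struct fields are resolved by position in a column list.

file_order = ["impression_id", "pub_id"]     # order in the Parquet file
pruned_order = ["pub_id", "impression_id"]   # reversed order, as reported for prunedCols

# One row, stored in file order: impression_id='cat', pub_id='2'.
rows = [["cat", "2"]]


def count_where(rows, columns, filter_col, filter_val, count_col):
    """Count rows where filter_col == filter_val and count_col is not NULL,
    looking both fields up by their position in `columns`."""
    f = columns.index(filter_col)
    c = columns.index(count_col)
    return sum(1 for r in rows if r[f] == filter_val and r[c] is not None)


# With the file's own order, pub_id resolves to '2' and the row matches.
correct = count_where(rows, file_order, "pub_id", "2", "impression_id")

# With the reversed order, pub_id resolves to 'cat', so the filter fails.
buggy = count_where(rows, pruned_order, "pub_id", "2", "impression_id")

print(correct, buggy)
```

In this model the reversed list makes the filter compare 'cat' against '2', which is consistent with the count of 0 from the first query in the report.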