[ 
https://issues.apache.org/jira/browse/HIVE-21709?focusedWorklogId=446410&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-446410
 ]

ASF GitHub Bot logged work on HIVE-21709:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 16/Jun/20 10:54
            Start Date: 16/Jun/20 10:54
    Worklog Time Spent: 10m 
      Work Description: github-actions[bot] closed pull request #631:
URL: https://github.com/apache/hive/pull/631

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


Issue Time Tracking
-------------------

    Worklog Id:     (was: 446410)
    Time Spent: 0.5h  (was: 20m)

> Count with expression does not work in Parquet
> ----------------------------------------------
>
>                 Key: HIVE-21709
>                 URL: https://issues.apache.org/jira/browse/HIVE-21709
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 2.3.2
>            Reporter: Mainak Ghosh
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> For a Parquet file with a nested schema, count() with an expression as its 
> argument returns the wrong result when the query also filters on another 
> column in the same struct. Here are the steps to reproduce:
> {code:java}
> CREATE TABLE `test_table`( `rtb_win` struct<`impression_id`:string, 
> `pub_id`:string>) ROW FORMAT SERDE 
> 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe' STORED AS 
> INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat' 
> OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat';
> INSERT INTO TABLE test_table SELECT named_struct('impression_id', 'cat', 
> 'pub_id', '2');
> select count(rtb_win.impression_id) from test_table where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases.
> +------+ 
> | _c0  |
> +------+ 
> | 0    | 
> +------+
> select count(*) from test_table where rtb_win.pub_id ='2';
> WARNING: Hive-on-MR is deprecated in Hive 2 and may not be available in the 
> future versions. Consider using a different execution engine (i.e. spark, 
> tez) or using Hive 1.X releases. 
> +------+ 
> | _c0  | 
> +------+ 
> | 1    | 
> +------+{code}
> As you can see, the first query returns the wrong result (0) while the second 
> one returns the correct result (1).
> The issue is a column order mismatch between the actual Parquet file 
> (impression_id first and pub_id second) and the Hive prunedCols data structure 
> (the reverse order). As a result, the filter compares against the wrong value 
> and the count returns 0. I have been able to identify the cause of this mismatch.
> I would love to get the code reviewed and merged. Some of the code changes 
> modify commits from Ferdinand Xu and Chao Sun.
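
The mismatch described above can be illustrated with a small standalone sketch. This is not Hive code: the row model, `read_by_position`, `read_by_name`, and `count_matches` are hypothetical names introduced only for illustration, and the reversed order is taken from the report's description of prunedCols. It shows how a positional lookup against a reversed column list makes the filter compare `pub_id` against the `impression_id` value, so the count comes out as 0.

```python
# Illustrative model only; none of these names come from the Hive code base.
FILE_ORDER = ["impression_id", "pub_id"]    # field order in the Parquet file (per the report)
PRUNED_ORDER = ["pub_id", "impression_id"]  # reversed order, as the report describes for prunedCols

row = ["cat", "2"]  # struct values stored in file order


def read_by_position(row, column_order, name):
    """Look up a field by its index in column_order; wrong if that order differs from the file."""
    return row[column_order.index(name)]


def read_by_name(row, file_order, name):
    """Look up a field by name against the file's own field order."""
    return dict(zip(file_order, row))[name]


def count_matches(rows, reader, order):
    """Count rows with a non-null impression_id where pub_id == '2'."""
    return sum(
        1
        for r in rows
        if reader(r, order, "pub_id") == "2"
        and reader(r, order, "impression_id") is not None
    )


wrong = count_matches([row], read_by_position, PRUNED_ORDER)  # filter sees 'cat' as pub_id
right = count_matches([row], read_by_name, FILE_ORDER)        # filter sees '2' as pub_id
print(wrong, right)
```

With the reversed order the filter reads 'cat' where it expects pub_id, which matches the 0 returned by the first query in the report; resolving fields against the file's order gives the expected count of 1.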



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
