[jira] [Commented] (HIVE-4113) Optimize select count(1) with RCFile and Orc

Yin Huai (JIRA) Wed, 18 Sep 2013 19:23:04 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-4113?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13771530#comment-13771530
 ]


Yin Huai commented on HIVE-4113:
--------------------------------

Three issues:
# ColumnProjectionUtils.READ_COLUMN_NAMES_CONF_STR is only set by Hive; HCatalog 
does not appear to set it. So when ORC is accessed through HCatalog, it seems 
we cannot do predicate pushdown.
# neededColumnIDs in TableScanOperator can be null when column pruning is 
disabled. In that case, it seems we can hit an NPE in 
ColumnAccessAnalyzer.analyzeColumnAccess. For the same reason, we cannot do 
predicate pushdown in Hive when column pruning is disabled.
# With this change, an empty neededColumnIDs means that no column is needed. 
Selecting all columns is represented either by 
ColumnProjectionUtils.READ_ALL_COLUMNS=true or by a READ_COLUMN_IDS_CONF_STR 
that lists every column.
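
To make the semantics in the third point concrete, here is a minimal sketch of 
how a reader could derive its skip array from the two settings. It is not the 
code in the patch: the class ColumnSkipSketch and the method computeSkipped are 
hypothetical names, and the only thing taken from the discussion is the rule 
that an empty ID list means "read nothing" while the read-all flag means "read 
everything".

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class ColumnSkipSketch {

    /**
     * Hypothetical helper (not part of Hive): returns skipped[i] == true
     * for every column the reader does not need to load.
     */
    public static boolean[] computeSkipped(int numColumns,
                                           boolean readAllColumns,
                                           List<Integer> neededColumnIds) {
        boolean[] skipped = new boolean[numColumns];
        if (readAllColumns) {
            // e.g. select * from tt: nothing is skipped (array is all false).
            return skipped;
        }
        // Start by skipping everything, then un-skip the needed columns.
        // With an empty ID list (e.g. select count(1) from tt) every
        // column stays skipped.
        Arrays.fill(skipped, true);
        for (int id : neededColumnIds) {
            if (id >= 0 && id < numColumns) {
                skipped[id] = false;
            }
        }
        return skipped;
    }

    public static void main(String[] args) {
        List<Integer> none = Collections.<Integer>emptyList();
        // count(1): no column needed
        System.out.println(Arrays.toString(computeSkipped(3, false, none)));
        // count(col0): only column 0 needed
        System.out.println(Arrays.toString(computeSkipped(3, false, Arrays.asList(0))));
        // select *: read-all flag set
        System.out.println(Arrays.toString(computeSkipped(3, true, none)));
    }
}
```

The point of the sketch is that the empty list and the read-all flag are two 
distinct states, which is what lets the reader tell count(1) apart from 
select *.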

I will make two changes:
# Remove the column pruning flag.
# Set READ_ALL_COLUMNS_DEFAULT to true, so that all columns are selected for 
HCatalog users who do not use ColumnProjectionUtils. If READ_ALL_COLUMNS_DEFAULT 
were false, users would have to use ColumnProjectionUtils; otherwise, no column 
would be selected.
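
As a rough illustration of why the default matters, the sketch below models 
the configuration as a plain Map. The key strings and the class 
ReadAllColumnsDefaultSketch are stand-ins invented for this example; they are 
not the real Hive property names or the real ColumnProjectionUtils methods.

```java
import java.util.HashMap;
import java.util.Map;

public class ReadAllColumnsDefaultSketch {

    // Stand-in keys for illustration only (assumed, not Hive's actual keys).
    static final String READ_ALL_COLUMNS = "sketch.read.all.columns";
    static final String READ_COLUMN_IDS = "sketch.read.column.ids";

    // The proposed default: a caller that sets nothing gets all columns.
    static final boolean READ_ALL_COLUMNS_DEFAULT = true;

    /** Falls back to the default when the flag was never set. */
    public static boolean readAllColumns(Map<String, String> conf) {
        String value = conf.get(READ_ALL_COLUMNS);
        return value == null ? READ_ALL_COLUMNS_DEFAULT : Boolean.parseBoolean(value);
    }

    /**
     * A caller that wants a projection turns the read-all flag off and
     * supplies the ID list; an empty string then means "no columns".
     */
    public static void setReadColumnIds(Map<String, String> conf, String commaSeparatedIds) {
        conf.put(READ_ALL_COLUMNS, "false");
        conf.put(READ_COLUMN_IDS, commaSeparatedIds);
    }

    public static void main(String[] args) {
        Map<String, String> untouched = new HashMap<String, String>();
        System.out.println(readAllColumns(untouched)); // true

        Map<String, String> projected = new HashMap<String, String>();
        setReadColumnIds(projected, "0,2");
        System.out.println(readAllColumns(projected)); // false
    }
}
```

With a default of false, the first case (a client that never touches the 
projection settings) would silently read no columns, which is the situation 
the second change is meant to avoid.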
                
> Optimize select count(1) with RCFile and Orc
> --------------------------------------------
>
>                 Key: HIVE-4113
>                 URL: https://issues.apache.org/jira/browse/HIVE-4113
>             Project: Hive
>          Issue Type: Bug
>          Components: File Formats
>            Reporter: Gopal V
>            Assignee: Yin Huai
>             Fix For: 0.12.0
>
>         Attachments: HIVE-4113-0.patch, HIVE-4113.1.patch, HIVE-4113.patch, 
> HIVE-4113.patch
>
>
> select count(1) loads up every column & every row when used with RCFile.
> "select count(1) from store_sales_10_rc" gives
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 31.73 sec   HDFS Read: 234914410 HDFS Write: 8 SUCCESS
> {code}
> Whereas "select count(ss_sold_date_sk) from store_sales_10_rc;" reads far 
> less:
> {code}
> Job 0: Map: 5  Reduce: 1   Cumulative CPU: 29.75 sec   HDFS Read: 28145994 HDFS Write: 8 SUCCESS
> {code}
> That is about 12% of the data read by the COUNT(1) query.
> This was tracked down to the following code in RCFile.java
> {code}
>       } else {
>         // TODO: if no column name is specified e.g, in select count(1) from tt;
>         // skip all columns, this should be distinguished from the case:
>         // select * from tt;
>         for (int i = 0; i < skippedColIDs.length; i++) {
>           skippedColIDs[i] = false;
>         }
> {code}

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
