[jira] [Commented] (HIVE-7664) VectorizedBatchUtil.addRowToBatchFrom is not optimized for Vectorized execution and takes 25% CPU

Navis (JIRA) Tue, 19 Aug 2014 18:55:18 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-7664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14103231#comment-14103231
 ]


Navis commented on HIVE-7664:
-----------------------------

The changes are applied only to ReduceRecordProcessor (they might also apply to 
VectorizedColumnarSerDe, VectorizedOrcAcidRowReader, etc.). I'm not sure this 
will reduce CPU time. [~mmokhtar] Could you run the same benchmark 
with this patch?

Could you also try it after changing ReduceRecordProcessor#472 from
{code}
for (int i = 0; i < length; i++) {
  accessors[tag].visit(convey[i], rowIdx++);
}
{code}
to 
{code}
accessors[tag].visit(Arrays.asList(Arrays.copyOfRange(convey, 0, length)));
{code}
?

> VectorizedBatchUtil.addRowToBatchFrom is not optimized for Vectorized 
> execution and takes 25% CPU
> -------------------------------------------------------------------------------------------------
>
>                 Key: HIVE-7664
>                 URL: https://issues.apache.org/jira/browse/HIVE-7664
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 0.13.1
>            Reporter: Mostafa Mokhtar
>             Fix For: 0.14.0
>
>         Attachments: HIVE-7664.1.patch.txt, HIVE-7664.2.patch.txt
>
>
> In a GROUP BY-heavy vectorized Reducer vertex, 25% of CPU is spent in 
> VectorizedBatchUtil.addRowToBatchFrom().
> I looked at the code of VectorizedBatchUtil.addRowToBatchFrom, and it does 
> not appear to be optimized for vectorized processing.
> addRowToBatchFrom is called for every row. For each row and every column in 
> the batch, getPrimitiveCategory is called to determine the column's type, 
> and column types are stored in a HashMap. For VectorGroupByOperator, column 
> types won't change between batches, so they shouldn't be looked up for 
> every row.
> I recommend storing the column type in StructObjectInspector so that other 
> components can leverage this optimization.
> addRowToBatchFrom also runs a case statement for every row and every column 
> to do type casting. I recommend encapsulating the type logic in templatized 
> methods.
> {code}
> Stack Trace                                                 Sample Count    Percentage(%)
> VectorizedBatchUtil.addRowToBatchFrom                       86              26.543
>    AbstractPrimitiveObjectInspector.getPrimitiveCategory()  34              10.494
>    LazyBinaryStructObjectInspector.getStructFieldData       25              7.716
>    StandardStructObjectInspector.getStructFieldData         4               1.235
> {code}
> The query used : 
> {code}
> select 
>     ss_sold_date_sk
> from
>     store_sales
> where
>     ss_sold_date between '1998-01-01' and '1998-06-01'
> group by ss_item_sk , ss_customer_sk , ss_sold_date_sk
> having sum(ss_list_price) > 50000000000000;
> {code}
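
The recommendation in the description above (resolve each column's type once instead of for every row and column) could look roughly like the following. This is only a minimal sketch under stated assumptions, not the code in the attached patches: Category, Assigner, Batch, buildAssigners and addRow are hypothetical stand-ins and not Hive classes or methods, null handling is omitted, and it uses Java 8 lambdas for brevity.

```java
// Hypothetical sketch: none of these types are Hive classes.
final class ColumnAssignerSketch {

  /** Stand-in for a primitive category; a real type system has many more members. */
  enum Category { LONG, DOUBLE }

  /** Writes one field value into a fixed column at the given row index. */
  interface Assigner {
    void assign(Object value, int rowIdx);
  }

  /** Simplified column-major batch: one primitive array per column. */
  static final class Batch {
    final long[][] longCols;
    final double[][] doubleCols;
    int size;

    Batch(int numCols, int capacity) {
      longCols = new long[numCols][capacity];
      doubleCols = new double[numCols][capacity];
    }
  }

  /** Runs the type switch once per column, when the schema is known. */
  static Assigner[] buildAssigners(Category[] types, Batch batch) {
    Assigner[] assigners = new Assigner[types.length];
    for (int c = 0; c < types.length; c++) {
      final int col = c;
      switch (types[c]) {
        case LONG:
          assigners[c] = (v, r) -> batch.longCols[col][r] = ((Number) v).longValue();
          break;
        case DOUBLE:
          assigners[c] = (v, r) -> batch.doubleCols[col][r] = ((Number) v).doubleValue();
          break;
        default:
          throw new IllegalArgumentException("Unsupported type: " + types[c]);
      }
    }
    return assigners;
  }

  /** Per-row path: no type lookup and no switch, one call per column. */
  static void addRow(Object[] row, Assigner[] assigners, Batch batch) {
    int rowIdx = batch.size++;
    for (int c = 0; c < assigners.length; c++) {
      assigners[c].assign(row[c], rowIdx);
    }
  }
}
```

In this sketch buildAssigners would be called once per operator (or whenever the schema changes) and addRow once per row, so the per-row cost is one interface call per column. Whether that actually recovers the CPU time reported above would have to be confirmed with the same benchmark.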



--
This message was sent by Atlassian JIRA
(v6.2#6252)
