[jira] Commented: (HIVE-477) Some optimization thoughts for Hive

He Yongqiang (JIRA) Mon, 11 May 2009 07:32:10 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12708053#action_12708053
 ]


He Yongqiang commented on HIVE-477:
-----------------------------------

New test results showing how much time is spent in the RecordWriter and how 
much is spent in operator processing.

The test involves four tables: tablerc1, tablerc2, tableseq1, and tableseq2. 
Each has 30 string columns.
tablerc1 and tablerc2 are stored as RCFile; tableseq1 and tableseq2 are stored 
as SequenceFile.
tablerc1 and tablerc2 are about 134M; tableseq1 and tableseq2 are about 178M. 
All four tables store the same original data.

Here are the results:
|| Command || Normal execution time (whole job / first mapper / second mapper) || No RecordWriter write in FileSinkOperator (whole job / first mapper / second mapper) || Empty ExecMapper map body (whole job / first mapper / second mapper) ||
| insert overwrite tablerc2 select * from tablerc1 | 131 / 115 / 117 | 45 / 34 / 34 | 26 / 16 / 15 |
| insert overwrite tablerc2 select * from tablerc1 | 121 / 114 / 116 | 42 / 34 / 33 | 20 / 16 / 15 |
| insert overwrite tableseq2 select * from tableseq1 | 129 / 120 / 122 | 37 / 35 / 34 | 18 / 12 / 12 |
| insert overwrite tableseq2 select * from tableseq1 | 130 / 127 / 123 | 38 / 35 / 35 | 17 / 13 / 12 |
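
One way to read the whole-job figures in the table above, assuming the three 
configurations differ only in the work they skip, is by subtraction: normal 
minus no-write approximates the cost of RecordWriter's write, and no-write 
minus empty-map-body approximates the cost of operator processing. The sketch 
below only does that arithmetic; the class and method names are illustrative 
and are not part of Hive.

```java
public class TimeBreakdown {
    /** Splits one whole-job measurement into three parts by subtraction. */
    static int[] breakdown(int normal, int noWrite, int emptyMap) {
        int writer = normal - noWrite;       // attributed to RecordWriter's write
        int operators = noWrite - emptyMap;  // attributed to operator processing
        int rest = emptyMap;                 // what remains with an empty map body
        return new int[] {writer, operators, rest};
    }

    public static void main(String[] args) {
        // Whole-job figures copied from the four rows of the table.
        int[][] runs = {
            {131, 45, 26}, {121, 42, 20}, {129, 37, 18}, {130, 38, 17}
        };
        for (int[] r : runs) {
            int[] b = breakdown(r[0], r[1], r[2]);
            System.out.printf("writer=%d operators=%d rest=%d%n", b[0], b[1], b[2]);
        }
    }
}
```

For the first row this gives 86 for the writer, 19 for operator processing, 
and 26 for the rest, which under that assumption puts RecordWriter's write at 
roughly two thirds of the whole job.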

> Some optimization thoughts for Hive
> -----------------------------------
>
>                 Key: HIVE-477
>                 URL: https://issues.apache.org/jira/browse/HIVE-477
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: He Yongqiang
>
> Before we start working on HIVE-461, I am doing some profiling of Hive. 
> Here are some thoughts for improvements:
> minor:
> 1) Add a new HiveText to replace Text. It can avoid a byte copy when 
> initializing LazyString. I have a draft implementation, and it shows a ~1% 
> performance gain.
> 2) Change StructObjectInspector's 
>     {noformat}
>      public List<Object> getStructFieldsDataAsList(Object data);
>     {noformat}
> to 
>     {noformat}
>      public Object[] getStructFieldsDataAsArray(Object data);
>     {noformat}
> In my profiling test this showed some performance gain, but in actual 
> execution it did not. Either way, returning a Java array will reduce the GC 
> burden of collecting ArrayList objects.
> not so minor:
> 3) Split FileSinkOperator's Writer into another thread, adding a 
> producer-consumer array as the bridge between the operators thread and the 
> writer thread.
> 4) The operator stack is fairly deep. To avoid instruction cache misses and 
> improve data cache efficiency, I suggest letting Hive's operators process an 
> array of rows instead of only one row at a time.
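
Ideas 3 and 4 can be sketched together, under stated assumptions: a bounded 
queue stands in for the producer-consumer array, and each queue element is a 
batch of rows rather than a single row. RowSink is a hypothetical stand-in for 
the writer, not Hive's actual RecordWriter interface, and the queue capacity of 
16 is arbitrary. Error handling is omitted; if the sink throws, the producer 
could block once the queue fills.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

public class AsyncWriterSketch {
    // Sentinel batch telling the writer thread that no more rows will arrive.
    private static final Object[][] END = new Object[0][];

    /** Hypothetical stand-in for the writer; not Hive's actual interface. */
    interface RowSink {
        void write(Object[] row);
    }

    /** Writes all batches to the sink on a separate thread; returns rows written. */
    static int run(List<Object[][]> batches, RowSink sink) throws InterruptedException {
        // Bounded queue: the producer blocks when the writer falls behind.
        BlockingQueue<Object[][]> queue = new ArrayBlockingQueue<>(16);
        int[] written = {0};

        Thread writer = new Thread(() -> {
            try {
                // Compare by identity so an ordinary empty batch is not mistaken for END.
                for (Object[][] batch = queue.take(); batch != END; batch = queue.take()) {
                    for (Object[] row : batch) {
                        sink.write(row);
                        written[0]++;
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        // Operator side: hand over whole batches (idea 4) instead of single rows.
        for (Object[][] batch : batches) {
            queue.put(batch);
        }
        queue.put(END);
        writer.join(); // after join(), the writer's updates are visible here
        return written[0];
    }

    public static void main(String[] args) throws InterruptedException {
        List<Object[][]> batches = new ArrayList<>();
        batches.add(new Object[][] {{"a", "b"}, {"c", "d"}});
        batches.add(new Object[][] {{"e", "f"}});
        List<Object[]> out = new ArrayList<>();
        int n = run(batches, out::add);
        System.out.println(n + " rows written");
    }
}
```

Passing batches through the queue means the two threads synchronize once per 
batch rather than once per row, which is one reason the two ideas may fit 
together; whether this helps in practice would need the same kind of 
measurement as above.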

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
