[ 
https://issues.apache.org/jira/browse/HIVE-477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12707218#action_12707218
 ] 

Zheng Shao commented on HIVE-477:
---------------------------------

For 3), adding another thread means we need to buffer the data between the two 
threads. It would be great to have some data beforehand on what percentage of 
time this can save us. At the least, we should know how much time is spent in 
the operator stack and how much is spent in the writer.
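
To make the buffering cost in 3) concrete, here is a minimal sketch of a 
hand-off between an operator thread and a writer thread. It is not Hive code: 
the class WriterHandoffSketch, the END sentinel, and the use of 
ArrayBlockingQueue as the bridge are all assumptions for illustration, and the 
"write" is just an append to a list.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Hypothetical sketch, not part of Hive.
class WriterHandoffSketch {
    // Sentinel telling the writer thread that no more rows will arrive.
    private static final Object[] END = new Object[0];

    /** Pushes rows through a bounded queue to a writer thread; returns the rows in the order written. */
    static List<Object[]> run(List<Object[]> rows, int capacity) {
        BlockingQueue<Object[]> queue = new ArrayBlockingQueue<>(capacity);
        List<Object[]> written = new ArrayList<>();

        Thread writer = new Thread(() -> {
            try {
                for (Object[] row = queue.take(); row != END; row = queue.take()) {
                    written.add(row); // stand-in for the real write call
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        writer.start();

        try {
            for (Object[] row : rows) {
                // Copy before enqueueing: the producer may overwrite its row object
                // while the writer is still holding the previous one.
                queue.put(row.clone());
            }
            queue.put(END);
            writer.join(); // after join, the contents of 'written' are visible to this thread
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return written;
    }
}
```

The per-row copy and the queue operations are the overhead that the second 
thread has to win back, which is why the time split between the operator stack 
and the writer matters.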

For 4), there are some difficulties. We currently use a single object to pass 
all rows, so doing 4) means we would need to use multiple objects. Also, given 
the larger caches of modern CPUs, I am not sure whether our operator stack 
actually falls out of cache.
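
The single-object difficulty in 4) can be illustrated with a small sketch. 
Again, this is not Hive code: RowBatchSketch and its two methods are 
hypothetical names, and a one-element Object[] stands in for a row.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not part of Hive.
class RowBatchSketch {
    /** Reuses one row object and batches references to it: every slot aliases the same row. */
    static List<Object[]> batchSharedRow(int[] values) {
        Object[] reused = new Object[1];
        List<Object[]> batch = new ArrayList<>();
        for (int v : values) {
            reused[0] = v;     // overwrites the previous row's data
            batch.add(reused); // same object added again
        }
        return batch;
    }

    /** Copies the reused row before batching: one object per row, at the cost of extra allocation. */
    static List<Object[]> batchCopiedRows(int[] values) {
        Object[] reused = new Object[1];
        List<Object[]> batch = new ArrayList<>();
        for (int v : values) {
            reused[0] = v;
            batch.add(reused.clone());
        }
        return batch;
    }
}
```

With a shared row, the batch ends up holding only the last row's data in every 
slot, so processing an array of rows forces either a copy per row or a pool of 
row objects.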


> Some optimization thoughts for Hive
> -----------------------------------
>
>                 Key: HIVE-477
>                 URL: https://issues.apache.org/jira/browse/HIVE-477
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: He Yongqiang
>
> Before we start working on HIVE-461, I am doing some profiling of Hive. 
> Here are some thoughts for improvements:
> minor:
> 1) Add a new HiveText to replace Text. It can avoid a byte copy when 
> initializing LazyString. I have a draft version, and it shows a ~1% 
> performance gain.
> 2) Change StructObjectInspector's 
>     {noformat}
>      public List<Object> getStructFieldsDataAsList(Object data);
>     {noformat}
> to be 
>     {noformat}
>      public Object[] getStructFieldsDataAsArray(Object data);
>     {noformat}
> In my profiling test, this shows some performance gains, but in actual 
> execution it did not. In any case, returning a Java array will reduce the 
> GC burden of collecting ArrayList objects.
> not so minor:
> 3) Split FileSinkOperator's Writer into another thread, adding a 
> producer-consumer array as the bridge between the operator thread and the 
> writer thread.
> 4) The operator stack is rather deep. To avoid instruction cache misses and 
> increase the efficiency of the data cache, I suggest letting Hive's 
> operators process an array of rows instead of only one row at a time.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
