[jira] Commented: (HIVE-756) performance improvement for RCFile and ColumnarSerDe in Hive

Ning Zhang (JIRA) Mon, 17 Aug 2009 12:41:41 -0700

    [ 
https://issues.apache.org/jira/browse/HIVE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744176#action_12744176
 ]


Ning Zhang commented on HIVE-756:
---------------------------------

@zheng,
1) line 122 in the patch should take care of this: we assume no columns are 
specified by default. So we don't need to read any column for queries such as 
select count(1) from tt;
2) for this case I assume you are referring to select * from tt. It seems Hive 
doesn't need a mapper for this query, so the function is not called. Correct me 
if I'm wrong.
3) the selected_cols in ColumnarStruct is a temporary ArrayList, used because 
it is easy to append elements to a list. The prjColIDs is the final array of 
col IDs, used because reading elements by index is efficient. Using int[] 
instead of ArrayList should be more efficient if the array elements are 
accessed many times. I agree that we can rename the variable selected_cols to 
something like 'tmp_selected_cols' to distinguish its lifetime from that of 
prjColIDs.
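
To illustrate point 3, here is a minimal sketch of the "temporary list, then 
final array" pattern. It is not the code from the patch: the class name, the 
method name, and the assumption that the selection comes from a boolean 
"skipped" array are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class ProjectionSketch {
    // Hypothetical helper, not taken from hive-756.patch.
    // Assumes skipped[i] == true means column i is not needed by the query.
    static int[] selectedColumnIds(boolean[] skipped) {
        // Temporary list: convenient because the number of selected
        // columns is not known until the scan is finished.
        List<Integer> tmpSelectedCols = new ArrayList<>();
        for (int i = 0; i < skipped.length; i++) {
            if (!skipped[i]) {
                tmpSelectedCols.add(i);
            }
        }
        // Final array: primitive ints, so later reads by index need
        // no unboxing and no method call.
        int[] prjColIDs = new int[tmpSelectedCols.size()];
        for (int i = 0; i < prjColIDs.length; i++) {
            prjColIDs[i] = tmpSelectedCols.get(i);
        }
        return prjColIDs;
    }
}
```

The list only lives inside the method, which is the lifetime difference that 
a name like 'tmp_selected_cols' is meant to signal.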


@Yongqiang,
1) the introduction of prjColIDs is based on the observation in the YourKit 
profiler that a lot of queries access only a very small subset of columns, 
while the boolean array skippedColIDs has an entry for every column in the 
table. In the loop that processes each row, skippedColIDs has to be scanned to 
find out which columns are not skipped. This info does not change across rows, 
so the scan should be moved outside the row iterator. The prjColIDs is 
introduced to record the set of selected columns at the iterator 
initialization phase. This could save a lot if the number of columns is large 
and the number of selected columns is small.
2) the resetValid() is a good point, but it seems that the call to 
resetValid(columnNumber) is just to make sure the buffer is large enough for 
the columnNumber, and the columnNumber is only set during Reader() 
construction. So if all access to RCFile.Reader() follows the protocol of 
initializing the reader through the constructor and never changing the 
settings (number of total columns and the set of returned columns) while 
iterating over rows, then we don't need to re-initialize the default values of 
the non-selected columns for each row. Doing so for each row is a huge cost 
for large table scans.
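
The saving described in point 1 can be sketched roughly as follows. This is 
an illustration only, not the RCFile code: the class and method names are 
hypothetical, and the loops just count how many column slots are visited 
instead of reading any data.

```java
public class RowLoopSketch {
    // Done once, at iterator initialization: collect the indices of
    // the columns that are NOT skipped.
    static int[] projectedIds(boolean[] skippedColIDs) {
        int n = 0;
        for (boolean s : skippedColIDs) {
            if (!s) {
                n++;
            }
        }
        int[] prjColIDs = new int[n];
        int k = 0;
        for (int i = 0; i < skippedColIDs.length; i++) {
            if (!skippedColIDs[i]) {
                prjColIDs[k++] = i;
            }
        }
        return prjColIDs;
    }

    // Before: every row re-scans the whole boolean array.
    static long visitsScanningAll(boolean[] skippedColIDs, int rows) {
        long visits = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < skippedColIDs.length; c++) {
                visits++; // one check per column, selected or not
            }
        }
        return visits;
    }

    // After: every row touches only the precomputed selected columns.
    static long visitsProjected(int[] prjColIDs, int rows) {
        long visits = 0;
        for (int r = 0; r < rows; r++) {
            for (int c = 0; c < prjColIDs.length; c++) {
                visits++; // one visit per selected column
            }
        }
        return visits;
    }
}
```

With, say, 100 columns of which 2 are selected, the per-row work in this 
sketch drops from 100 slot visits to 2, which is why the gain grows with the 
table width and shrinks as more columns are selected.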

> performance improvement for RCFile and ColumnarSerDe in Hive
> ------------------------------------------------------------
>
>                 Key: HIVE-756
>                 URL: https://issues.apache.org/jira/browse/HIVE-756
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: hive-756.patch
>
>
> There are some easy performance improvements in the columnar storage in Hive 
> I found during Hackathon. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
