[ https://issues.apache.org/jira/browse/HIVE-756?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12744176#action_12744176 ]
Ning Zhang commented on HIVE-756:
---------------------------------

@zheng,
1) line 122 in the patch should take care of this: we assume no columns are specified by default, so we don't need to read any column for queries such as select count(1) from tt;
2) for this case I assume you are referring to select * from tt. It seems Hive doesn't need a mapper for this query, so the function is not called. Correct me if I'm wrong.
3) the selected_cols in ColumnarStruct is a temporary ArrayList, used because it is easy to add elements to a list. The prjColIDs is the final array of column IDs, used because reading elements by index is efficient. Using an int[] instead of an ArrayList should be more efficient if the array elements are accessed many times. I agree that we can rename the variable selected_cols to something like 'tmp_selected_cols' to distinguish its lifetime from that of prjColIDs.

@Yongqiang,
1) the introduction of prjColIDs is based on an observation in the YourKit profiler: a lot of queries access only a very small subset of columns, while the boolean array skippedColIDs has an entry for every column in the table. In the loop that processes each row, skippedColIDs has to be scanned to find out which columns are not skipped. This information does not change across rows, so the scan should be moved outside the row iterator. The prjColIDs is introduced to record the set of selected columns once, at the iterator initialization phase. This could save a lot if the number of columns is large and the number of selected columns is small.
2) the resetValid() is a good point, but it seems that the call to resetValid(columnNumber) is just to make sure the buffer is large enough for the columnNumber, and the columnNumber is only set during Reader() construction.
So if all access to RCFile.Reader() follows the protocol of initializing the reader through the constructor and never changing the settings (the total number of columns and the set of returned columns) while iterating over the rows, then we don't need to initialize the columnNumber default values of the non-selected columns for each row. That is a huge cost for large table scans.

> performance improvement for RCFile and ColumnarSerDe in Hive
> ------------------------------------------------------------
>
>                 Key: HIVE-756
>                 URL: https://issues.apache.org/jira/browse/HIVE-756
>             Project: Hadoop Hive
>          Issue Type: Improvement
>            Reporter: Ning Zhang
>            Assignee: Ning Zhang
>         Attachments: hive-756.patch
>
>
> There are some easy performance improvements in the columnar storage in Hive
> I found during Hackathon.
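
For readers following the prjColIDs discussion in the comment, here is a minimal sketch of the idea in plain Java. It is not the code from hive-756.patch. The class ProjectionSketch, the methods buildPrjColIDs and sumProjected, and the use of long[][] as stand-in rows are all hypothetical names and simplifications made up for illustration. Only skippedColIDs, prjColIDs and the temporary-list role of selected_cols come from the comment. It shows the skip flags being scanned once at initialization, so that the per-row loop visits only the projected column IDs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; not part of Hive or the HIVE-756 patch.
class ProjectionSketch {

    // Built once at reader/iterator initialization: indices of columns that are NOT skipped.
    static int[] buildPrjColIDs(boolean[] skippedColIDs) {
        // Temporary list, convenient for appending (the role selected_cols plays in the comment).
        List<Integer> tmpSelectedCols = new ArrayList<>();
        for (int i = 0; i < skippedColIDs.length; i++) {
            if (!skippedColIDs[i]) {
                tmpSelectedCols.add(i);
            }
        }
        // Final int[] for cheap indexed access inside the per-row loop.
        int[] prjColIDs = new int[tmpSelectedCols.size()];
        for (int i = 0; i < prjColIDs.length; i++) {
            prjColIDs[i] = tmpSelectedCols.get(i);
        }
        return prjColIDs;
    }

    // Per-row work touches only the projected columns instead of rescanning skippedColIDs.
    // Rows are modeled as long[] purely for this sketch.
    static long sumProjected(long[][] rows, int[] prjColIDs) {
        long total = 0;
        for (long[] row : rows) {
            for (int colID : prjColIDs) {
                total += row[colID];
            }
        }
        return total;
    }

    public static void main(String[] args) {
        boolean[] skippedColIDs = {true, false, true, true, false};
        int[] prjColIDs = buildPrjColIDs(skippedColIDs); // computed once, outside the row loop
        long[][] rows = {{1, 2, 3, 4, 5}, {10, 20, 30, 40, 50}};
        System.out.println(sumProjected(rows, prjColIDs));
    }
}
```

With a table of many columns and only a few selected, the inner loop length drops from the total column count to the number of selected columns, which is the saving described in point 1 under @Yongqiang.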