[ https://issues.apache.org/jira/browse/HIVE-5009?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13755974#comment-13755974 ]
Benjamin Jakobus commented on HIVE-5009: ---------------------------------------- Bump :) > Fix minor optimization issues > ----------------------------- > > Key: HIVE-5009 > URL: https://issues.apache.org/jira/browse/HIVE-5009 > Project: Hive > Issue Type: Improvement > Reporter: Benjamin Jakobus > Assignee: Benjamin Jakobus > Priority: Minor > Fix For: 0.12.0 > > Original Estimate: 48h > Remaining Estimate: 48h > > I have found some minor optimization issues in the codebase, which I would > like to rectify and contribute. Specifically, these are: > The optimizations that could be applied to Hive's code base are as follows: > 1. Use StringBuffer when appending strings - In 184 instances, the > concatination operator (+=) was used when appending strings. This is > inherintly inefficient - instead Java's StringBuffer or StringBuilder class > should be used. 12 instances of this optimization can be applied to the > GenMRSkewJoinProcessor class and another three to the optimizer. CliDriver > uses the + operator inside a loop, so does the column projection utilities > class (ColumnProjectionUtils) and the aforementioned skew-join processor. > Tests showed that using the StringBuilder when appending strings is 57\% > faster than using the + operator (using the StringBuffer took 122 > milliseconds whilst the + operator took 284 milliseconds). The reason as to > why using the StringBuffer class is preferred over using the + operator, is > because > String third = first + second; > gets compiled to: > StringBuilder builder = new StringBuilder( first ); > builder.append( second ); > third = builder.toString(); > Therefore, when building complex strings, that, for example involve loops, > require many instantiations (and as discussed below, creating new objects > inside loops is inefficient). > 2. Use arrays instead of List - Java's java.util.Arrays class asList method > is a more efficient at creating creating lists from arrays than using loops > to manually iterate over the elements (using asList is computationally very > cheap, O(1), as it merely creates a wrapper object around the array; looping > through the list however has a complexity of O(n) since a new list is created > and every element in the array is added to this new list). As confirmed by > the experiment detailed in Appendix D, the Java compiler does not > automatically optimize and replace tight-loop copying with asList: the > loop-copying of 1,000,000 items took 15 milliseconds whilst using asList is > instant. > Four instances of this optimization can be applied to Hive's codebase (two of > these should be applied to the Map-Join container - MapJoinRowContainer) - > lines 92 to 98: > for (obj = other.first(); obj != null; obj = other.next()) { > ArrayList<Object> ele = new ArrayList(obj.length); > for (int i = 0; i < obj.length; i++) { > ele.add(obj[i]); > } > list.add((Row) ele); > } > 3. Unnecessary wrapper object creation - In 31 cases, wrapper object creation > could be avoided by simply using the provided static conversion methods. As > noted in the PMD documentation, "using these avoids the cost of creating > objects that also need to be garbage-collected later." > For example, line 587 of the SemanticAnalyzer class, could be replaced by the > more efficient parseDouble method call: > // Inefficient: > Double percent = Double.valueOf(value).doubleValue(); > // To be replaced by: > Double percent = Double.parseDouble(value); > Our test case in Appendix D confirms this: converting 10,000 strings into > integers using Integer.parseInt(gen.nextSessionId()) (i.e. creating an > unnecessary wrapper object) took 119 on average; using parseInt() took only > 38. Therefore creating even just one unnecessary wrapper object can make your > code up to 68% slower. > 4. Converting literals to strings using + "" - Converting literals to strings > using + "" is quite inefficient (see Appendix D) and should be done by > calling the toString() method instead: converting 1,000,000 integers to > strings using + "" took, on average, 1340 milliseconds whilst using the > toString() method only required 1183 milliseconds (hence adding empty strings > takes nearly 12% more time). > 89 instances of this using + "" when converting literals were found in Hive's > codebase - one of these are found in the JoinUtil. > 5. Avoid manual copying of arrays - Instead of copying arrays as is done in > GroupByOperator on line 1040 (see below), the more efficient System.arraycopy > can be used (arraycopy is a native method meaning that the entire memory > block is copied using memcpy or mmove). > // Line 1040 of the GroupByOperator > for (int i = 0; i < keys.length; i++) { > forwardCache[i] = keys[i]; > } > Using System.arraycopy on an array of 10,000 strings was (close to) instant > whilst the manual copy took 6 milliseconds. > 11 instances of this optimization should be applied to the Hive codebase. > 6. Avoiding instantiation inside loops - As noted in the PMD documentation, > "new objects created within loops should be checked to see if they can > created outside them and reused.". > Declaring variables inside a loop (i from 0 to 10,000) took 300 milliseconds > whilst declaring them outside took only 88 milliseconds (this can be > explained by the fact that when declaring a variable outside the loop, its > reference will be re-used for each iteration. However when declaring > variables inside a loop, new references will be created for each iteration. > In our case, 10,000 references will be created by the time that this loop > finishes, meaning lots of work in terms of memory allocation and garbage > collection). 1623 instances of this optimization can be applied. > To summarize, I propose to modify the code to address issue 1 and issue 6 > (remaining issues (2 - 5) will be addressed later). Details are specified as > sub-tasks. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira