Yingyi Bu has posted comments on this change. Change subject: ASTERIXDB-1566,ASTERIXDB-1628: Fixed External Hash Group By to conform to the memory budget ......................................................................
Patch Set 9: (19 comments) Some high level comments: 1. The accounting of memory budget is still based on #Frames instead of #Bytes? How does the Big Object case get handled? 2. IMO, GC should be internal to SerializableHashTable. SerializableHashTable itself can decide whether to do GC or not in insert/delete based on a certain policy, e.g., memory is full in the last N calls and the ratio of wasted space is higher than some threshold. 3. I don't recall the exact reason why an additional parameter compiler.grouphashtablememory is needed. If it is really needed, we'd better audit in AsterixCompilerProperties.getGroupHashTableMemorySize(), e.g., it couldn't go beyond an certain upper bound calculated from compiler.groupmemory, assuming each tuple only have one int8 field payload (which is the worst case). 4. It seems that the GC code in SerializableHashTable could be simplified, using Arrays.fill(...). ByteBuffer.writeXXX() is not as efficient as direct bytes reads/writes. 5. The GC policy and GC implementation needs to be tested more thoroughly, probably using MockIto. Especially, we might want to make sure the GC code is not run too often. 6. This change should better be perf-tested as it will impact Hash Join performance as well (I guess). One possibility could be to run Jianfeng's Micro-benchmark that he used for the Big Object change to contrast the versions with/without this change. 7. Address SonarQube comments as much as possible, e.g., useless assignments and useless parens. https://asterix-gerrit.ics.uci.edu/#/c/1056/9/asterixdb/asterix-docker/docker/asterix-configuration.xml File asterixdb/asterix-docker/docker/asterix-configuration.xml: Line 251: more efficiently using pointers. Actual Group-by value is stored in data table. "Group-by value" -> "aggregate values" Line 253: the same as the given value divided by 40 bytes. (E.g., 16MB / 40 = 419,430). Can you add the explanation of why 40 is the number into the comments? https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/HashSpillableTableFactory.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/HashSpillableTableFactory.java: Line 91: final ITuplePartitionComputer tpc1 = new FieldHashPartitionComputerFamily(intermediateResultKeys, better variable name. e.g., tpcIntermediate https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/ISpillableTable.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/ISpillableTable.java: Line 55: * @return annotate @return. https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/external/ExternalHashGroupBy.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/external/ExternalHashGroupBy.java: Line 79: result = table.insert(accessor, i); flip the else branch as the if branch, and hence not nest more than 4 levels. https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/structures/ISerializableTable.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/structures/ISerializableTable.java: Line 43: boolean isGarbageCollectioNeeded(); This method is never called. Do we really need it? Line 46: throws HyracksDataException; Why does this need to be expose to the outside? It seems to be that this should be internal to this HashTable and automatically invoked in the insert/delete calls from time to time. https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/structures/SerializableHashTable.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/structures/SerializableHashTable.java: Line 47: private static int INVALID_VALUE = -1; 1. static final 2. can we write -1 as 0xFFFFFFFF, such that you can use arrays.fill(...) instead of ByteBuffer.writeInt(...), which is more convenient and more efficient? Line 68: int nextSlotIntPosInPageForGC = 0; Why do ***GC need to class fields instead of local variables? It seems to me that they are all local to the GC procedure. It seems tthat the GC procedure stops any insert/delete and can be stateless. Line 262: lastContentFrame.writeInt(lastOffsetInCurrentFrame + i, INVALID_VALUE); Simplify that using Arrays.fill(...). Line 384: for (int i = 0; i < frameCapacity; i++) Simplify that using one line: Arrays.fill(...) Line 413: int numberOfHeaderFrame = (int) Math.ceil(tableSize * 2 / frameSize); > CRITICAL SonarQube violation: Pls address this. Line 414: int numberOfContentFrame = (int) Math > MAJOR SonarQube violation: Pls address this. Line 439: // For calculating the original hash value wastedIntSpaceCount is not modified after executeGarbageCollection? Line 440: TuplePointer tmpTuplePointerForGC = new TuplePointer(); reuse the object. Line 488: // Swipe the region that can't be used. Using Arrays.fill(...) to simplify that. Line 505: } else { flip the else branch as the if branch and hence you can reduce nesting levels. Line 525: } Should you call ctx.deallocate(...) such that from the ctx perspective, it actually deallocates and more space can be allocated from ctx? Line 643: currentReadContentFrameForGC.writeInt(tempReadIntPosInPage + i, INVALID_VALUE); Simplify that using Arrays.fill(...). -- To view, visit https://asterix-gerrit.ics.uci.edu/1056 To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings Gerrit-MessageType: comment Gerrit-Change-Id: I2b323e9a2141b4c1dd1652a360d2d9354d3bc3f5 Gerrit-PatchSet: 9 Gerrit-Project: asterixdb Gerrit-Branch: master Gerrit-Owner: Taewoo Kim <[email protected]> Gerrit-Reviewer: Jenkins <[email protected]> Gerrit-Reviewer: Jianfeng Jia <[email protected]> Gerrit-Reviewer: Michael Blow <[email protected]> Gerrit-Reviewer: Taewoo Kim <[email protected]> Gerrit-Reviewer: Till Westmann <[email protected]> Gerrit-Reviewer: Wenhai Li <[email protected]> Gerrit-Reviewer: Yingyi Bu <[email protected]> Gerrit-HasComments: Yes
