Yingyi Bu has posted comments on this change.

Change subject: ASTERIXDB-1566,ASTERIXDB-1628: Fixed External Hash Group By to 
conform to the memory budget
......................................................................


Patch Set 9:

(19 comments)

Some high level comments:

1.  The accounting of memory budget is still based on #Frames instead of 
#Bytes?  How does the Big Object case get handled?

2.  IMO, GC should be internal to SerializableHashTable.  SerializableHashTable 
itself can decide whether to do GC or not in insert/delete based on a certain 
policy, e.g., memory is full in the last N calls and the ratio of wasted space 
is higher than some threshold.

3.  I don't recall the exact reason why an additional parameter 
compiler.grouphashtablememory is needed. If it is really needed, we'd better 
audit in  AsterixCompilerProperties.getGroupHashTableMemorySize(), e.g., it 
couldn't go beyond an certain upper bound calculated from compiler.groupmemory, 
assuming each tuple only have one int8 field payload (which is the worst case).

4.  It seems that the GC code in SerializableHashTable could be simplified, 
using Arrays.fill(...).  ByteBuffer.writeXXX() is not as efficient as direct 
bytes reads/writes.

5.  The GC policy and GC implementation needs to be tested more thoroughly, 
probably using MockIto.  Especially, we might want to make sure the GC code is 
not run too often.

6.  This change should better be perf-tested as it will impact Hash Join 
performance as well (I guess).  One possibility could be to run Jianfeng's 
Micro-benchmark that he used for the Big Object change to contrast the versions 
with/without this change.

7.  Address SonarQube comments as much as possible, e.g., useless assignments 
and useless parens.

https://asterix-gerrit.ics.uci.edu/#/c/1056/9/asterixdb/asterix-docker/docker/asterix-configuration.xml
File asterixdb/asterix-docker/docker/asterix-configuration.xml:

Line 251:       more efficiently using pointers. Actual Group-by value is 
stored in data table.
"Group-by value" -> "aggregate values"


Line 253:       the same as the given value divided by 40 bytes. (E.g., 16MB / 
40 = 419,430).
Can you add the explanation of why 40 is the number into the comments?


https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/HashSpillableTableFactory.java
File 
hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/HashSpillableTableFactory.java:

Line 91:         final ITuplePartitionComputer tpc1 = new 
FieldHashPartitionComputerFamily(intermediateResultKeys,
better variable name. e.g.,
tpcIntermediate


https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/ISpillableTable.java
File 
hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/ISpillableTable.java:

Line 55:      * @return
annotate @return.


https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/external/ExternalHashGroupBy.java
File 
hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/group/external/ExternalHashGroupBy.java:

Line 79:                         result = table.insert(accessor, i);
flip the else branch as the if branch, and hence not nest more than 4 levels.


https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/structures/ISerializableTable.java
File 
hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/structures/ISerializableTable.java:

Line 43:     boolean isGarbageCollectioNeeded();
This method is never called. Do we really need it?


Line 46:             throws HyracksDataException;
Why does this need to be expose to the outside?
It seems to be that this should be internal to this HashTable and automatically 
invoked in the insert/delete calls from time to time.


https://asterix-gerrit.ics.uci.edu/#/c/1056/9/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/structures/SerializableHashTable.java
File 
hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/structures/SerializableHashTable.java:

Line 47:     private static int INVALID_VALUE = -1;
1. static final

2. can we write -1 as 0xFFFFFFFF, such that you can use arrays.fill(...) 
instead of ByteBuffer.writeInt(...), which is more convenient and more 
efficient?


Line 68:     int nextSlotIntPosInPageForGC = 0;
Why do ***GC need to class fields instead of local variables?
It seems to me that they are all local to the GC procedure.  It seems tthat the 
GC procedure stops any insert/delete and can be stateless.


Line 262:                     
lastContentFrame.writeInt(lastOffsetInCurrentFrame + i, INVALID_VALUE);
Simplify that using Arrays.fill(...).


Line 384:         for (int i = 0; i < frameCapacity; i++)
Simplify that using one line:

Arrays.fill(...)


Line 413:         int numberOfHeaderFrame = (int) Math.ceil(tableSize * 2 / 
frameSize);
> CRITICAL SonarQube violation:
Pls address this.


Line 414:         int numberOfContentFrame = (int) Math
> MAJOR SonarQube violation:
Pls address this.


Line 439:         // For calculating the original hash value
wastedIntSpaceCount is not modified after executeGarbageCollection?


Line 440:         TuplePointer tmpTuplePointerForGC = new TuplePointer();
reuse the object.


Line 488:                             // Swipe the region that can't be used.
Using Arrays.fill(...) to simplify that.


Line 505:                 } else {
flip the else branch as the if branch and hence you can reduce nesting levels.


Line 525:             }
Should you call ctx.deallocate(...)  such that from the ctx perspective, it 
actually deallocates and more space can be allocated from ctx?


Line 643:                 
currentReadContentFrameForGC.writeInt(tempReadIntPosInPage + i, INVALID_VALUE);
Simplify that using Arrays.fill(...).


-- 
To view, visit https://asterix-gerrit.ics.uci.edu/1056
To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I2b323e9a2141b4c1dd1652a360d2d9354d3bc3f5
Gerrit-PatchSet: 9
Gerrit-Project: asterixdb
Gerrit-Branch: master
Gerrit-Owner: Taewoo Kim <[email protected]>
Gerrit-Reviewer: Jenkins <[email protected]>
Gerrit-Reviewer: Jianfeng Jia <[email protected]>
Gerrit-Reviewer: Michael Blow <[email protected]>
Gerrit-Reviewer: Taewoo Kim <[email protected]>
Gerrit-Reviewer: Till Westmann <[email protected]>
Gerrit-Reviewer: Wenhai Li <[email protected]>
Gerrit-Reviewer: Yingyi Bu <[email protected]>
Gerrit-HasComments: Yes

Reply via email to