[GitHub] Ben-Zvi commented on a change in pull request #1606: Drill 6845: Semi-Hash-Join to skip incoming build duplicates, automatically stop skipping if too few

GitBox Fri, 11 Jan 2019 19:48:35 -0800

Ben-Zvi commented on a change in pull request #1606: Drill 6845: Semi-Hash-Join 
to skip incoming build duplicates, automatically stop skipping if too few
URL: https://github.com/apache/drill/pull/1606#discussion_r247301472


 ##########
 File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinSpillControlImpl.java
 ##########
 @@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.physical.impl.join;
+
+  import org.apache.drill.exec.memory.BufferAllocator;
+  import org.apache.drill.exec.record.RecordBatchSizer;
+  import org.apache.drill.exec.record.VectorContainer;
+
+/**
+ * This class is currently used only for Semi-Hash-Join that avoids duplicates by the use of a hash table.
+ * The method {@link HashJoinMemoryCalculator.HashJoinSpillControl#shouldSpill(VectorContainer)} returns true if the memory available now to the allocator is not enough
+ * to hold (a multiple of, for safety) a newly allocated batch.
+ */
+public class HashJoinSpillControlImpl implements HashJoinMemoryCalculator.HashJoinSpillControl {
+  private BufferAllocator allocator;
+  private int recordsPerBatch;
+  private int minBatchesInAvailableMemory;
+
+  HashJoinSpillControlImpl(BufferAllocator allocator, int recordsPerBatch, int minBatchesInAvailableMemory) {
+    this.allocator = allocator;
+    this.recordsPerBatch = recordsPerBatch;
+    this.minBatchesInAvailableMemory = minBatchesInAvailableMemory;
+  }
+
+  @Override
+  public boolean shouldSpill(VectorContainer currentVectorContainer) {
+    assert currentVectorContainer.hasRecordCount();
 
 Review comment:
   Another commit (3613689) addresses the above issues. The new "spill 
control" code was rearranged to resemble the regular memory calculator:
   * The parameter to `shouldSpill()` was removed. Instead, that information 
(the inner build batch size) is calculated from the batchMemoryManager.
   * A post-build class was added, which performs a similar calculation using 
the inner probe batch size plus the number of spilled partitions (added as a 
parameter to `initialize()`). It is called after the build, and spills 
whole partition(s) if there is not enough memory to hold one inner probe batch 
per spilled partition. (The `shouldSpill()` call now covers the "skipping semi" 
cases as well, restoring the order mentioned in the prior comments.)
   * The number of partitions is reduced as needed, in a manner similar to the 
regular memory calculator: `initialize()` calls an internal 
`calculateMemoryUsage()`.
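
   The three checks above can be sketched roughly as follows. This is only an 
illustration of the logic as described in this comment, not the code in the 
commit: the class name, method names, and parameters are all hypothetical, and 
halving the partition count is an assumed reduction strategy.

```java
// Illustrative sketch only. None of these names come from the Drill code base;
// they are hypothetical stand-ins for the checks described in the review comment.
class SpillControlSketch {

  /**
   * Build phase: spill when the free memory cannot hold a safety multiple
   * of one incoming build batch.
   */
  static boolean shouldSpillDuringBuild(long availableBytes,
                                        long buildBatchBytes,
                                        int minBatchesInAvailableMemory) {
    return availableBytes < buildBatchBytes * minBatchesInAvailableMemory;
  }

  /**
   * Post-build phase: every spilled partition needs room for one probe batch,
   * so spill more when the free memory cannot cover that.
   */
  static boolean shouldSpillPostBuild(long availableBytes,
                                      long probeBatchBytes,
                                      int spilledPartitions) {
    return availableBytes < probeBatchBytes * spilledPartitions;
  }

  /**
   * Initialization: shrink the partition count until the per-partition
   * reservation fits in the memory limit. Halving is an assumption here.
   */
  static int reducePartitions(int initialPartitions,
                              long memoryLimitBytes,
                              long bytesPerPartition) {
    int partitions = initialPartitions;
    while (partitions > 1 && partitions * bytesPerPartition > memoryLimitBytes) {
      partitions /= 2;
    }
    return partitions;
  }

  public static void main(String[] args) {
    final long mb = 1024L * 1024L;
    System.out.println(shouldSpillDuringBuild(10 * mb, 4 * mb, 3));
    System.out.println(shouldSpillPostBuild(5 * mb, mb, 8));
    System.out.println(reducePartitions(32, 16 * mb, mb));
  }
}
```

   Keeping each check as a pure function of byte counts makes the thresholds 
easy to unit test without an allocator or real batches.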
   
   Ran some tests: with the current defaults it is practically impossible to 
spill (32 partitions, 1024 rows per inner batch, and the key size together need 
much less than 1 MB, while each HJ gets at least 40 MB).
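
   As a rough back-of-the-envelope check of those numbers, the sketch below 
multiplies them out. It reads the figure as partitions x rows x key bytes, and 
the 8-byte key is an assumption made only for illustration (the comment does not 
state a key size); the class and method names are hypothetical. Under those 
assumptions the footprint is 256 KiB, about 160 times smaller than 40 MiB.

```java
// Illustrative arithmetic only; the 8-byte key size is an assumption,
// and the names below are hypothetical.
class SpillHeadroomEstimate {

  /** One batch of keys per partition: partitions x rows x bytes per key. */
  static long batchFootprintBytes(int partitions, int rowsPerBatch, int keyBytes) {
    return (long) partitions * rowsPerBatch * keyBytes;
  }

  public static void main(String[] args) {
    int assumedKeyBytes = 8;                       // assumption, not from the source
    long needed = batchFootprintBytes(32, 1024, assumedKeyBytes);
    long operatorMemory = 40L * 1024 * 1024;       // "at least 40 meg" per hash join
    System.out.println(needed + " bytes needed, " + operatorMemory + " bytes available");
    System.out.println("headroom factor: " + (operatorMemory / needed));
  }
}
```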
    

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
