Ben-Zvi commented on a change in pull request #1606: Drill 6845: Semi-Hash-Join to skip incoming build duplicates, automatically stop skipping if too few
URL: https://github.com/apache/drill/pull/1606#discussion_r247301472
##########
File path: exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/join/HashJoinSpillControlImpl.java
##########
@@ -0,0 +1,52 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.drill.exec.physical.impl.join;
+
+import org.apache.drill.exec.memory.BufferAllocator;
+import org.apache.drill.exec.record.RecordBatchSizer;
+import org.apache.drill.exec.record.VectorContainer;
+
+/**
+ * This class is currently used only for Semi-Hash-Join that avoids duplicates by the use of a hash table
+ * The method {@link HashJoinMemoryCalculator.HashJoinSpillControl#shouldSpill(VectorContainer)} returns true if the memory available now to the allocator if not enough
+ * to hold (a multiple of, for safety) a new allocated batch
+ */
+public class HashJoinSpillControlImpl implements HashJoinMemoryCalculator.HashJoinSpillControl {
+  private BufferAllocator allocator;
+  private int recordsPerBatch;
+  private int minBatchesInAvailableMemory;
+
+  HashJoinSpillControlImpl(BufferAllocator allocator, int recordsPerBatch, int minBatchesInAvailableMemory) {
+    this.allocator = allocator;
+    this.recordsPerBatch = recordsPerBatch;
+    this.minBatchesInAvailableMemory = minBatchesInAvailableMemory;
+  }
+
+  @Override
+  public boolean shouldSpill(VectorContainer currentVectorContainer) {
+    assert currentVectorContainer.hasRecordCount();

Review comment:
Another commit (3613689) addresses the above issues. The new "spill control" code was rearranged in a way similar to the regular memory calculator:

* The parameter to `shouldSpill()` was removed. Instead, that information (the inner build batch size) is calculated from the batchMemoryManager.
* A post-build class was added, which performs a similar calculation using the inner probe batch size plus the number of spilled partitions (added as a parameter to `initialize()`). It is called after the build, and spills whole partition(s) if there is not enough memory to hold one inner probe batch per spilled partition. (The `shouldSpill()` call now covers the "skipping semi" cases as well; this restores the order mentioned in the prior comments.)
* The number of partitions is reduced as needed, in a manner similar to the regular memory calculator: `initialize()` calls an internal `calculateMemoryUsage()`.

Ran some tests: with the current defaults it is practically impossible to spill (32 partitions, 1024 rows in the inner batch, key size: all together much less than 1 MB, and each HJ gets at least 40 MB).
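
For readers following the thread, the three checks described in the comment can be sketched roughly as below. This is only an illustration under stated assumptions, not the code in commit 3613689: the class `SpillControlSketch`, its method names, and the plain `long` byte counts standing in for the allocator and the batchMemoryManager are all hypothetical, and halving the partition count is an assumed reduction strategy.

```java
// Hypothetical, self-contained sketch; none of these names come from the Drill code base.
public class SpillControlSketch {

  // Build phase: spill when the memory still available cannot hold
  // a safety multiple of one newly allocated build batch.
  static boolean shouldSpill(long memoryLimit, long allocatedMemory,
                             long buildBatchSizeBytes, int minBatchesInAvailableMemory) {
    long available = memoryLimit - allocatedMemory;
    return available < buildBatchSizeBytes * minBatchesInAvailableMemory;
  }

  // Post-build phase: spill when there is no room for one probe batch
  // per partition that has already been spilled.
  static boolean shouldSpillPostBuild(long memoryLimit, long allocatedMemory,
                                      long probeBatchSizeBytes, int numSpilledPartitions) {
    long available = memoryLimit - allocatedMemory;
    return available < probeBatchSizeBytes * numSpilledPartitions;
  }

  // Initialization: reduce the partition count until the estimated
  // per-partition footprint fits in the memory limit (halving is an assumption).
  static int reducePartitions(long memoryLimit, long bytesPerPartition, int initialPartitions) {
    int partitions = initialPartitions;
    while (partitions > 1 && (long) partitions * bytesPerPartition > memoryLimit) {
      partitions /= 2;
    }
    return partitions;
  }

  public static void main(String[] args) {
    long mb = 1L << 20;
    // With a 40 MB limit and batches of about 1 MB, nothing needs to spill.
    System.out.println(shouldSpill(40 * mb, 0, mb, 2));
    System.out.println(reducePartitions(40 * mb, mb, 32));
  }
}
```

With numbers in the range quoted above (a footprint well under 1 MB against a 40 MB budget), neither check fires in this sketch, which is consistent with the observation that spilling is practically impossible under the current defaults.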
