Hi Aman: Thanks for your tips. I have rebased the latest code from the master branch . Yes, the spill-to-disk feature does changed the original implementation. I have adjusted my implementation according to the new feature. But as you say, it will take some challenge to integration as I noticed the spill-to-disk feature will continue to tune its implementation performance.
The BloomFilter was implemented natively in Drill , not an external library. It's implemented the algorithm of the paper which was mentioned by you. On Thu, May 31, 2018 at 1:56 AM Aman Sinha <[email protected]> wrote: > Hi Weijie, > I was hoping you could leverage the existing methods..so its good that you > found the ones that work for your use case. > One thing I want to point out (maybe you're already aware) .. the Hash Join > code has changed significantly in the master branch due to the > spill-to-disk feature. > So, this may pose some integration challenges for your run-time join > pushdown feature. > Also, one other question/clarification: for the bloom filter itself are > you implementing it natively in Drill or using an external library ? > > -Aman > > On Tue, May 29, 2018 at 8:23 PM, weijie tong <[email protected]> > wrote: > > > I found ClassGenerator's nestEvalBlock(JBlock block) and > unNestEvalBlock() > > which has the same effect to what I change to the ClassGenerator. So I > give > > up what I change to the ClassGenerator and hope this can help someone > else. > > > > On Tue, May 29, 2018 at 1:53 PM weijie tong <[email protected]> > > wrote: > > > > > The code formatting is not nice. Put them again: > > > > > > private void setupGetBuild64Hash(ClassGenerator<HashTable> cg, > > MappingSet > > > incomingMapping, VectorAccessible batch, LogicalExpression[] keyExprs, > > > TypedFieldId[] buildKeyFieldIds) > > > throws SchemaChangeException { > > > cg.setMappingSet(incomingMapping); > > > if (keyExprs == null || keyExprs.length == 0) { > > > cg.getEvalBlock()._return(JExpr.lit(0)); > > > } > > > String seedValue = "seedValue"; > > > String fieldId = "fieldId"; > > > LogicalExpression seed = > > > ValueExpressions.getParameterExpression(seedValue, Types.required( > > > TypeProtos.MinorType.INT)); > > > > > > LogicalExpression fieldIdParamExpr = > > > ValueExpressions.getParameterExpression(fieldId, Types.required( > > > TypeProtos.MinorType.INT) ); > > > HoldingContainer fieldIdParamHolder = cg.addExpr(fieldIdParamExpr); > > > int i = 0; > > > for (LogicalExpression expr : keyExprs) { > > > TypedFieldId targetTypeFieldId = buildKeyFieldIds[i]; > > > ValueExpressions.IntExpression targetBuildFieldIdExp = new > > > ValueExpressions.IntExpression(targetTypeFieldId.getFieldIds()[0], > > > ExpressionPosition.UNKNOWN); > > > > > > JFieldRef targetBuildSideFieldId = > cg.addExpr(targetBuildFieldIdExp, > > > ClassGenerator.BlkCreateMode.TRUE_IF_BOUND).getValue(); > > > JBlock ifBlock = > > > cg.getEvalBlock()._if(fieldIdParamHolder.getValue(). > > eq(targetBuildSideFieldId))._then(); > > > //specify a special JBlock which is a inner one of the eval block > to > > > the ClassGenerator to substitute the returned JBlock of getEvalBlock() > > > cg.setCustomizedEvalInnerBlock(ifBlock); > > > LogicalExpression hashExpression = > > > HashPrelUtil.getHashExpression(expr, seed, incomingProbe != null); > > > LogicalExpression materializedExpr = > > > ExpressionTreeMaterializer.materializeAndCheckErrors(hashExpression, > > batch, > > > context.getFunctionRegistry()); > > > HoldingContainer hash = cg.addExpr(materializedExpr, > > > ClassGenerator.BlkCreateMode.TRUE_IF_BOUND); > > > ifBlock._return(hash.getValue()); > > > //reset the customized block to null ,so the getEvalBlock() return > > the > > > truly eval JBlock > > > cg.setCustomizedEvalInnerBlock(null); > > > i++; > > > } > > > cg.getEvalBlock()._return(JExpr.lit(0)); > > > } > > > > > > > > > > > > > > > public long getBuild64HashCodeInner(int incomingRowIdx, int seedValue, > > int > > > fieldId) > > > throws SchemaChangeException > > > { > > > { > > > IntHolder fieldId12 = new IntHolder(); > > > fieldId12 .value = fieldId; > > > if (fieldId12 .value == constant14 .value) { > > > IntHolder out18 = new IntHolder(); > > > { > > > out18 .value = vv15 .getAccessor().get((incomingRowIdx)); > > > } > > > IntHolder seedValue19 = new IntHolder(); > > > seedValue19 .value = seedValue; > > > //---- start of eval portion of hash32AsDouble function. ----// > > > IntHolder out20 = new IntHolder(); > > > { > > > final IntHolder out = new IntHolder(); > > > IntHolder in = out18; > > > IntHolder seed = seedValue19; > > > > > > Hash32WithSeedAsDouble$IntHash_eval: { > > > out.value = > > > org.apache.drill.exec.expr.fn.impl.HashHelper.hash32((double) in.value, > > > seed.value); > > > } > > > > > > out20 = out; > > > } > > > //---- end of eval portion of hash32AsDouble function. ----// > > > return out20 .value; > > > } > > > return 0; > > > } > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > On Tue, May 29, 2018 at 1:47 PM weijie tong <[email protected]> > > > wrote: > > > > > >> HI Paul: > > >> > > >> Thanks for your enthusiasm. I have managed this skill as you ever > > >> mentioned me at another mail thread. It's really helpful ,thanks for > > your > > >> valuable work. > > >> > > >> Now I have solved this tough problem by adding a customized JBlock > > >> member field to the ClassGenerator. So once you want the > getEvalBlock() > > of > > >> the ClassGenerator to return a inner customized JBlock , then you set > > this > > >> member, if you want the method to return eval self JBlock , you reset > > this > > >> member to null. > > >> > > >> Here is my changed setup method : > > >> > > >> > > >> private void setupGetBuild64Hash(ClassGenerator<HashTable> cg, > > MappingSet incomingMapping, VectorAccessible batch, LogicalExpression[] > > keyExprs, TypedFieldId[] buildKeyFieldIds) > > >> throws SchemaChangeException { > > >> cg.setMappingSet(incomingMapping); > > >> if (keyExprs == null || keyExprs.length == 0) { > > >> cg.getEvalBlock()._return(JExpr.lit(0)); > > >> } > > >> String seedValue = "seedValue"; > > >> String fieldId = "fieldId"; > > >> LogicalExpression seed = > ValueExpressions.getParameterExpression(seedValue, > > Types.required(TypeProtos.MinorType.INT)); > > >> > > >> LogicalExpression fieldIdParamExpr = ValueExpressions. > > getParameterExpression(fieldId, Types.required(TypeProtos.MinorType.INT) > > ); > > >> HoldingContainer fieldIdParamHolder = cg.addExpr(fieldIdParamExpr); > > >> int i = 0; > > >> for (LogicalExpression expr : keyExprs) { > > >> TypedFieldId targetTypeFieldId = buildKeyFieldIds[i]; > > >> ValueExpressions.IntExpression targetBuildFieldIdExp = new > > ValueExpressions.IntExpression(targetTypeFieldId.getFieldIds()[0], > > ExpressionPosition.UNKNOWN); > > >> > > >> JFieldRef targetBuildSideFieldId = > cg.addExpr(targetBuildFieldIdExp, > > ClassGenerator.BlkCreateMode.TRUE_IF_BOUND).getValue(); > > >> JBlock ifBlock = cg.getEvalBlock()._if( > > fieldIdParamHolder.getValue().eq(targetBuildSideFieldId))._then(); > > >> //specify a special JBlock which is a inner one of the eval block > > to the ClassGenerator to substitute the returned JBlock of getEvalBlock() > > >> cg.setCustomizedEvalInnerBlock(ifBlock); > > >> LogicalExpression hashExpression = > HashPrelUtil.getHashExpression(expr, > > seed, incomingProbe != null); > > >> LogicalExpression materializedExpr = ExpressionTreeMaterializer. > > materializeAndCheckErrors(hashExpression, batch, > > context.getFunctionRegistry()); > > >> HoldingContainer hash = cg.addExpr(materializedExpr, > > ClassGenerator.BlkCreateMode.TRUE_IF_BOUND); > > >> ifBlock._return(hash.getValue()); > > >> //reset the customized block to null ,so the getEvalBlock() return > > the truly eval JBlock > > >> cg.setCustomizedEvalInnerBlock(null); > > >> i++; > > >> } > > >> cg.getEvalBlock()._return(JExpr.lit(0)); > > >> } > > >> > > >> > > >> The corresponding generated codes : > > >> > > >> public long getBuild64HashCodeInner(int incomingRowIdx, int > > seedValue, int fieldId) > > >> throws SchemaChangeException > > >> { > > >> { > > >> IntHolder fieldId12 = new IntHolder(); > > >> fieldId12 .value = fieldId; > > >> if (fieldId12 .value == constant14 .value) { > > >> IntHolder out18 = new IntHolder(); > > >> { > > >> out18 .value = vv15 .getAccessor().get(( > > incomingRowIdx)); > > >> } > > >> IntHolder seedValue19 = new IntHolder(); > > >> seedValue19 .value = seedValue; > > >> //---- start of eval portion of hash32AsDouble > > function. ----// > > >> IntHolder out20 = new IntHolder(); > > >> { > > >> final IntHolder out = new IntHolder(); > > >> IntHolder in = out18; > > >> IntHolder seed = seedValue19; > > >> > > >> Hash32WithSeedAsDouble$IntHash_eval: { > > >> out.value = > org.apache.drill.exec.expr.fn.impl.HashHelper.hash32((double) > > in.value, seed.value); > > >> } > > >> > > >> out20 = out; > > >> } > > >> //---- end of eval portion of hash32AsDouble function. > > ----// > > >> return out20 .value; > > >> } > > >> return 0; > > >> } > > >> } > > >> > > >> > > >> > > >> Some other explanation: > > >> 1st : The if checking won't hurt the performance , as I invoke this > > >> method column by column , so it's branch predication friendly. > > >> 2nd: I will use the murmur3_64 not the murmur3_32 ,since the > efficient > > >> bloom filter algorithm needs the 64 bit hash code to avoid the > conflict. > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> > > >> On Tue, May 29, 2018 at 12:37 PM Paul Rogers > <[email protected] > > > > > >> wrote: > > >> > > >>> Hi Weijie, > > >>> > > >>> Seeing the discussion about the details of JCodeModel suggests you > may > > >>> be trying to debug your generated code at the level of the code > > generator. > > >>> > > >>> Some time ago we added the ability to step through the generated > code. > > >>> Look for the following line in the generator code: > > >>> > > >>> > > >>> // Uncomment out this line to debug the generated code. > > >>> > > >>> // cg.saveCodeForDebugging(true); > > >>> > > >>> > > >>> Uncomment the code line and Drill will save each generated file to a > > >>> configured location (which, if I recall correctly, is > > /tmp/drill/codegen, > > >>> though it may have changed after Tim's test directory changes.) > > >>> > > >>> Then, set a breakpoint in the template setup() method and you can > step > > >>> directly into the generated doSetup() method. Same for the eval() > > method. > > >>> > > >>> This way, you can not only see the generated code, you can step > through > > >>> it. I've found this to be a far easier way to understand the > generated > > code > > >>> than the older techniques folks have used (look at byte codes, use > > print > > >>> statements, brute force reasoning, etc.) > > >>> > > >>> Tim, Boaz and others have used this technique more recently and can > > >>> probably give you additional pointers. > > >>> > > >>> Thanks, > > >>> - Paul > > >>> > > >>> > > >>> > > >>> On Monday, May 28, 2018, 8:52:19 PM PDT, weijie tong < > > >>> [email protected]> wrote: > > >>> > > >>> @aman thanks for your reply. "For the ifBlock, do you need an > _else() > > >>> block > > >>> also ?" I give a default return logic at the method, so I don't need > > the > > >>> _else() block. I have noticed the IfExpression's evaluation method > at > > >>> EvaluationVisitor which also uses the JConditional . But that also > > >>> doesn't > > >>> match my requirement. I think the key point here is the > > >>> FunctionHolderExpression and ValueVectorReadExpression will put their > > >>> corresponding generated codes to the eval method's JBlock , not our > > >>> specific IfBlock which is a inner block of the eval method's JBlock . > > >>> > > >>> So it seems I should make some changes to the ClassGenerator to let > the > > >>> getEvalBlock return the IfBlock (maybe accurately the JConditional's > > then > > >>> block) or implement some special FunctionHolderExpression > > >>> 、ValueVectorReadExpression and corresponding visiting methods at the > > >>> EvaluationVisitor to generate the special code blocks. Hope someone > who > > >>> are > > >>> familiar with these part of codes to point out whether there are more > > >>> easy > > >>> or different choices to achieve the target. > > >>> > > >>> To make discussion more accurate, I put the generated codes of the > > >>> previous > > >>> setupGetBuild64Hash method here: > > >>> > > >>> public long getBuild64HashCodeInner(int incomingRowIdx, int > > >>> seedValue, int fieldId) > > >>> throws SchemaChangeException > > >>> { > > >>> { > > >>> IntHolder fieldId16 = new IntHolder(); > > >>> fieldId16 .value = fieldId; > > >>> if (fieldId16 .value == constant18 .value) { > > >>> return out24 .value; > > >>> } > > >>> IntHolder out22 = new IntHolder(); > > >>> { > > >>> out22 .value = vv19 .getAccessor().get(( > > incomingRowIdx)); > > >>> } > > >>> IntHolder seedValue23 = new IntHolder(); > > >>> seedValue23 .value = seedValue; > > >>> //---- start of eval portion of hash32AsDouble function. > > >>> ----// > > >>> IntHolder out24 = new IntHolder(); > > >>> { > > >>> final IntHolder out = new IntHolder(); > > >>> IntHolder in = out22; > > >>> IntHolder seed = seedValue23; > > >>> > > >>> Hash32WithSeedAsDouble$IntHash_eval: { > > >>> out.value = > > >>> org.apache.drill.exec.expr.fn.impl.HashHelper.hash32((double) > > >>> in.value, seed.value); > > >>> } > > >>> > > >>> out24 = out; > > >>> } > > >>> //---- end of eval portion of hash32AsDouble function. > > ----// > > >>> if (fieldId16 .value == constant18 .value) { > > >>> return out26 .value; > > >>> } > > >>> IntHolder seedValue25 = new IntHolder(); > > >>> seedValue25 .value = seedValue; > > >>> //---- start of eval portion of hash32AsDouble function. > > >>> ----// > > >>> IntHolder out26 = new IntHolder(); > > >>> { > > >>> final IntHolder out = new IntHolder(); > > >>> IntHolder in = out22; > > >>> IntHolder seed = seedValue25; > > >>> > > >>> Hash32WithSeedAsDouble$IntHash_eval: { > > >>> out.value = > > >>> org.apache.drill.exec.expr.fn.impl.HashHelper.hash32((double) > > >>> in.value, seed.value); > > >>> } > > >>> > > >>> out26 = out; > > >>> } > > >>> //---- end of eval portion of hash32AsDouble function. > > ----// > > >>> return 0; > > >>> } > > >>> } > > >>> > > >>> > > >>> > > >>> > > >>> > > >>> On Tue, May 29, 2018 at 10:51 AM Aman Sinha <[email protected]> > > >>> wrote: > > >>> > > >>> > sorry, the previous email is incomplete. > > >>> > For the ifBlock, do you need an _else() block also ? > > >>> > > > >>> > I have sometimes found that 'JConditional' is a good way to break > > down > > >>> the > > >>> > logic further. Please see example usages of JConditional here [1]. > > >>> > > > >>> > -Aman > > >>> > > > >>> > [1] > > >>> > > > >>> > > > >>> https://www.programcreek.com/java-api-examples/?api=com. > > sun.codemodel.JBlock > > >>> > > > >>> > On Mon, May 28, 2018 at 7:46 PM, Aman Sinha <[email protected]> > > >>> wrote: > > >>> > > > >>> > > Hi Weijie, > > >>> > > It would be a little cumbersome to debug such issues over email > > >>> since one > > >>> > > has to look at the generated code output and iteratively debug. > > >>> > > Couple of thoughts I have that might help: > > >>> > > > > >>> > > For this particular if-then block, should you also > > >>> > > JBlock ifBlock = > > >>> > > cg.getEvalBlock()._if(fieldIdParamHolder.getValue().eq(targe > > >>> > > tBuildSideFieldId))._then(); > > >>> > > > > >>> > > > > >>> > > > > >>> > > On Mon, May 28, 2018 at 4:17 AM, weijie tong < > > >>> [email protected]> > > >>> > > wrote: > > >>> > > > > >>> > >> HI All: > > >>> > >> Through implementing the JPPD feature ( > > >>> > >> https://issues.apache.org/jira/browse/DRILL-6385) , I was > blocked > > >>> by > > >>> > the > > >>> > >> problem: how to get the hash code of each build side of the hash > > >>> join > > >>> > >> columns through the dynamic generated java code. Hope someone > can > > >>> give > > >>> > >> some > > >>> > >> advice. > > >>> > >> > > >>> > >> I supposed to add methods as below to the HashTableTemplate : > > >>> > >> > > >>> > >> public long getBuild64HashCode(int incomingRowIdx, int > seedValue, > > >>> int > > >>> > >> fieldId) throws SchemaChangeException{ > > >>> > >> return getBuild64HashCodeInner(incomingRowIdx, seedValue, > > >>> fieldId); > > >>> > >> } > > >>> > >> > > >>> > >> protected abstract long > > >>> > >> getBuild64HashCodeInner(@Named("incomingRowIdx") int > > incomingRowIdx, > > >>> > >> @Named("seedValue") int seedValue, @Named("fieldId") int > fieldId) > > >>> > >> throws SchemaChangeException; > > >>> > >> > > >>> > >> > > >>> > >> The high level code to invoke the getBuild64HashCode method > is > > >>> at the > > >>> > >> HashJoinBatch's executeBuildPhase() : > > >>> > >> > > >>> > >> //create runtime filter > > >>> > >> if (cycleNum == 0 && enableRuntimeFilter) { > > >>> > >> //create runtime filter and send out async > > >>> > >> int condFieldIndex = 0; > > >>> > >> for (BloomFilter bloomFilter : bloomFilters) { > > >>> > >> //VV > > >>> > >> for (int ind = 0; ind < currentRecordCount; ind++) { > > >>> > >> long hashCode = partitions[0].getBuild64HashCode(ind, > > >>> > >> condFieldIndex); > > >>> > >> bloomFilter.insert(hashCode); > > >>> > >> } > > >>> > >> condFieldIndex++; > > >>> > >> } > > >>> > >> //TODO sered out async > > >>> > >> } > > >>> > >> > > >>> > >> > > >>> > >> As you know, the abstract method getBuild64HashCodeInner needs > to > > >>> > >> calculate the hash codes of each build side column by the > fieldId > > >>> input > > >>> > >> parameter. In order to achieve this target, I plan to have > > different > > >>> > >> solving parts corresponding to different column ValueVector , > > using > > >>> the > > >>> > if > > >>> > >> statement to distinguish different solving parts through the id > of > > >>> the > > >>> > >> column. The corresponding method to generate the dynamic codes > > is > > >>> as > > >>> > >> below: > > >>> > >> > > >>> > >> private void setupGetBuild64Hash(ClassGenerator<HashTable> cg, > > >>> > >> MappingSet incomingMapping, VectorAccessible batch, > > >>> > >> LogicalExpression[] keyExprs, TypedFieldId[] buildKeyFieldIds) > > >>> > >> throws SchemaChangeException { > > >>> > >> cg.setMappingSet(incomingMapping); > > >>> > >> if (keyExprs == null || keyExprs.length == 0) { > > >>> > >> cg.getEvalBlock()._return(JExpr.lit(0)); > > >>> > >> } > > >>> > >> String seedValue = "seedValue"; > > >>> > >> String fieldId = "fieldId"; > > >>> > >> LogicalExpression seed = > > >>> > >> ValueExpressions.getParameterExpression(seedValue, > > >>> > >> Types.required(TypeProtos.MinorType.INT)); > > >>> > >> > > >>> > >> LogicalExpression fieldIdParamExpr = > > >>> > >> ValueExpressions.getParameterExpression(fieldId, > > >>> > >> Types.required(TypeProtos.MinorType.INT) ); > > >>> > >> HoldingContainer fieldIdParamHolder = > > cg.addExpr(fieldIdParamExpr); > > >>> > >> int i = 0; > > >>> > >> for (LogicalExpression expr : keyExprs) { > > >>> > >> TypedFieldId targetTypeFieldId = buildKeyFieldIds[i]; > > >>> > >> ValueExpressions.IntExpression targetBuildFieldIdExp = new > > >>> > >> ValueExpressions.IntExpression(targetTypeFieldId.getFieldIds( > > )[0], > > >>> > >> ExpressionPosition.UNKNOWN); > > >>> > >> JFieldRef targetBuildSideFieldId = > > >>> > >> cg.addExpr(targetBuildFieldIdExp, > > >>> > >> ClassGenerator.BlkCreateMode.TRUE_IF_BOUND).getValue(); > > >>> > >> JBlock ifBlock = > > >>> > >> cg.getEvalBlock()._if(fieldIdParamHolder.getValue().eq(targe > > >>> > >> tBuildSideFieldId))._then(); > > >>> > >> > > >>> > >> LogicalExpression hashExpression = > > >>> > >> HashPrelUtil.getHashExpression(expr, seed, incomingProbe != > > null); > > >>> > >> LogicalExpression materializedExpr = > > >>> > >> ExpressionTreeMaterializer.materializeAndCheckErrors( > > hashExpression, > > >>> > >> batch, context.getFunctionRegistry()); > > >>> > >> HoldingContainer hash = cg.addExpr(materializedExpr, > > >>> > >> ClassGenerator.BlkCreateMode.FALSE); > > >>> > >> > > >>> > >> > > >>> > >> ifBlock._return(hash.getValue()); > > >>> > >> i++; > > >>> > >> } > > >>> > >> cg.getEvalBlock()._return(JExpr.lit(0)); > > >>> > >> > > >>> > >> } > > >>> > >> > > >>> > >> But unfortunately, the generated codes are not what I expected. > > The > > >>> > codes > > >>> > >> to read ValueVector , calculate hash code of the read value do > not > > >>> stay > > >>> > in > > >>> > >> the if block. So how can I let the related codes stay in the if > > >>> block ? > > >>> > >> > > >>> > > > > >>> > > > > >>> > > > >> > > >> > > >
