advancedxy commented on issue #121:
URL: 
https://github.com/apache/arrow-datafusion-comet/issues/121#issuecomment-1981343136

    I did some research on supporting `InSubqueryExec`. I think we should postpone 
the support a little bit, at least until `Comet` supports Join operators.
   
   `InSubqueryExec` is mainly used for
   1. DPP (dynamic partition pruning), which evaluates the `IN` predicate on the 
driver side.
   2. Some special cases, which actually perform the `inSet` evaluation on the 
executor side (for Comet, the native side).
   
   For the first part, it would be pretty straightforward to support on the 
`Comet` side, as all the evaluation happens on the driver (JVM) side. We can 
model that like `InSubqueryExec`: prepare the subqueries first, apply any 
needed expression and plan transforms, and we are good to go. However, DPP 
applies to Join operators, so it would be reasonable to add DPP support after we 
have Join operators in Comet.
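
To make the driver-side case concrete, here is a minimal sketch of what pruning with an already-collected subquery result amounts to. It is written in Rust only for illustration; the real logic would live on the JVM side, and `Partition` and `prune_partitions` are hypothetical names, not Spark or Comet APIs.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a scan partition, identified by its
/// partition-key value. Not a Spark or Comet type.
#[derive(Debug, Clone, PartialEq)]
pub struct Partition {
    pub key: i64,
    pub path: String,
}

/// Keep only the partitions whose key appears in the collected subquery
/// result. Input order is preserved; duplicate result values are ignored.
pub fn prune_partitions(partitions: Vec<Partition>, subquery_result: &[i64]) -> Vec<Partition> {
    // Build the lookup set once so each partition check is O(1) on average.
    let keys: HashSet<i64> = subquery_result.iter().copied().collect();
    partitions
        .into_iter()
        .filter(|p| keys.contains(&p.key))
        .collect()
}
```

Since the pruning finishes before any scan starts, the native side would only ever see the surviving partitions in this model.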
   
   For the second part, it's slightly more complicated. Per my understanding, we 
have multiple options:
   1. Like we did for `ScalarSubqueryExec`, we can add an `InSubquery` 
PhysicalExpr implementation. The main problem is how to transfer the list data 
from the JVM to the native side. I'm skeptical about just transferring the Java 
object array via a JNI call, as the list might be pretty big. Maybe we should 
convert it to a RecordBatch/CometVector and then pass that to the native 
side?
   2. Instead of implementing `InSubquery`, we can rewrite it as an `InSet` 
expression, since the subquery list has already been collected before we 
actually execute the plan. The problems are:
       - Currently, we don't have a way to rewrite/transform the native 
operator after we have created it
       - The proto message presumably has a size limit, something like 64MB? It 
will not work for a huge `inSet`.
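
As a rough sketch of option 2, the rewrite could be guarded by a size check so that oversized lists are left alone. Everything here is an assumption for illustration: the `Expr` enum is a simplified stand-in rather than Comet's proto or expression types, the values are restricted to `i64`, the 8-bytes-per-value estimate ignores the actual protobuf encoding, and the 64 MB cap is just the guess from above, not a verified limit.

```rust
use std::collections::HashSet;

/// Assumed cap on the serialized size; 64 MB is a guess, not a verified limit.
pub const ASSUMED_MAX_PROTO_BYTES: usize = 64 * 1024 * 1024;

/// Hypothetical, simplified expression tree (not Comet's real types).
#[derive(Debug, PartialEq)]
pub enum Expr {
    /// Membership test against a subquery whose result is not inlined.
    InSubquery { column: String, subquery_id: u64 },
    /// Membership test against an inlined, deduplicated list of values.
    InSet { column: String, values: Vec<i64> },
}

/// Rewrite `InSubquery` into `InSet` when the inlined values are estimated to
/// fit under `max_bytes`; otherwise return the expression unchanged so the
/// caller can fall back to another strategy.
pub fn rewrite_in_subquery(expr: Expr, collected: &[i64], max_bytes: usize) -> Expr {
    match expr {
        Expr::InSubquery { column, subquery_id } => {
            // Deduplicate, then sort so the rewritten expression is deterministic.
            let unique: HashSet<i64> = collected.iter().copied().collect();
            let mut values: Vec<i64> = unique.into_iter().collect();
            values.sort_unstable();

            // Crude size estimate: one fixed-width i64 per value.
            let estimated_bytes = values.len() * std::mem::size_of::<i64>();
            if estimated_bytes <= max_bytes {
                Expr::InSet { column, values }
            } else {
                Expr::InSubquery { column, subquery_id }
            }
        }
        // Any other expression is passed through untouched.
        other => other,
    }
}
```

This assumes the rewrite runs on the expression before it is serialized, which sidesteps rather than solves the first problem above (no way to transform a native operator after it is created).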
   
   cc @viirya @sunchao, I'd appreciate any further insights you have on this 
topic.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
