[ 
https://issues.apache.org/jira/browse/FLINK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642927#comment-16642927
 ] 

Fabian Hueske commented on FLINK-10474:
---------------------------------------

Thanks for raising these points [~pnowojski].

I forgot to explicitly mention that this issue is only about IN with a list of 
literal values. It is quite common to have dedicated optimizations for constant 
expressions and terms. Calcite already translates IN with less than (IIRC) 25 
values into a disjunctive predicate. This is happening for batch and streaming 
queries.

The main issue here is to adjust Calcite's threshold when to switch to a join. 
This should be a lightweight change.

In addition, we discussed adding a special execution code path for IN 
predicates. Briefly looking over the PR, it seems that most of the code is 
related to this feature (adding three optimization rules). I agree, that this 
is a lot of code and adding optimization overhead for such a specific 
optimization. Maybe, there's a better way, e.g., if the code generator checks 
whether a disjunctive predicate only contains constant equality predicates. In 
fact, the optimized IN evaluation is beneficial for batch and streaming 
queries. I don't think that this would be a wasted effort once we have support 
for joins on bounded, non-updating input. In most cases an in-memory HashMap 
should be better for IN with literal values than adding more operators with 
state that needs to be checkpointed and network shuffles.

Regarding your points
 # Yes, it optimizes a special case but this case was reported by a user. The 
general solution is not (even remotely) in sight and I think the optimization 
(if we can evaluate IN with a HashMap) will still be beneficial once the 
generic solution was implemented.
 # Increasing the threshold uses an existing mechanism in Calcite. I agree that 
we should find a less intrusive way to add special execution for IN. The 
current approach in the PR adds much complexity, but can improve the 
performance for IN for batch and streaming. If we don't find a good way to 
optimize IN, we can only adjust the threshold and evaluate as disjunctive 
predicates.
 # You are right. We should try to solve this without adding optimization rules.

> Don't translate IN to JOIN with VALUES for streaming queries
> ------------------------------------------------------------
>
>                 Key: FLINK-10474
>                 URL: https://issues.apache.org/jira/browse/FLINK-10474
>             Project: Flink
>          Issue Type: Improvement
>          Components: Table API & SQL
>    Affects Versions: 1.6.1, 1.7.0
>            Reporter: Fabian Hueske
>            Assignee: Hequn Cheng
>            Priority: Major
>              Labels: pull-request-available
>
> IN clauses are translated to JOIN with VALUES if the number of elements in 
> the IN clause exceeds a certain threshold. This should not be done, because a 
> streaming join is very heavy and materializes both inputs (which is fine for 
> the VALUES) input but not for the other.
> There are two ways to solve this:
>  # don't translate IN to a JOIN at all
>  # translate it to a JOIN but have a special join strategy if one input is 
> bound and final (non-updating)
> Option 1. should be easy to do, option 2. requires much more effort.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to