[
https://issues.apache.org/jira/browse/FLINK-10474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642927#comment-16642927
]
Fabian Hueske commented on FLINK-10474:
---------------------------------------
Thanks for raising these points [~pnowojski].
I forgot to explicitly mention that this issue is only about IN with a list of
literal values. It is quite common to have dedicated optimizations for constant
expressions and terms. Calcite already translates IN with less than (IIRC) 25
values into a disjunctive predicate. This is happening for batch and streaming
queries.
The main issue here is to adjust Calcite's threshold when to switch to a join.
This should be a lightweight change.
In addition, we discussed adding a special execution code path for IN
predicates. Briefly looking over the PR, it seems that most of the code is
related to this feature (adding three optimization rules). I agree, that this
is a lot of code and adding optimization overhead for such a specific
optimization. Maybe, there's a better way, e.g., if the code generator checks
whether a disjunctive predicate only contains constant equality predicates. In
fact, the optimized IN evaluation is beneficial for batch and streaming
queries. I don't think that this would be a wasted effort once we have support
for joins on bounded, non-updating input. In most cases an in-memory HashMap
should be better for IN with literal values than adding more operators with
state that needs to be checkpointed and network shuffles.
Regarding your points
# Yes, it optimizes a special case but this case was reported by a user. The
general solution is not (even remotely) in sight and I think the optimization
(if we can evaluate IN with a HashMap) will still be beneficial once the
generic solution was implemented.
# Increasing the threshold uses an existing mechanism in Calcite. I agree that
we should find a less intrusive way to add special execution for IN. The
current approach in the PR adds much complexity, but can improve the
performance for IN for batch and streaming. If we don't find a good way to
optimize IN, we can only adjust the threshold and evaluate as disjunctive
predicates.
# You are right. We should try to solve this without adding optimization rules.
> Don't translate IN to JOIN with VALUES for streaming queries
> ------------------------------------------------------------
>
> Key: FLINK-10474
> URL: https://issues.apache.org/jira/browse/FLINK-10474
> Project: Flink
> Issue Type: Improvement
> Components: Table API & SQL
> Affects Versions: 1.6.1, 1.7.0
> Reporter: Fabian Hueske
> Assignee: Hequn Cheng
> Priority: Major
> Labels: pull-request-available
>
> IN clauses are translated to JOIN with VALUES if the number of elements in
> the IN clause exceeds a certain threshold. This should not be done, because a
> streaming join is very heavy and materializes both inputs (which is fine for
> the VALUES) input but not for the other.
> There are two ways to solve this:
> # don't translate IN to a JOIN at all
> # translate it to a JOIN but have a special join strategy if one input is
> bound and final (non-updating)
> Option 1. should be easy to do, option 2. requires much more effort.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)