Gustavo de Morais created FLINK-37844:
-----------------------------------------
Summary: FLIP-516 Optimization: Push down projections for
StreamingMultiJoinOperator
Key: FLINK-37844
URL: https://issues.apache.org/jira/browse/FLINK-37844
Project: Flink
Issue Type: Improvement
Reporter: Gustavo de Morais
We're currently adding support for a StreamingMultiJoinOperator which is able
to join N inputs. There are multiple minor optimizations we might be able to do
that weren't so easy to do with multiple chained binary joins. One of them is
materializing into state only attributes that are either joined in any of the N
- 1 join conditions or are projected in the final output. We'd have to do the
following:
* We already have the information of used fields for each input in
joinAttributeMap and can either pass that to the operator or add a new method
to the join extractor.
* The MultiJoin will contain the list of fields to be projected. We might have
to adapt and expose that as a map per inputid when creating the FlinkMultiJoin.
* When adding a record to state, we remove attributes that will not be used in
join conditions or projected.
* If we use null for these attributes, we don't have to adapt the logic. If we
recreate rows with a smaller arity, multiple places have to be adjusted so that
all our index-based logic is updated and correct.
Obs: this was a even more significant problem for binary joins, since we
materialized all attributes for all intermediate results. However, it's also
relevant here. I plan to measure impacts for each of the optimizations before
adding them [based on a
benchmark|https://github.com/apache/flink-benchmarks?tab=readme-ov-file#general-remarks],
and we'll first merge the operator. However, I'll be documenting the
optimizations with tickets so we track them here. This ticket arose from a
discussion with [~roman]
[here.|https://github.com/apache/flink/pull/26313#discussion_r2105917437]
--
This message was sent by Atlassian Jira
(v8.20.10#820010)