[ https://issues.apache.org/jira/browse/FLINK-16392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
ASF GitHub Bot updated FLINK-16392: ----------------------------------- Labels: pull-request-available (was: ) > oneside sorted cache in intervaljoin > ------------------------------------ > > Key: FLINK-16392 > URL: https://issues.apache.org/jira/browse/FLINK-16392 > Project: Flink > Issue Type: Improvement > Components: API / DataStream > Affects Versions: 1.10.0 > Reporter: Chen Qin > Priority: Major > Labels: pull-request-available > Fix For: 1.11.0 > > > IntervalJoin is getting lots of usecases. Those use cases shares following > similar pattern > * left stream pulled from static dataset periodically > * lookup time range is very large (days weeks) > * right stream is web traffic with high QPS > In current interval join implementation, we treat both streams equal. > Specifically as rocksdb fetch and update getting more expensive, performance > took hit and unblock large use cases. > In proposed implementation, we plan to introduce two changes > * allow user opt-in in ProcessJoinFunction if they want to skip scan when > intervaljoin operator receive events from left stream(static data set) > * build sortedMap from otherBuffer of each seen key granularity > ** expedite right stream lookup of left buffers without access rocksdb > everytime > ** if a key see event from left side, it cleanup buffer and load buffer from > right side > > Open discussion > * how to control cache size? > ** TBD > * how to avoid dirty cache > ** if a given key see insertion from other side, cache will be cleared for > that key and rebuild. This is a small overhead to populate cache, compare > with current rocksdb implemenation, we need do full loop at every event. It > saves on bucket scan logic. > * what happens when checkpoint/restore > ** state still persists in statebackend, clear cache and rebuild of each new > key seen. > -- This message was sent by Atlassian Jira (v8.3.4#803005)