Hi there,

I would like kick off discussion on
https://issues.apache.org/jira/browse/FLINK-16392 and discuss what is best
way moving forward. Here is problem statement and proposal we have in mind.
Please kindly provide feedback.

Native intervaljoin rely on statebackend(e.g rocksdb) to insert/fetch left
and right buffer. This design choice reduce minimize heap memory footprint
while bounded process throughput of single taskmanager iops to rocksdb
access speed. Here at Pinterest, we have some large use cases where
developers join large and slow evolving data stream (e.g post updates in
last 28 days) with web traffic datastream (e.g post views up to 28 days
after given update).

This post some challenge to current implementation of intervaljoin

   - partitioned rocksdb needs to keep both updates and views for 28 days,
   large buffer(especially view stream side) cause rocksdb slow down and lead
   to overall interval join performance degregate quickly as state build up.


   - view stream is web scale, even after setting large parallelism it can
   put lot of pressure on each subtask and backpressure entire job

In proposed implementation, we plan to introduce two changes

   - support ProcessJoinFunction settings to opt-in earlier cleanup time of
   right stream(e.g view stream don't have to stay in buffer for 28 days and
   wait for update stream to join, related post views happens after update in
   event time semantic) This optimization can reduce state size to improve
   rocksdb throughput. If extreme case, user can opt-in in flight join and
   skip write into right view stream buffer to save iops budget on each subtask


   - support ProcessJoinFunction settings to expedite keyed lookup of slow
   changing stream. Instead of every post view pull post updates from rocksdb.
   user can opt-in and having one side buffer cache available in memory. If a
   given post update, cache load recent views from right buffer and use
   sortedMap to find buckets. If a given post view, cache load recent updates
   from left buffer to memory. When another view for that post arrives, flink
   save cost of rocksdb access.

Thanks,
Chen Qin

Reply via email to