[GitHub] [flink] chenqin opened a new pull request #11291: [FLINK-16392] [API / DataStream] oneside sorted cache in intervaljoin

GitBox Mon, 02 Mar 2020 15:14:06 -0800

chenqin opened a new pull request #11291: [FLINK-16392] [API / DataStream] 
oneside sorted cache in intervaljoin
URL: https://github.com/apache/flink/pull/11291
 
 
   ## What is the purpose of the change
   
   IntervalJoin is seeing a growing number of use cases. These use cases share 
the following pattern:
   
       left stream  pulled from static dataset periodically
       lookup time range is very large (days to weeks)
       right stream is web traffic with high QPS
   
   The current interval join implementation treats both streams equally. 
As RocksDB fetches and updates get more expensive, performance takes a hit, 
which blocks large use cases.
   
   The proposed implementation introduces two changes:
   
       allow users to opt in, via ProcessJoinFunction, to skipping the scan when 
the interval join operator receives events from the left stream (the static data set)
       build a sorted map from otherBuffer at the granularity of each seen key
           this expedites right stream lookups of left buffers without accessing 
RocksDB every time
           if a key sees an event from the left side, it cleans up the buffer and 
loads the buffer from the right side
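
The sorted-cache idea above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the class `OneSideSortedCache` and its methods `load`, `invalidate`, and `lookup` are hypothetical names, and the plain `Map` passed to `load` stands in for whatever the operator would read from its RocksDB-backed buffer. It assumes the usual interval join bounds, where a right record at time `t` matches left records with timestamps in `[t - upperBound, t - lowerBound]`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical per-key sorted cache of left-side records (illustration only). */
class OneSideSortedCache<K, V> {
    // join key -> (left timestamp -> records buffered at that timestamp)
    private final Map<K, TreeMap<Long, List<V>>> cache = new HashMap<>();

    /** Assumed hook: a left-side record arrived for this key, so drop the stale view. */
    void invalidate(K key) {
        cache.remove(key);
    }

    /** Assumed hook: (re)build the sorted view for a key from a buffer snapshot. */
    void load(K key, Map<Long, List<V>> bufferSnapshot) {
        cache.put(key, new TreeMap<>(bufferSnapshot));
    }

    /** Returns true if a sorted view is currently held in memory for this key. */
    boolean isCached(K key) {
        return cache.containsKey(key);
    }

    /**
     * Range lookup for a right-side record at rightTs. Left records match when
     * their timestamp lies in [rightTs - upperBound, rightTs - lowerBound].
     * Requires lowerBound <= upperBound.
     */
    List<V> lookup(K key, long rightTs, long lowerBound, long upperBound) {
        List<V> matches = new ArrayList<>();
        TreeMap<Long, List<V>> sorted = cache.get(key);
        if (sorted == null) {
            return matches; // not cached; a real operator would fall back to state
        }
        for (List<V> bucket :
                sorted.subMap(rightTs - upperBound, true, rightTs - lowerBound, true).values()) {
            matches.addAll(bucket);
        }
        return matches;
    }
}
```

With a view like this, each right-side record costs one in-memory `subMap` range scan rather than a state backend read, provided the key's view is already loaded. The trade-off is the memory held per cached key, plus a reload whenever a left-side record invalidates the view.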
   
   
   
   
   ## Brief change log
   
   
   ## Verifying this change
   
   This change is a trivial rework / code cleanup without any test coverage.
   
   no
   
   This change is already covered by existing tests, such as *(please describe 
tests)*.
   
   IntervalJoinITCase
   
   This change added tests and can be verified as follows:
   
   run IntervalJoinITCase test suite
   
   ## Does this pull request potentially affect one of the following parts:
   
     - Dependencies (does it add or upgrade a dependency): (yes / no)
   no
     - The public API, i.e., is any changed class annotated with 
`@Public(Evolving)`: (yes / no)
   no
     - The serializers: (yes / no / don't know)
   no
     - The runtime per-record code paths (performance sensitive): (yes / no / 
don't know)
   don't know
     - Anything that affects deployment or recovery: JobManager (and its 
components), Checkpointing, Kubernetes/Yarn/Mesos, ZooKeeper: (yes / no / don't 
know)
   no
     - The S3 file system connector: (yes / no / don't know)
   no
   ## Documentation
   
     - Does this pull request introduce a new feature? (yes / no)
   yes
     - If yes, how is the feature documented? (not applicable / docs / JavaDocs 
/ not documented)
   javadocs


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
