[ 
https://issues.apache.org/jira/browse/FLINK-11050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16712265#comment-16712265
 ] 

ASF GitHub Bot commented on FLINK-11050:
----------------------------------------

Myracle commented on issue #7226: FLINK-11050 add lowerBound and upperBound for 
optimizing RocksDBMapState's entries
URL: https://github.com/apache/flink/pull/7226#issuecomment-445098945
 
 
   > Hi @Myracle, thanks for the PR. I think we should either support the new 
API for all `MapState` implementations or declare the method as a best-effort 
filter (which means we have to manually filter the returned entries).
   > 
   > What do you think about this @StefanRRichter?
   > 
   > Best, Fabian
   
   Thank you for your reply. The lowerBound and upperBound are used by 
RocksDB's interface to avoid wasting time on deleted values. The situation is 
different when state is stored on the heap. Do you still think that we should 
support the new API for all `MapState` implementations? Also, I do not 
understand what "declare the method as a best-effort filter" means: the 
previous implementation returns all the values, and that is exactly the 
bottleneck in our situation.
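
One possible reading of "best-effort filter", sketched here under assumptions: a backend that cannot seek by key may ignore the bounds and return a superset, and the caller re-checks every returned entry. The class and method names (`BestEffortFilterSketch`, `entries`, `valuesInRange`) and the inclusive bounds are hypothetical and are not taken from the PR or from Flink's `MapState` API.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class BestEffortFilterSketch {

    // Hypothetical stand-in for a backend without ordered keys (e.g. a heap map).
    // It ignores the bounds and returns every entry, which a best-effort
    // contract would permit.
    static Iterable<Map.Entry<Long, String>> entries(
            Map<Long, String> state, long lowerBound, long upperBound) {
        return state.entrySet();
    }

    // Caller side: because the backend may return a superset, each entry is
    // checked against the (assumed inclusive) bounds before it is used.
    static List<String> valuesInRange(
            Map<Long, String> state, long lowerBound, long upperBound) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<Long, String> e : entries(state, lowerBound, upperBound)) {
            long timestamp = e.getKey();
            if (timestamp >= lowerBound && timestamp <= upperBound) {
                result.add(e.getValue());
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<Long, String> state = new HashMap<>();
        state.put(1L, "a");
        state.put(5L, "b");
        state.put(9L, "c");
        // Iteration order of a HashMap is unspecified, so only membership matters here.
        System.out.println(valuesInRange(state, 4L, 9L));
    }
}
```

Under this reading, a RocksDB-backed implementation could still use the bounds to skip work, while the result seen by the caller stays the same for every backend.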

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> When IntervalJoin, get left or right buffer's entries more quickly by 
> assigning lowerBound
> ------------------------------------------------------------------------------------------
>
>                 Key: FLINK-11050
>                 URL: https://issues.apache.org/jira/browse/FLINK-11050
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>    Affects Versions: 1.6.2, 1.7.0
>            Reporter: Liu
>            Priority: Major
>              Labels: performance, pull-request-available
>
>     When using IntervalJoin, it is very slow to get the left or right 
> buffer's entries, because we have to scan all of the buffer's values, 
> including the deleted values that are out of the time range. Processing 
> these deleted values consumes too much time in RocksDB's level 0. Since the 
> lowerBound is known, this can be optimized by seeking from the timestamp of 
> the lowerBound.
>     Our usage is like below:
> {code:java}
> labelStream.keyBy(uuid).intervalJoin(adLogStream.keyBy(uuid))
>            .between(Time.milliseconds(0), Time.milliseconds(600000))
>            .process(new processFunction())
>            .addSink(kafkaProducer)
> {code}
>     Our data is huge. The job always runs for an hour and then gets stuck in 
> RocksDB's seek when getting the buffer's entries. We used RocksDB's data to 
> reproduce the problem and found that too much time is spent on deleted 
> values. So we decided to optimize it by assigning the lowerBound instead of 
> doing a global search.
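
The idea in the description can be illustrated with a minimal sketch that uses a `TreeMap` as a stand-in for a store with sorted keys. It does not use RocksDB and does not model tombstones; it only contrasts scanning every buffered timestamp with seeking to the lowerBound and stopping after the upperBound. All names (`LowerBoundSeekSketch`, `fullScan`, `boundedScan`, `visited`) are hypothetical, and the bounds are assumed to be inclusive.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

public class LowerBoundSeekSketch {

    // Number of entries touched by the most recent scan.
    static int visited;

    // Global search: every buffered entry is visited and then filtered.
    static List<String> fullScan(NavigableMap<Long, String> buffer, long lower, long upper) {
        visited = 0;
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : buffer.entrySet()) {
            visited++;
            if (e.getKey() >= lower && e.getKey() <= upper) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    // Bounded search: start at the first key >= lower, stop once a key exceeds upper.
    static List<String> boundedScan(NavigableMap<Long, String> buffer, long lower, long upper) {
        visited = 0;
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : buffer.tailMap(lower, true).entrySet()) {
            if (e.getKey() > upper) {
                break;
            }
            visited++;
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableMap<Long, String> buffer = new TreeMap<>();
        for (long ts = 0; ts < 100; ts++) {
            buffer.put(ts, "v" + ts);
        }
        fullScan(buffer, 90L, 95L);
        System.out.println("full scan visited: " + visited);
        boundedScan(buffer, 90L, 95L);
        System.out.println("bounded scan visited: " + visited);
    }
}
```

Both methods return the same values; the difference is only in how many entries are touched, which is the cost the issue attributes to out-of-range entries.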



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
