[
https://issues.apache.org/jira/browse/FLINK-11050?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716775#comment-16716775
]
ASF GitHub Bot commented on FLINK-11050:
----------------------------------------
StefanRRichter commented on issue #7226: FLINK-11050 add lowerBound and
upperBound for optimizing RocksDBMapState's entries
URL: https://github.com/apache/flink/pull/7226#issuecomment-446148429
I think this suggestion is problematic for the following reasons:
1. Map state is in general unordered.
2. Keys in map state are not required to be `Comparable`, so what defines
the order against which we can compare upper and lower keys?
3. Even if keys implement `Comparable`, their order in RocksDB depends
on the lexicographical order of their bytes in serialized form, which can be a
different order from the one defined by `compareTo()`.
4. The method currently does different things for RocksDB and Heap, and
this is not properly reflected in the documentation of the method.
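To make point 3 concrete, here is a minimal sketch of how the two orders can disagree. It assumes a simplified key format (a `long` written as 8 big-endian bytes) and an unsigned byte-wise comparison similar in spirit to RocksDB's default comparator. The class and method names are hypothetical and not part of Flink or RocksDB, and the real serializers may encode keys differently.

```java
import java.nio.ByteBuffer;

public class ByteOrderDemo {

    // Assumption: the key is serialized as 8 big-endian bytes.
    static byte[] serialize(long value) {
        return ByteBuffer.allocate(Long.BYTES).putLong(value).array();
    }

    // Unsigned, byte-wise lexicographic comparison of two serialized keys.
    static int compareBytes(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (cmp != 0) {
                return cmp;
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        long x = -1L;
        long y = 1L;
        // Natural order: -1 sorts before 1 (negative result).
        System.out.println("compareTo order:  " + Long.compare(x, y));
        // Serialized order: 0xFF... sorts after 0x00...01 (positive result).
        System.out.println("serialized order: " + compareBytes(serialize(x), serialize(y)));
    }
}
```

Under these assumptions, a negative key sorts after a positive one in serialized form, so a lower/upper bound expressed in terms of `compareTo()` would not select the same range as a byte-wise seek.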
Overall, that leads me to the conclusion that we should not rush to add this
optimization but need to think it through more carefully, e.g. by introducing a
subclass `OrderedMapState` (openly, or hidden and cast where the optimization is
required). Even in that case we need to be careful when addressing the problem of
different orders (`Comparable` vs byte-lexicographical), and I feel that needs
more thought. So currently I am leaning towards 👎 for the suggested approach.
----------------------------------------------------------------
> When IntervalJoin, get left or right buffer's entries more quickly by
> assigning lowerBound
> ------------------------------------------------------------------------------------------
>
> Key: FLINK-11050
> URL: https://issues.apache.org/jira/browse/FLINK-11050
> Project: Flink
> Issue Type: Improvement
> Components: State Backends, Checkpointing
> Affects Versions: 1.6.2, 1.7.0
> Reporter: Liu
> Priority: Major
> Labels: performance, pull-request-available
>
> With IntervalJoin, getting the left or right buffer's entries is very slow
> because we have to scan all of the buffer's values, including deleted values
> that are outside the time range. Processing these deleted values consumes too
> much time in RocksDB's level 0. Since the lowerBound is known, this can be
> optimized by seeking from the timestamp of the lowerBound.
> Our usage looks like this:
> {code:java}
> labelStream.keyBy(uuid).intervalJoin(adLogStream.keyBy(uuid))
> .between(Time.milliseconds(0), Time.milliseconds(600000))
> .process(new processFunction())
> .addSink(kafkaProducer)
> {code}
> Our data is huge. The job always runs for an hour and then gets stuck in
> RocksDB's seek when getting the buffer's entries. We used the RocksDB data to
> reproduce the problem in RocksDB and found that too much time is spent on
> deleted values. So we decided to optimize it by assigning the lowerBound
> instead of doing a global search.
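
The difference between a global scan and a seek from the lower bound can be sketched with an in-memory sorted map. This is only an illustration under stated assumptions: a `TreeMap` keyed by non-negative timestamps stands in for the buffer, and the class and method names are hypothetical. It is not the RocksDB or Flink implementation, and it only works because the map's order matches the timestamp order.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class LowerBoundSeekDemo {

    // Global scan: visit every buffered entry and filter by timestamp.
    static List<String> scanAll(TreeMap<Long, String> buffer, long lowerBound, long upperBound) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : buffer.entrySet()) {
            if (e.getKey() >= lowerBound && e.getKey() <= upperBound) {
                out.add(e.getValue());
            }
        }
        return out;
    }

    // Seek: start at the first key >= lowerBound and stop once past upperBound.
    static List<String> seekFrom(TreeMap<Long, String> buffer, long lowerBound, long upperBound) {
        List<String> out = new ArrayList<>();
        for (Map.Entry<Long, String> e : buffer.tailMap(lowerBound, true).entrySet()) {
            if (e.getKey() > upperBound) {
                break;
            }
            out.add(e.getValue());
        }
        return out;
    }

    public static void main(String[] args) {
        TreeMap<Long, String> buffer = new TreeMap<>();
        for (long ts = 0; ts < 10; ts++) {
            buffer.put(ts, "event-" + ts);
        }
        System.out.println(scanAll(buffer, 7, 8));  // visits all 10 entries
        System.out.println(seekFrom(buffer, 7, 8)); // visits only entries from ts 7 onward
    }
}
```

Both methods return the same entries; the seek variant simply skips everything below the lower bound instead of visiting it.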