[ 
https://issues.apache.org/jira/browse/KAFKA-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16780115#comment-16780115
 ] 

Jonathan Gordon commented on KAFKA-7652:
----------------------------------------

{quote}1) when you profile on latest trunk did you see the same pattern as 
observed in [https://i.imgur.com/IHxC2cZ.png] as well as in the trace logging 
compared with 0.10.2.x?
{quote}
The image you linked is actually for 0.10.2.x, which is our current deployment. 
It shows us gated by RocksDB, but that's actually *faster* than what we saw in 
0.11.0.0, the recent trunk, or the test I just ran against 2.2.0-rc0:

[https://i.imgur.com/L6PWIEF.png]
{quote}2) practically the lookups in the caching layer are very cheap, so even 
a large increase should not contribute much overhead, whereas the 
fetches on the underlying store would be much more expensive. Could you confirm 
whether the performance bottleneck is in the underlying RocksDB or in the 
caching layer access?
{quote}
For 2.2.0-rc0, we're spending the bulk of our time trying to retrieve records 
from the NamedCache. See:

[^0.10.2.1-NamedCache.txt]

[^2.2.0-rc0_b-NamedCache.txt]

While I agree that each retrieval should be cheap, the latest logs show 
1,096,089 (2.2.0-rc0) versus 19,245 (0.10.2.1) cache hits per second. That 
roughly 57-fold increase, nearly two orders of magnitude, appears to outweigh 
whatever performance benefit we'd receive from the caching layer.
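
To make the arithmetic explicit, here is a minimal sketch that computes the ratio from the two hits-per-second figures quoted above. The class and method names are placeholders chosen for illustration; only the two numbers come from the attached logs.

```java
public class CacheHitRatio {
    // Cache hits per second quoted above from the attached NamedCache logs.
    static final long HITS_2_2_0_RC0 = 1_096_089L;
    static final long HITS_0_10_2_1 = 19_245L;

    // How many times more hits per second the newer version performs.
    static double ratio(long newer, long older) {
        return (double) newer / older;
    }

    public static void main(String[] args) {
        double r = ratio(HITS_2_2_0_RC0, HITS_0_10_2_1);
        // Prints the multiplier and its base-10 logarithm (orders of magnitude).
        System.out.printf("ratio: %.1fx, log10: %.2f%n", r, Math.log10(r));
    }
}
```

The multiplier comes out at roughly 57, which is a little under two full orders of magnitude.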

This is just one of 8 tasks. During their respective runs, the services 
consumed 8.4M messages (0.10.2.1) with no lag vs 637K messages (2.2.0-rc0) with 
considerable lag. I'd be happy to run again with whatever custom logging or 
configuration you suggest to help further pinpoint the problem. 
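
As a starting point for such a rerun, here is a rough sketch of the two settings listed in the issue description, expressed as plain `java.util.Properties`. The class and method names are hypothetical. The second method is only a suggestion on my part, not something requested in this thread: it assumes that setting `cache.max.bytes.buffering` to 0 turns the record cache off, which could help separate cache overhead from RocksDB cost.

```java
import java.util.Properties;

public class RerunConfig {
    // Settings from the issue description: 30 s commit interval, 10 MiB cache.
    static Properties currentSettings() {
        Properties props = new Properties();
        props.setProperty("commit.interval.ms", String.valueOf(30 * 1000));
        props.setProperty("cache.max.bytes.buffering", String.valueOf(10485760));
        return props;
    }

    // Hypothetical comparison run: same commit interval, cache size set to 0
    // (assumed here to disable the caching layer entirely).
    static Properties cacheDisabled() {
        Properties props = currentSettings();
        props.setProperty("cache.max.bytes.buffering", "0");
        return props;
    }

    public static void main(String[] args) {
        System.out.println("current:        " + currentSettings());
        System.out.println("cache disabled: " + cacheDisabled());
    }
}
```

Either `Properties` object would still need the usual application id, bootstrap servers, and serde settings before being passed to the Streams application.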

> Kafka Streams Session store performance degradation from 0.10.2.2 to 0.11.0.0
> -----------------------------------------------------------------------------
>
>                 Key: KAFKA-7652
>                 URL: https://issues.apache.org/jira/browse/KAFKA-7652
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.11.0.0, 0.11.0.1, 0.11.0.2, 0.11.0.3, 1.1.1, 2.0.0, 
> 2.0.1
>            Reporter: Jonathan Gordon
>            Assignee: Guozhang Wang
>            Priority: Major
>              Labels: kip
>             Fix For: 2.2.0
>
>         Attachments: 0.10.2.1-NamedCache.txt, 2.2.0-rc0_b-NamedCache.txt, 
> kafka_10_2_1_flushes.txt, kafka_11_0_3_flushes.txt
>
>
> I'm creating this issue in response to [~guozhang]'s request on the mailing 
> list:
> [https://lists.apache.org/thread.html/97d620f4fd76be070ca4e2c70e2fda53cafe051e8fc4505dbcca0321@%3Cusers.kafka.apache.org%3E]
> We are attempting to upgrade our Kafka Streams application from 0.10.2.1 but 
> are experiencing severe performance degradation. Most of the CPU time 
> appears to be spent retrieving from the local cache. Here's an example thread 
> profile with 0.11.0.0:
> [https://i.imgur.com/l5VEsC2.png]
> When things are running smoothly we're gated by retrieving from the state 
> store with acceptable performance. Here's an example thread profile with 
> 0.10.2.1:
> [https://i.imgur.com/IHxC2cZ.png]
> Some investigation reveals that we're performing about 3 orders of 
> magnitude more lookups on the NamedCache over a comparable time period. I've 
> attached the NamedCache flush logs for 0.10.2.1 and 0.11.0.3.
> We're using session windows and have the app configured for 
> commit.interval.ms = 30 * 1000 and cache.max.bytes.buffering = 10485760
> I'm happy to share more details if they would be helpful. Also happy to run 
> tests on our data.
> I also found this issue, which seems like it may be related:
> https://issues.apache.org/jira/browse/KAFKA-4904
>  
> KIP-420: 
> [https://cwiki.apache.org/confluence/display/KAFKA/KIP-420%3A+Add+Single+Value+Fetch+in+Session+Stores]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
