[ 
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367010#comment-17367010
 ] 

Gianluca Righetto commented on CASSANDRA-16681:
-----------------------------------------------

I've been looking into this ticket on and off too.

What I noticed is that sometimes a worker thread simply lags behind, and the 
corresponding chunk doesn't get recycled within the arbitrary 10-second window. 
If we waited a couple more seconds (instead of failing the assertion), we'd 
see it eventually catch up.
The problem is that the {{sharedRecycle}} queues fill up more quickly than the 
other threads can process them.
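
As a rough illustration of why a fixed window can fail on a lagging worker, here is a minimal sketch of deadline-based polling. It is not the mechanism {{LongBufferPoolTest}} actually uses; the class name {{AwaitSketch}}, the method {{awaitCondition}} and the 200 ms "lag" are hypothetical and chosen only for the demonstration.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not taken from LongBufferPoolTest.
final class AwaitSketch {
    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the condition was met in time, false on timeout or interrupt.
    static boolean awaitCondition(BooleanSupplier done, long timeout, TimeUnit unit) {
        final long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!done.getAsBoolean()) {
            if (System.nanoTime() - deadline >= 0)
                return false; // window expired before the condition held
            try {
                Thread.sleep(1);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // preserve the interrupt flag
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        final long start = System.nanoTime();
        // Stand-in for a lagging worker: the "chunk" counts as recycled
        // 200 ms after start (an assumed delay for the demo).
        BooleanSupplier recycled =
            () -> System.nanoTime() - start >= TimeUnit.MILLISECONDS.toNanos(200);

        System.out.println("50 ms window: " + awaitCondition(recycled, 50, TimeUnit.MILLISECONDS));
        System.out.println("2 s window:   " + awaitCondition(recycled, 2, TimeUnit.SECONDS));
    }
}
```

The same condition can miss a short window and still be met within a longer one, which is consistent with a timing problem rather than a correctness problem.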

To reproduce the failure locally, I simply increase the multiplier for the 
number of concurrent worker threads from 2 to, say, 10 here 
[https://github.com/apache/cassandra/blob/699a1f74fcc1da1952da6b2b0309c9e2474c67f4/test/burn/org/apache/cassandra/utils/memory/LongBufferPoolTest.java#L139].

I believe some sort of CPU contention is also happening in CircleCI, given that 
the test can't determine the right number of processors available to the Docker 
container.
The xlarge instance has only 8 vCPUs, but the logs show that the test detects 
36 cores and therefore starts 72 threads:
{code:java}
[junit-timeout] INFO  [main] 2021-05-19 16:18:52,488 
LongBufferPoolTest.java:264 - 2021/05/19 16:18:52 - testing 72 threads for 2m
{code}
I tested with different instance sizes (small, medium, xlarge) and they all 
report 36 cores.
This seems to be a problem with CircleCI in general:

[https://circleci.com/docs/2.0/configuration-reference/#resourceclass]
{quote}Note: Java, Erlang and any other languages that introspect the /proc 
directory for information about CPU count may require additional configuration 
to prevent them from slowing down when using the CircleCI 2.0 resource class 
feature. Programs with this issue may request 32 CPU cores and run slower than 
they would when requesting one core. Users of languages with this issue should 
pin their CPU count to their guaranteed CPU resources.
{quote}
I have a patch to set a fixed number of worker threads (16) for this test that 
should help with this issue: [https://github.com/grighetto/cassandra/pull/8]
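
To make the arithmetic concrete, the sketch below contrasts a thread count scaled from the reported processor count with a pinned one. It is not the code from the linked patch: the class {{WorkerCountSketch}} and its methods are hypothetical, the factor of 2 and the value 16 are taken from this comment, and the capped variant is an assumption for illustration (the patch may simply hard-code the value).

```java
// Hypothetical sketch, not the code from the linked patch.
final class WorkerCountSketch {
    // Fixed worker count mentioned in the comment.
    static final int FIXED_WORKERS = 16;

    // Thread count scaled by a factor of 2 from whatever the JVM reports.
    static int scaledWorkers(int reportedProcessors) {
        return Math.multiplyExact(reportedProcessors, 2);
    }

    // Assumed variant: scale as above, but never exceed the cap
    // and never drop below one worker.
    static int cappedWorkers(int reportedProcessors, int cap) {
        return Math.max(1, Math.min(scaledWorkers(reportedProcessors), cap));
    }

    public static void main(String[] args) {
        // In a container this may reflect the host rather than the container's quota.
        int reported = Runtime.getRuntime().availableProcessors();
        System.out.println("reported processors: " + reported);
        System.out.println("scaled workers:      " + scaledWorkers(reported));
        System.out.println("capped workers:      " + cappedWorkers(reported, FIXED_WORKERS));
    }
}
```

With 36 reported cores, the scaled count is 72 threads on an 8 vCPU instance, while the capped count stays at 16. A possible alternative, not evaluated in this comment, is to pin the count the JVM itself reports with the {{-XX:ActiveProcessorCount}} option on JVM versions that support it.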

All other assertions in the test that deal with integrity/correctness always 
passed, which also indicates this is really just a timing issue.

> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> ----------------------------------------------------------------------
>
>                 Key: CASSANDRA-16681
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
>             Project: Cassandra
>          Issue Type: Bug
>          Components: CI
>            Reporter: Ekaterina Dimitrova
>            Assignee: Brandon Williams
>            Priority: Normal
>             Fix For: 4.0, 4.0-rc
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>  



