[
https://issues.apache.org/jira/browse/CASSANDRA-16681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17367010#comment-17367010
]
Gianluca Righetto commented on CASSANDRA-16681:
-----------------------------------------------
I've been looking into this ticket on and off as well.
What I noticed is that sometimes a worker thread is simply lagging behind and
the corresponding chunk doesn't get recycled in the arbitrary window of 10
seconds. But if we were to wait a couple more seconds (instead of failing with
the assertion) we'd see it would eventually catch up.
The problem is that the {{sharedRecycle}} queues fill up more quickly than the
other threads can process them.
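To illustrate the "wait a bit longer instead of asserting" idea, here is a minimal sketch of a deadline-based poll. The class {{AwaitSketch}} and its {{await}} method are hypothetical names made up for this example and are not part of {{LongBufferPoolTest}}; the timeout is just a parameter the caller picks.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not taken from the Cassandra code base.
final class AwaitSketch
{
    // Polls the condition until it holds or the timeout elapses.
    // Returns true if the condition became true in time, false otherwise.
    static boolean await(BooleanSupplier condition, long timeout, TimeUnit unit)
    {
        long deadline = System.nanoTime() + unit.toNanos(timeout);
        while (!condition.getAsBoolean())
        {
            // Compare via subtraction so nanoTime overflow is handled correctly.
            if (System.nanoTime() - deadline >= 0)
                return false;
            try
            {
                Thread.sleep(10);
            }
            catch (InterruptedException e)
            {
                // Restore the interrupt flag and report failure to the caller.
                Thread.currentThread().interrupt();
                return false;
            }
        }
        return true;
    }
}
```

A lagging worker would then only fail the test if it still had not caught up when the (longer) deadline expired, rather than at a fixed 10 second mark.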
To reproduce the failure locally, I simply raise the number of concurrent
worker threads from 2x the core count to, say, 10x here
[https://github.com/apache/cassandra/blob/699a1f74fcc1da1952da6b2b0309c9e2474c67f4/test/burn/org/apache/cassandra/utils/memory/LongBufferPoolTest.java#L139].
I believe some sort of CPU contention is also happening in CircleCI, given the
test can't determine the right number of processors available to the Docker
container.
The xlarge instance has only 8 vCPUs, but from the logs we can see the test
identifies 36 cores and so it starts 72 threads:
{code:java}
[junit-timeout] INFO [main] 2021-05-19 16:18:52,488 LongBufferPoolTest.java:264 - 2021/05/19 16:18:52 - testing 72 threads for 2m
{code}
I tested with different instance sizes (small, medium, xlarge) and they all
report 36 cores.
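The arithmetic behind those numbers can be sketched as follows. This assumes the thread count is derived from {{Runtime.getRuntime().availableProcessors()}} multiplied by the factor of 2 mentioned above; the class and method names are hypothetical and only serve the example.

```java
// Hypothetical sketch; names are not taken from LongBufferPoolTest.
final class ThreadCountSketch
{
    // threads = reported cores * factor (the comment above mentions a factor of 2)
    static int threadCount(int reportedCores, int factor)
    {
        return reportedCores * factor;
    }

    public static void main(String[] args)
    {
        // What the JVM believes it has; inside a container this may be
        // the host's core count rather than the container's CPU allocation.
        int reported = Runtime.getRuntime().availableProcessors();
        System.out.println("reported cores: " + reported
                           + ", threads: " + threadCount(reported, 2));

        // The CircleCI case from the log: 36 reported cores, factor 2.
        System.out.println("CircleCI case: " + threadCount(36, 2) + " threads");
    }
}
```

With 36 reported cores this gives 72 threads, all competing for the 8 vCPUs the xlarge instance actually has.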
This seems to be a problem with CircleCI in general:
[https://circleci.com/docs/2.0/configuration-reference/#resourceclass]
{quote}Note: Java, Erlang and any other languages that introspect the /proc
directory for information about CPU count may require additional configuration
to prevent them from slowing down when using the CircleCI 2.0 resource class
feature. Programs with this issue may request 32 CPU cores and run slower than
they would when requesting one core. Users of languages with this issue should
pin their CPU count to their guaranteed CPU resources.
{quote}
I have a patch to set a fixed number of worker threads (16) for this test that
should help with this issue: [https://github.com/grighetto/cassandra/pull/8]
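For readers who don't want to open the PR, the general idea of pinning the worker count can be sketched like this. It is not the patch itself: the class name and the {{sketch.bufferpool.workers}} system property are assumptions invented for the example, and only the default of 16 comes from the description above.

```java
// Hypothetical sketch of a fixed worker count; not the code from the linked PR.
final class FixedWorkersSketch
{
    // Default taken from the description above.
    static final int DEFAULT_WORKERS = 16;

    // Returns a worker count that does not depend on the reported core count.
    // The system property is a made-up override for local experiments.
    static int workerThreads()
    {
        int configured = Integer.getInteger("sketch.bufferpool.workers", DEFAULT_WORKERS);
        return Math.max(1, configured);
    }
}
```

Because the result no longer depends on the core count the JVM reports, a container that reports the host's cores would still start the same number of workers.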
All other assertions in the test that deal with integrity/correctness always
passed, which also indicates this is really just a timing issue.
> org.apache.cassandra.utils.memory.LongBufferPoolTest - tests are flaky
> ----------------------------------------------------------------------
>
> Key: CASSANDRA-16681
> URL: https://issues.apache.org/jira/browse/CASSANDRA-16681
> Project: Cassandra
> Issue Type: Bug
> Components: CI
> Reporter: Ekaterina Dimitrova
> Assignee: Brandon Williams
> Priority: Normal
> Fix For: 4.0, 4.0-rc
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> Jenkins history:
> [https://jenkins-cm4.apache.org/job/Cassandra-4.0/50/testReport/junit/org.apache.cassandra.utils.memory/LongBufferPoolTest/testPoolAllocateWithRecyclePartially/history/]
> Fails being run in a loop in CircleCI:
> https://app.circleci.com/pipelines/github/ekaterinadimitrova2/cassandra/844/workflows/945011f4-00ac-4678-89f6-5c0db0a40169/jobs/5008
>