[
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754406#comment-16754406
]
Sanket Reddy commented on SPARK-25692:
--------------------------------------
I had a few observations regarding this test suite.
When I run it on a Mac:
$ sysctl hw.physicalcpu hw.logicalcpu
hw.physicalcpu: 4
hw.logicalcpu: 8
I don't see the issue. I think this has to do with io.serverThreads and the
number of threads used for handling chunked blocks: since several tests in
the suite handle chunked blocks, they may need sufficient threads to serve
the requests.
On a VM I was able to reproduce this consistently:
-bash-4.1$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
CPU(s): 4
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4
As [~zsxwing] pointed out, the root cause of the failure may be the sharing
of worker threads at
https://github.com/apache/spark/blob/c00186f90cfcc33492d760f874ead34f0e3da6ed/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java#L88
When I remove the static modifier, I no longer see the test failure.
So do we really need it to be static?
I don't think this requires a global declaration, as these threads are only
required on the shuffle server end; on the client side, TransportContext
initialization does not create these threads. I assume that for the shuffle
server there would be only one TransportContext object. So I think it is
fine for this to be an instance variable, and I see no harm. I will do some
more testing, and if everything is fine I will put up the PR.
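
To illustrate the static-versus-instance question, here is a minimal sketch. It is not the actual TransportContext code: the class names StaticPoolContext and InstancePoolContext are hypothetical, and a plain java.util.concurrent ExecutorService stands in for the real worker threads. It only shows that a static field makes every context in the JVM (for example, every test in a suite) share one pool, while an instance field gives each context its own.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPoolSharing {

    /** Hypothetical stand-in: the pool lives in a static field, so it is shared JVM-wide. */
    static final class StaticPoolContext {
        private static ExecutorService workers; // one pool for all instances

        StaticPoolContext(int threads) {
            synchronized (StaticPoolContext.class) {
                if (workers == null) {
                    // Only the first context to be created decides the pool size.
                    workers = Executors.newFixedThreadPool(threads);
                }
            }
        }

        ExecutorService workers() {
            return workers;
        }
    }

    /** Hypothetical stand-in: the pool is an instance field, so each context owns its own. */
    static final class InstancePoolContext {
        private final ExecutorService workers;

        InstancePoolContext(int threads) {
            this.workers = Executors.newFixedThreadPool(threads);
        }

        ExecutorService workers() {
            return workers;
        }
    }

    public static void main(String[] args) {
        StaticPoolContext a = new StaticPoolContext(2);
        StaticPoolContext b = new StaticPoolContext(2);
        System.out.println("static field shares pool: " + (a.workers() == b.workers()));

        InstancePoolContext c = new InstancePoolContext(2);
        InstancePoolContext d = new InstancePoolContext(2);
        System.out.println("instance field shares pool: " + (c.workers() == d.workers()));

        // Shut the pools down so the JVM can exit cleanly.
        a.workers().shutdown();
        c.workers().shutdown();
        d.workers().shutdown();
    }
}
```

With the static variant, work submitted through one context competes for the same threads as work submitted through any other context, which would be consistent with the failure showing up only on machines with few cores.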
> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> ------------------------------------------------------
>
> Key: SPARK-25692
> URL: https://issues.apache.org/jira/browse/SPARK-25692
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 3.0.0
> Reporter: Shixiong Zhu
> Priority: Blocker
> Fix For: 3.0.0
>
> Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See:
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.