[ 
https://issues.apache.org/jira/browse/SPARK-25692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16754406#comment-16754406
 ] 

Sanket Reddy commented on SPARK-25692:
--------------------------------------

I had a few observations regarding this test suite...

When I run it on a Mac:

$ sysctl hw.physicalcpu hw.logicalcpu
hw.physicalcpu: 4
hw.logicalcpu: 8 

I don't see the issue. I think this has to do with io.serverThreads and the 
number of threads used for handling chunked blocks: since several tests in the 
suite handle chunked blocks, they may need enough threads to serve the 
requests.

 

On a VM I was able to reproduce this consistently:

-bash-4.1$ lscpu | grep -E '^Thread|^Core|^Socket|^CPU\('
CPU(s): 4
Thread(s) per core: 1
Core(s) per socket: 1
Socket(s): 4

As [~zsxwing] pointed out, the root cause of the failure is probably the 
sharing of worker threads declared here:
https://github.com/apache/spark/blob/c00186f90cfcc33492d760f874ead34f0e3da6ed/common/network-common/src/main/java/org/apache/spark/network/TransportContext.java#L88

When I remove the static modifier, I no longer see the test failure.

 

So do we really need it to be static?

I don't think this requires a global (static) declaration, since these threads 
are only needed on the shuffle server side; when a client initializes its 
TransportContext, it does not create these threads. I assume the shuffle 
server has only one TransportContext object, so I think it is fine for this to 
be an instance variable, and I see no harm in it. I will do some more testing 
and, if everything is fine, put up the PR...
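
To make the static-versus-instance distinction concrete, here is a minimal 
sketch. It is not the Spark code: it uses a plain java.util.concurrent pool in 
place of the real worker threads, and the class names (WorkerPoolSketch, 
SharedPoolContext, OwnPoolContext) and the pool size of 2 are hypothetical, 
chosen only for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class WorkerPoolSketch {
  // Hypothetical stand-in for a context with a static worker pool:
  // every instance created in the JVM shares the same threads.
  static class SharedPoolContext {
    private static final ExecutorService WORKERS = Executors.newFixedThreadPool(2);

    ExecutorService workers() {
      return WORKERS;
    }
  }

  // Hypothetical stand-in for a context with an instance-level pool:
  // each context owns its threads and can shut them down independently.
  static class OwnPoolContext implements AutoCloseable {
    private final ExecutorService workers = Executors.newFixedThreadPool(2);

    ExecutorService workers() {
      return workers;
    }

    @Override
    public void close() {
      workers.shutdown();
    }
  }

  public static void main(String[] args) {
    SharedPoolContext a = new SharedPoolContext();
    SharedPoolContext b = new SharedPoolContext();
    // Both contexts hand out the very same pool object.
    System.out.println("static pool shared: " + (a.workers() == b.workers()));

    try (OwnPoolContext c = new OwnPoolContext();
         OwnPoolContext d = new OwnPoolContext()) {
      // Each context has its own pool object.
      System.out.println("instance pool shared: " + (c.workers() == d.workers()));
    }
    SharedPoolContext.WORKERS.shutdown();
  }
}
```

With the static field, every context created in the same JVM (for example, by 
several tests in one suite) competes for the same two threads; with the 
instance field, each context gets its own.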

 

> Flaky test: ChunkFetchIntegrationSuite.fetchBothChunks
> ------------------------------------------------------
>
>                 Key: SPARK-25692
>                 URL: https://issues.apache.org/jira/browse/SPARK-25692
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Shixiong Zhu
>            Priority: Blocker
>             Fix For: 3.0.0
>
>         Attachments: Screen Shot 2018-10-22 at 4.12.41 PM.png, Screen Shot 
> 2018-11-01 at 10.17.16 AM.png
>
>
> Looks like the whole test suite is pretty flaky. See: 
> https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-2.6/5490/testReport/junit/org.apache.spark.network/ChunkFetchIntegrationSuite/history/
> This may be a regression in 3.0 as this didn't happen in 2.4 branch.



