[ 
https://issues.apache.org/jira/browse/SPARK-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14032867#comment-14032867
 ] 

Chen Jin commented on SPARK-1112:
---------------------------------

To follow up on this thread, I ran some experiments with the frameSize set to 
around 10MB.

1) spark.akka.frameSize = 10
If the size of one of the partitions is very close to 10MB, say 9.97MB, 
execution blocks without any exception or warning. The worker finishes the task 
and sends the serialized result, and then throws an exception saying that the 
Hadoop IPC client connection stopped (visible after changing the logging to 
debug level). However, the master never receives the result and the program 
just hangs.
But if the sizes of all the partitions are below some number between 9.96MB and 
9.97MB, the program works fine.
2) spark.akka.frameSize = 9
When the partition size is just a little smaller than 9MB, it fails as well.
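
As a rough sketch of the arithmetic behind case 1), the Python snippet below 
computes how many bytes of the frame are left unused at the two measured sizes. 
It assumes that "MB" above means MiB (1024 * 1024 bytes); that unit, and the 
helper name headroom_bytes, are assumptions made for illustration and are not 
taken from Spark.

```python
# Assumption: the "MB" figures in the comment are MiB (1024 * 1024 bytes).
MIB = 1024 * 1024


def headroom_bytes(frame_size_mb, result_size_mb):
    """Bytes left between the configured frame size and a result of the given size."""
    return round((frame_size_mb - result_size_mb) * MIB)


frame_size = 10      # spark.akka.frameSize = 10
works_below = 9.96   # all partitions below this size: program works
hangs_at = 9.97      # one partition of this size: program hangs

print(headroom_bytes(frame_size, hangs_at))     # 31457
print(headroom_bytes(frame_size, works_below))  # 41943
```

Under that assumption, the hang starts once somewhere between roughly 31,000 
and 42,000 bytes of the frame remain. This only restates the measurement; it 
does not identify where those bytes go.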

This behavior is not exactly what SPARK-1112 is about. Could you please advise 
me on how to open a separate bug for the case where the serialized size is very 
close to 10MB?

Thanks a lot

> When spark.akka.frameSize > 10, task results bigger than 10MiB block execution
> ------------------------------------------------------------------------------
>
>                 Key: SPARK-1112
>                 URL: https://issues.apache.org/jira/browse/SPARK-1112
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 0.9.0
>            Reporter: Guillaume Pitel
>            Priority: Critical
>             Fix For: 0.9.2
>
>
> When I set the spark.akka.frameSize to something over 10, the messages sent 
> from the executors to the driver completely block the execution if the 
> message is bigger than 10MiB and smaller than the frameSize (if it's above 
> the frameSize, it's ok)
> The workaround is to set spark.akka.frameSize to 10. In this case, since 
> 0.8.1, the blockManager deals with the data to be sent. It seems slower than 
> a direct akka message, though.
> The configuration seems to be correctly read (see actorSystemConfig.txt), so 
> I don't see where the 10MiB could come from 



--
This message was sent by Atlassian JIRA
(v6.2#6252)
