Hi Jeroen:
I am not sure if I missed it, but can you let us know what your input
source and output sink are?
In some cases I found that saving directly to S3 was the problem. In that case I
started saving the output to the EMR cluster's HDFS and later copied it to S3
using s3-dist-cp, which solved our issue.
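
In case it is useful, here is a rough sketch of that pattern in Scala -- the
DataFrame name, bucket names and paths below are only placeholders, adjust them
for your job:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // "result" stands for whatever DataFrame your job produces
    // (hypothetical input path).
    val result = spark.read.parquet("s3://your-input-bucket/input/")

    // Write to the cluster's local HDFS first instead of directly to S3.
    result.write
      .mode("overwrite")
      .parquet("hdfs:///tmp/job-output/")

    // Then, on the master node, copy the files over to S3 with s3-dist-cp:
    //   s3-dist-cp --src hdfs:///tmp/job-output/ --dest s3://your-output-bucket/job-output/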

Mans 

    On Monday, January 1, 2018 7:41 AM, Rohit Karlupia <roh...@qubole.com> 
wrote:
 

 Here is the list I would probably work through:   
   - Check GC on the offending executor while the task is running. Maybe you 
need even more memory.  
   - Go back to a previous successful run of the job, open the Spark UI for the 
offending stage, and check the max task time / max input / max shuffle in/out 
for the largest task. This will help you understand the degree of skew in this 
stage. 
   - Take a thread dump of the executor from the Spark UI and verify whether the 
task is really doing any work or is stuck in some deadlock. Some of the Hive 
SerDes are not really usable from multi-threaded/multi-use Spark executors. 
   - Take a thread dump of the executor from the Spark UI and verify whether the 
task is spilling to disk. Playing with the storage and memory fractions, or 
generally increasing the memory, will help (see the configuration sketch after 
this list). 
   - Check the disk utilisation on the machine running the executor. 
   - Look for event-loss messages in the logs caused by the event queue being 
full. Loss of events can send some of the Spark components into really bad 
states.  
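
For the GC and memory-fraction points above, here is a minimal sketch of the
settings I would start from -- the values are only illustrative and need to be
tuned for your workload:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("debug-run")
      // Print GC details in the executor logs so GC pressure is visible.
      .config("spark.executor.extraJavaOptions",
              "-XX:+PrintGCDetails -XX:+PrintGCDateStamps")
      // Simply giving executors more memory is often the quickest fix
      // (illustrative value).
      .config("spark.executor.memory", "8g")
      // Fraction of the heap shared by execution and storage,
      // and the share of it reserved for storage.
      .config("spark.memory.fraction", "0.6")
      .config("spark.memory.storageFraction", "0.5")
      .getOrCreate()

The same settings can also be passed with --conf on spark-submit.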

thanks,
rohitk


On Sun, Dec 31, 2017 at 12:50 AM, Gourav Sengupta <gourav.sengu...@gmail.com> 
wrote:

Hi,
Please try to use the Spark UI in the way that AWS EMR recommends; it should be 
available from the resource manager. I have never had any problem working 
with it. THAT HAS ALWAYS BEEN MY PRIMARY AND SOLE SOURCE OF DEBUGGING.
Sadly, I cannot be of much help unless we go for a screen-share session over 
Google Chat or Skype. 
Also, I ALWAYS prefer the maximizeResourceAllocation setting in EMR to be set 
to true. 
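
For reference, that setting can be supplied through the cluster's configuration
JSON using the "spark" classification -- roughly like this (a sketch of what the
EMR documentation describes):

    [
      {
        "Classification": "spark",
        "Properties": {
          "maximizeResourceAllocation": "true"
        }
      }
    ]
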
Besides that, there is a metric in the EMR console which shows, on graphs, the 
number of containers generated by your job.


Regards,
Gourav Sengupta
On Fri, Dec 29, 2017 at 6:23 PM, Jeroen Miller <bluedasya...@gmail.com> wrote:

Hello,

Just a quick update, as I have not made much progress yet.

On 28 Dec 2017, at 21:09, Gourav Sengupta <gourav.sengu...@gmail.com> wrote:
> can you then try to use EMR version 5.10 or EMR version 5.11 instead?

Same issue with EMR 5.11.0. Task 0 in one stage never finishes.

> can you please try selecting a subnet which is in a different availability 
> zone?

I have not tried this yet. But why should that make a difference?

> if possible just try to increase the number of task instances and see the 
> difference?

I tried with 512 partitions -- no difference.

> also in case you are using caching,

No caching used.

> Also can you please report the number of containers that your job is creating 
> by looking at the metrics in the EMR console?

8 containers, if I trust the directories in j-xxx/containers/application_xxx/.

> Also if you see the spark UI then you can easily see which particular step is 
> taking the longest period of time - you just have to drill in a bit in order 
> to see that. Generally in case shuffling is an issue then it definitely 
> appears in the SPARK UI as I drill into the steps and see which particular 
> one is taking the longest.

I always have issues with the Spark UI on EC2 -- it never seems to be up to 
date.

JM