Thanks for your reply, Gopal and Bikas. The first reason of the problem was a wrong JVM option about AM container. However, there still is hang-up at startup time after configuring it correctly. This problem is caused even if tez.am.max.app.attempts == 1. I attached the log on TEZ-1457, so I appreciate you can check it.
Thanks, - Tsuyoshi ---------- Forwarded message ---------- From: Gopal V <[email protected]> Date: Tue, Aug 19, 2014 at 4:45 AM Subject: Re: OrderedWordCount slow down or hung-up for a time To: Tsuyoshi OZAWA <[email protected]>, [email protected] On 8/18/14, 1:26 AM, Tsuyoshi OZAWA wrote: Hi, That is not the correct behaviour for the normal runs. The 10 minute gap sounds a lot like AM recovery is kicking in. This will happen if the application master is killed half-way through a job and it recovers work from the HDFS recovery logs. The most common case for me is when the AM container size and java opts are not in sync. Can you check if that's happening in cases with only just a single AM attempt? Cheers, Gopal > I'm trying some jobs in examples of tez. I found that sometimes jobs > get hung-up on my distributed environment. > Because of this behavior, the jobs on tez get slow down and sometimes > get slower than original MapReduce jobs(e.g. if I run > OrderedWordCount, WordCount + Sort in MapReduce is faster than > OrderedWordCount on Tez sometimes). Is this correct behavior? Do you > know how can we solve it or tune Tez program? I attached logs at the > time as follows: > > <--- log1 - job start up time --> > 14/08/18 16:02:36 INFO rpc.DAGClientRPCImpl: DAG: State: RUNNING > Progress: 0% TotalTasks: 6 Succeeded: 0 Running: 0 Failed: 0 Killed: 0 > 14/08/18 16:02:36 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 > Killed: 0 > 14/08/18 16:02:36 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Tokenizer Progress: 0% TotalTasks: -1 Succeeded: 0 Running: 0 Failed: > 0 Killed: 0 > 14/08/18 16:02:36 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Summation Progress: 0% TotalTasks: 6 Succeeded: 0 Running: 0 Failed: 0 > Killed: 0 > > -- next print is 10 minutes later -- > > 14/08/18 16:15:57 INFO rpc.DAGClientRPCImpl: DAG: State: RUNNING > Progress: 1.04% TotalTasks: 96 Succeeded: 1 Running: 52 Failed: 0 > Killed: 0 > 14/08/18 16:15:57 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 > Killed: 0 > 14/08/18 16:15:57 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Tokenizer Progress: 3.37% TotalTasks: 89 Succeeded: 3 Running: 52 > Failed: 0 Killed: 0 > 14/08/18 16:15:57 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Summation Progress: 0% TotalTasks: 6 Succeeded: 0 Running: 0 Failed: 0 > Killed: 0 > <--- log1 --> > > > <--- log2 --> > 14/08/18 16:35:15 INFO rpc.DAGClientRPCImpl: DAG: State: RUNNING > Progress: 92.71% TotalTasks: 96 Succeeded: 89 Running: 6 Failed: 0 > Killed: 0 > 14/08/18 16:35:15 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Tokenizer Progress: 100% TotalTasks: 89 Succeeded: 89 Running: 0 > Failed: 0 Killed: 0 > 14/08/18 12:19:25 INFO rpc.DAGClientRPCImpl: VertexStatus: > VertexStatus: VertexName: Sorter Progress: 0% TotalTasks: 1 Succeeded: > 0 Running: 0 Failed: 0 Killed: 0 > 14/08/18 16:35:15 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Summation Progress: 0% TotalTasks: 6 Succeeded: 0 Running: 6 Failed: 0 > Killed: 0 > > --- next print is 10 minutes later -- > > 14/08/18 16:53:28 INFO rpc.DAGClientRPCImpl: DAG: State: RUNNING > Progress: 93.75% TotalTasks: 96 Succeeded: 90 Running: 5 Failed: 0 > Killed: 0 > 14/08/18 16:53:28 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Sorter Progress: 0% TotalTasks: 1 Succeeded: 0 Running: 0 Failed: 0 > Killed: 0 > 14/08/18 16:53:28 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Tokenizer Progress: 100% TotalTasks: 89 Succeeded: 89 Running: 0 > Failed: 0 Killed: 0 > 14/08/18 16:53:28 INFO rpc.DAGClientRPCImpl: VertexStatus: VertexName: > Summation Progress: 16.67% TotalTasks: 6 Succeeded: 1 Running: 5 > Failed: 0 Killed: 0 > <--- log2 --> > > Thanks, > - Tsuyoshi > > -- - Tsuyoshi
