Re: spark job automatically killed without rhyme or reason

2016-06-23 Thread Aakash Basu
Hey,

I've come across this. There is a command, "yarn application -kill <application ID>",
which kills the application and leaves just a one-line 'Killed'.

If it were a memory issue, the error would show up in the form of 'GC Overhead' or
something of that sort.

So, I think someone killed your job with that command. To the person running the
job, the log will just end with that one word, 'Killed'.
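
For reference, a minimal sketch of the commands involved (the application id below
is only a placeholder):

    yarn application -list                                   # shows running applications and their ids
    yarn application -kill application_1466000000000_0001    # the driver log then typically ends with just "Killed"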

Maybe this is what you faced. Maybe!

Thanks,
Aakash.

Re: spark job automatically killed without rhyme or reason

2016-06-23 Thread Zhiliang Zhu
Thanks a lot for all the comments and the useful information.
Yes, I have a good deal of experience writing and running Spark jobs; something
unstable tends to appear when a job runs on more data or for more time. Sometimes a
job is not okay after a parameter is reset on the command line but is fine once the
parameter is removed and the default is used; sometimes it is the opposite, and a
proper value has to be set explicitly.
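
One way to check which values actually take effect (the defaults versus what was
passed on the command line) is to print the resolved configuration; a minimal
sketch:

    spark-submit --verbose ... 2>&1 | tee submit.log   # prints the parsed arguments and Spark properties
    # the same information also appears in the Environment tab of the application UI (port 4040)
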
Spark 1.5 was installed here by another person.

Re: spark job automatically killed without rhyme or reason

2016-06-22 Thread Nirav Patel
Spark is a memory hog and can be suicidal if a job processes a bigger dataset.
However, Databricks claims that Spark > 1.6 has optimizations for memory footprint
as well as processing; they are only available if you use DataFrames or Datasets.
If you are using RDDs, you have to do a lot of testing and tuning.


Re: spark job automatically killed without rhyme or reason

2016-06-20 Thread Sean Owen
I'm not sure that's the conclusion. It's not trivial to tune and
configure YARN and Spark to match your app's memory needs and profile,
but it's also just a matter of setting them properly. It's not clear
you've set the executor memory, for example, or in particular
spark.yarn.executor.memoryOverhead.

Everything else you mention is a symptom of YARN shutting down your
jobs because your memory settings don't match what your app does.
They're not problems per se, based on what you have provided.
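
Purely as an illustration (the numbers are invented and would need tuning for the
actual app), the relevant knobs on a yarn-client submit look roughly like this:

    spark-submit --master yarn-client \
      --num-executors 20 \
      --driver-memory 8g \
      --executor-memory 8g \
      --conf spark.yarn.executor.memoryOverhead=2048 \
      --class com.dianrong.Main dianrong-retention_2.10-1.0.jar ...

YARN kills a container when executor memory plus overhead exceeds what was
requested, so the overhead (in MB) often has to be raised above its default for
shuffle-heavy jobs.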



Re: spark job automatically killed without rhyme or reason

2016-06-20 Thread Zhiliang Zhu
Hi Alexander,

Thanks a lot for your comments.

Spark does not seem that stable when it comes to running a big job (too much data or
too much time); yes, the problem is gone when the scale is reduced. Sometimes
resetting a job parameter helps (for example, --driver-memory may help with a GC
issue); sometimes the code has to be rewritten with another algorithm.

As you commented on the shuffle operation, that does sound like the reason ...

Best Wishes!
 


RE: spark job automatically killed without rhyme or reason

2016-06-17 Thread Alexander Kapustin
Hi Zhiliang,

Yes, finding the exact reason for a failure is very difficult. We had an issue with
similar behavior; due to limited time for investigation, we reduced the amount of
processed data and the problem went away.

Some points which may help you in your investigation (see the commands sketched
below):

· If you start the spark-history-server (or monitor the running application on
port 4040), look into the failed stages, if any. By default Spark retries a stage
execution 2 times; after that the job fails.

· Some useful information may be contained in the YARN logs on the Hadoop nodes
(yarn--nodemanager-.log), but this is only information about the killed container,
not about the reasons why the stage took so much memory.
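
For example (ports are the defaults and the log path is only illustrative, adjust
to your setup):

    # while the job is running, the driver UI is at http://<driver-host>:4040
    # after the job, start the history server and browse http://<host>:18080
    $SPARK_HOME/sbin/start-history-server.sh
    # on the Hadoop node that hosted the killed container, the nodemanager log
    # usually records the kill, e.g. lines mentioning "Killing container"
    grep -i "killing container" /var/log/hadoop-yarn/yarn-*-nodemanager-*.log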

As I can see in your logs, the failed step relates to a shuffle operation; could you
change your job to avoid the massive shuffle?

--
WBR, Alexander

From: Zhiliang Zhu <zchl.j...@yahoo.com.INVALID>
Sent: 17 June 2016, 14:10
To: User <user@spark.apache.org>; kp...@hotmail.com
Subject: Re: spark job automatically killed without rhyme or reason

Hi Alexander,

Is your yarn userlog just the executor log?

Those logs seem a little difficult to use for deciding exactly where things went
wrong, since a successful job may also show some of these errors ... but repairs
itself. Spark does not seem that stable currently ...

Thank you in advance~


Re: spark job automatically killed without rhyme or reason

2016-06-17 Thread Zhiliang Zhu



Hi Alexander,

Thanks a lot for your reply.

Yes, submitted via YARN. Do you just mean the executor log file obtained with
yarn logs -applicationId id?

In this file, in both stdout and stderr of some containers:

16/06/17 14:05:40 INFO client.TransportClientFactory: Found inactive connection to ip-172-31-20-104/172.31.20.104:49991, creating a new one.
16/06/17 14:05:40 ERROR shuffle.RetryingBlockFetcher: Exception while beginning fetch of 1 outstanding blocks
java.io.IOException: Failed to connect to ip-172-31-20-104/172.31.20.104:49991    <-- may this be because Spark is not stable, and Spark repairs itself after these kinds of errors? (saw some in a successful run)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:193)
        at org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:156)
Caused by: java.net.ConnectException: Connection refused: ip-172-31-20-104/172.31.20.104:49991
        at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
        at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:739)
        at io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:224)
        at io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:289)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:528)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:468)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:382)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:354)
        at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:111)

16/06/17 11:54:38 ERROR executor.Executor: Managed memory leak detected; size = 16777216 bytes, TID = 100323    <-- would this be a memory leak issue? though no GC exception was thrown, as for other normal kinds of out-of-memory
16/06/17 11:54:38 ERROR executor.Executor: Exception in task 145.0 in stage 112.0 (TID 100323)
java.io.IOException: Filesystem closed
        at org.apache.hadoop.hdfs.DFSClient.checkOpen(DFSClient.java:837)
        at org.apache.hadoop.hdfs.DFSInputStream.close(DFSInputStream.java:679)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
...

Sorry, there is some information like this in the middle of the log file, but all is
okay in the end part of the log.

In the run log file (log_file), generated by the command:

nohup spark-submit --driver-memory 20g --num-executors 20 --class com.dianrong.Main --master yarn-client dianrong-retention_2.10-1.0.jar doAnalysisExtremeLender /tmp/drretention/test/output 0.96 /tmp/drretention/evaluation/test_karthik/lgmodel /tmp/drretention/input/feature_6.0_20151001_20160531_behavior_201511_201604_summary/lenderId_feature_live 50 > log_file

executor 40 lost    <-- would it be due to this? sometimes a job may fail for this reason
..
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:903)
        at java.io.DataInputStream.readFully(DataInputStream.java:195)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripeFooter(RecordReaderImpl.java:2265)
        at org.apache.hadoop.hive.ql.io.orc.RecordReaderImpl.readStripe(RecordReaderImpl.java:2635)
..

Thanks in advance!


 



RE: spark job automatically killed without rhyme or reason

2016-06-17 Thread Alexander Kapustin
Hi,

Did you submit the Spark job via YARN? In some cases (probably memory
configuration), YARN can kill the containers where Spark tasks are executed. In this
situation, please check the YARN userlogs for more information…
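
For instance (the application id is only a placeholder), the aggregated container
logs can be pulled and scanned for the reason along these lines:

    yarn logs -applicationId application_1466000000000_0001 > app.log
    grep -iE "error|killed|memory" app.log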

--
WBR, Alexander


Re: spark job automatically killed without rhyme or reason

2016-06-17 Thread Zhiliang Zhu
Has anyone ever met a similar problem? It is quite strange ...

On Friday, June 17, 2016 2:13 PM, Zhiliang Zhu <zchl.j...@yahoo.com.INVALID> wrote:
 

Hi All,

I have a big job which takes more than one hour to run in full; however, it quite
unreasonably exits and finishes midway (almost 80% of the job actually completed,
but not all of it), without any apparent error or exception in the log.

I have submitted the same job many times, and it is always the same. The last line
of the run log is just the one word "killed", or sometimes there is no wrong log at
all; everything seems okay, but the job should not have finished.

What is the way to approach this problem? Have any other friends ever met a similar
issue ...

Thanks in advance!
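
As a first check (the application id below is only a placeholder), YARN itself
usually records how the application ended, including a diagnostics message:

    yarn application -status application_1466000000000_0001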