Hi Shrikant.

I think perhaps this belongs on the Spark user mailing list, not the dev
mailing list.
That being said, is the root cause perhaps that the k8s pod is shut down? Pods 
in k8s are ephemeral and might be shut down at any time (and the containers 
therein restarted in a new pod). This is especially true if you're using spot 
instances.
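
If spot reclamation does turn out to be the cause, one option might be to keep
the Spark pods on on-demand nodes with a node selector set at submission time.
A rough sketch using SparkLauncher follows; the label key/value, master URL,
jar path and main class are placeholders, not anything from your setup:

    import org.apache.spark.launcher.SparkLauncher

    // Schedule Spark pods only on nodes labelled as on-demand, so they are not
    // reclaimed the way spot nodes can be. "node-lifecycle=on-demand" is an
    // example label; use whatever labels your node pools actually carry.
    val handle = new SparkLauncher()
      .setMaster("k8s://https://kubernetes.default.svc")  // illustrative master URL
      .setDeployMode("cluster")
      .setAppResource("local:///opt/app/my-job.jar")      // illustrative jar path
      .setMainClass("com.example.MyJob")                  // illustrative main class
      .setConf("spark.kubernetes.node.selector.node-lifecycle", "on-demand")
      .startApplication()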

BR, Martin
________________________________
From: Shrikant Prasad <shrikant....@gmail.com>
Sent: Wednesday, November 9, 2022 12:11
To: Dongjoon Hyun <dongjoon.h...@gmail.com>
Cc: dev <dev@spark.apache.org>
Subject: Re: Spark Context Shutdown

I have gone through the debug logs of the jobs. There are no failures or
exceptions in the logs.
This issue does not seem to be specific to particular jobs, as several of our
jobs have been impacted by it and the same jobs also pass on retry.

I am trying to figure out why the driver pod is getting deleted when this issue
occurs. Even if there were some error, the driver pod should remain there in the
Error state.

What could be the potential reasons for the driver pod deletion, so that we can
investigate in that direction?

Regards,
Shrikant

On Sat, 29 Oct 2022 at 1:14 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Maybe enable DEBUG-level logging in your job and follow the processing logic
until the failure?
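
For example, something along these lines from inside the job (a minimal sketch,
assuming you create or reuse a SparkSession as usual):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Raise log verbosity at runtime so the driver log shows what the scheduler
    // was doing right before the shutdown. Valid levels include ERROR, WARN,
    // INFO and DEBUG. Messages emitted before the context is up still need a
    // log4j configuration change in the image instead.
    spark.sparkContext.setLogLevel("DEBUG")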

BTW, you need to look at what happens during job processing.

`Spark Context was shutdown` is not the root cause, but the result of job 
failure in most cases.
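
One way to make that visible is to register a listener that logs each job's
outcome and the application-end event, so a job failure shows up right next to
the shutdown message in the driver log. A rough sketch, not a drop-in fix:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerApplicationEnd, SparkListenerJobEnd}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().getOrCreate()

    // Record each job's result and the moment the application ends.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
        println(s"Job ${jobEnd.jobId} ended with result: ${jobEnd.jobResult}")

      override def onApplicationEnd(end: SparkListenerApplicationEnd): Unit =
        println(s"Application end event at time ${end.time}")
    })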

Dongjoon.

On Fri, Oct 28, 2022 at 12:10 AM Shrikant Prasad <shrikant....@gmail.com> wrote:
Thanks, Dongjoon, for replying. I have tried with Spark 3.2 and am still facing
the same issue.

I am looking for some pointers that can help in debugging to find the root cause.

Regards,
Shrikant

On Thu, 27 Oct 2022 at 10:36 PM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, Shrikant.

It seems that you are using non-GA features.

FYI, Kubernetes support became GA in the community as of Apache Spark 3.1.1.

    
https://spark.apache.org/releases/spark-release-3-1-1.html

In addition, Apache Spark 3.1 reached EOL last month.

Could you try the latest distribution, such as Apache Spark 3.3.1, to see
whether you are still experiencing the same issue?

It will reduce the scope of your issues by excluding many known and fixed bugs
in 3.0/3.1/3.2/3.3.0.

Thanks,
Dongjoon.


On Wed, Oct 26, 2022 at 11:16 PM Shrikant Prasad <shrikant....@gmail.com> wrote:
Hi Everyone,

We are using Spark 3.0.1 with the Kubernetes resource manager. We are facing an
intermittent issue in which the driver pod gets deleted and the driver logs
contain the message that the Spark Context was shut down.

The same job works fine with the given set of configurations most of the time,
but sometimes it fails. It mostly occurs while reading or writing Parquet files
to HDFS (though we are not sure if that is the only use case affected).
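
For reference, the failing step is roughly of this shape (the paths below are
illustrative, not our actual ones); the explicit catch is only there so any
write failure is logged before the context goes down:

    import org.apache.spark.sql.SparkSession
    import scala.util.control.NonFatal

    val spark = SparkSession.builder().getOrCreate()

    // Read Parquet from HDFS and write it back out; the intermittent failure
    // happens around this kind of step.
    val df = spark.read.parquet("hdfs:///data/input")
    try {
      df.write.mode("overwrite").parquet("hdfs:///data/output")
    } catch {
      case NonFatal(e) =>
        System.err.println(s"Parquet write failed: $e")
        throw e
    }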

Any pointers to find the root cause?

Most of the earlier reported issues mention executors hitting OOM as the cause,
but we have not seen an OOM error in any of the executors. Also, why would the
context be shut down in this case instead of retrying with new executors?
Another question is why the driver pod gets deleted. Shouldn't it just error out?

Regards,
Shrikant

--
Regards,
Shrikant Prasad
