quot;" One of the offerings from the service we use is EBS migration which
> basically means if a host is about to get evicted, a new host is created
> and the EBS volume is attached to it. When Spark assigns a new executor
> to the newly created instance, it basically can recover all the
rings from the service we use is EBS migration which
basically means if a host is about to get evicted, a new host is created
and the EBS volume is attached to it. When Spark assigns a new executor to
the newly created instance, it basically can recover all the shuffle files
that are already persisted in th
Have you looked at why you are having these shuffles? What is the cause of
these large transformations ending up in shuffle?
Also on your point:
"...then ideally we should expect that when an executor is killed/OOM'd
and a new executor is spawned on the same host, the new executor registers
the shuffle files to itself. Is that so?"
What guarantee is there that the new executor with inherited shuffle files
will succeed?
Also, OOM is often associated with some form of skewed data.
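On the skew point: when one key dominates, the shuffle routes most rows to a single reduce task, and that task is the one that OOMs. A common mitigation is key salting, i.e. aggregating in two rounds on a salted key. The sketch below is plain Python rather than Spark code, and the helper name and bucket count are made up for illustration:

```python
import random
from collections import defaultdict

SALT_BUCKETS = 4  # number of shards each hot key is split into (illustrative)

def salted_count(pairs, buckets=SALT_BUCKETS):
    """Count values per key in two rounds, so no single reducer
    has to see all rows of a hot key at once."""
    # Round 1: aggregate on (key, salt) -- a hot key is spread
    # over `buckets` partial sums instead of one giant group.
    partial = defaultdict(int)
    for key, value in pairs:
        salt = random.randrange(buckets)
        partial[(key, salt)] += value
    # Round 2: strip the salt and combine the (at most `buckets`)
    # small partials per key.
    final = defaultdict(int)
    for (key, _salt), subtotal in partial.items():
        final[key] += subtotal
    return dict(final)

# A skewed dataset: 'hot' dominates.
data = [("hot", 1)] * 1000 + [("cold", 1)] * 10
print(salted_count(data))
```

In Spark the same shape is a groupBy on (key, salt) followed by a second groupBy on key; Spark 3's adaptive query execution can also split skewed shuffle partitions automatically.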
When Spark assigns a new executor to the newly created instance, it
basically can recover all the shuffle files that are already persisted in
the migrated EBS volume
Is this how it works? Do executors recover / re-register the shuffle files
that they found?
So far I have not come across any
ok thanks. guess i am simply misremembering that i saw the shuffle files
getting re-used across jobs (actions). it was probably across stages for
the same job.
in structured streaming this is a pretty big deal. if you join a streaming
dataframe with a large static dataframe each microbatch
Spark can reuse shuffle stages in the same job (action), not cross jobs.
From: Koert Kuipers
Sent: Saturday, July 16, 2022 6:43 PM
To: user
Subject: [EXTERNAL] spark re-use shuffle files not happening
i have seen many jobs where spark re-uses shuffle files (and skips a stage
of a job), which is an awesome feature given how expensive shuffles are,
and i generally now assume this will happen.
however i feel like i am going a little crazy today. i did the simplest
test in spark 3.3.0, basically i
Hello,
I have answered it on Stack Overflow.
Best Regards,
Attila
On Wed, May 12, 2021 at 4:57 PM Chris Thomas
wrote:
Hi,
I am pretty confident I have observed Spark configured with the Shuffle Service
continuing to fetch shuffle files on a node in the event of executor failure,
rather than recompute the shuffle files as happens without the Shuffle Service.
Can anyone confirm this?
(I have a SO question
You can also look at the shuffle file cleanup tricks we do inside of the
ALS algorithm in Spark.
On Fri, Feb 23, 2018 at 6:20 PM, vijay.bvp wrote:
have you looked at
http://apache-spark-user-list.1001560.n3.nabble.com/Limit-Spark-Shuffle-Disk-Usage-td23279.html
and the post mentioned there
https://forums.databricks.com/questions/277/how-do-i-avoid-the-no-space-left-on-device-error.html
also try compressing the output
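The compression hint above maps to a few Spark settings; a sketch of the relevant spark-defaults.conf lines (note these are assumptions to verify against your Spark version — shuffle output and spill compression are already on by default in recent releases):

```
spark.shuffle.compress        true
spark.shuffle.spill.compress  true
spark.io.compression.codec    lz4
```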
Got it. I understood the issue in a different way.
On Thu, Feb 22, 2018 at 9:19 PM Keith Chapman <keithgchap...@gmail.com>
wrote:
My issue is that there is not enough pressure on GC, hence GC is not
kicking in fast enough to delete the shuffle files of previous iterations.
Regards,
Keith.
http://keith-chapman.com
On Thu, Feb 22, 2018 at 6:58 PM, naresh Goud <nareshgoud.du...@gmail.com>
wrote:
> It would be very
3 to 4 iterations in, I get into a situation where I run out of disk space
on the /tmp directory. On further investigation I was able to figure out
that the reason for this is that the shuffle files are still around;
because I have a very large heap, GC has not happened and hence the
shuffle files are not deleted. I was able to confirm this by lowering the
heap size, and I see GC
When the RDD using them goes out of scope.
On Mon, Mar 27, 2017 at 3:13 PM, Ashwin Sai Shankar <ashan...@netflix.com>
wrote:
Thanks Mark! follow up question, do you know when shuffle files are usually
un-referenced?
On Mon, Mar 27, 2017 at 2:35 PM, Mark Hamstra <m...@clearstorydata.com>
wrote:
Shuffle files are cleaned when they are no longer referenced. See
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/ContextCleaner.scala
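ContextCleaner works by holding weak references to RDDs, broadcasts, and shuffle dependencies on the driver; once the owning object is garbage-collected, the associated shuffle files become eligible for deletion. A rough pure-Python analogy of that reference-tracking pattern (not Spark's actual code) using `weakref.finalize`:

```python
import weakref

cleaned = []  # records which shuffle ids have been "cleaned up"

class ShuffleDependency:
    """Stand-in for a driver-side object whose death should trigger cleanup."""
    def __init__(self, shuffle_id):
        self.shuffle_id = shuffle_id

def register_shuffle_for_cleanup(dep):
    # When `dep` is garbage-collected, schedule removal of its shuffle files.
    # In Spark this role is played by ContextCleaner's weak-reference queue.
    weakref.finalize(dep, cleaned.append, dep.shuffle_id)

dep = ShuffleDependency(shuffle_id=0)
register_shuffle_for_cleanup(dep)
assert cleaned == []   # still referenced: files stay on disk
del dep                # last reference dropped -> finalizer fires
assert cleaned == [0]  # cleaner would now delete the shuffle_0_* files
```

The practical consequence: on a driver with a huge heap and little GC pressure, these collections may not happen for a long time, which is exactly why shuffle files can pile up on disk.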
On Mon, Mar 27, 2017 at 12:38 PM, Ashwin Sai Shankar <
ashan...@netflix.com.invalid> wrote:
Hi!
In spark on yarn, when are shuffle files on local disk removed? (Is it when
the app completes or
once all the shuffle files are fetched or end of the stage?)
Thanks,
Ashwin
Hi,
I'm running into consistent failures during a shuffle read while trying to
do a group-by followed by a count aggregation (using the DataFrame API on
Spark 1.5.2).
The shuffle read (in stage 1) fails with
org.apache.spark.shuffle.FetchFailedException: Failed to send RPC
7719188499899260109
Sent from the Apache Spark User List mailing list archive at Nabble.com.
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
or partition as suggested by some google
> search, but in vain..
> any idea?
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/FileNotFoundException-in-appcache-shuffle-files-tp17605p25663.html
Hi all,
We are running a class with PySpark notebooks for data analysis. Some of
the notebooks are fairly long and have a lot of operations. Through the
course of the notebook, the shuffle storage expands considerably and
often exceeds quota (e.g. 1.5GB input expands to 24GB in shuffle
files). Closing
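When shuffle storage balloons like this, it helps to measure which directories are actually growing. A small sketch that totals file sizes under a directory, like `du -sb` (the path is an assumption — point it at your configured spark.local.dir):

```python
import os

def dir_size_bytes(root):
    """Total size of all regular files under `root`, like `du -sb`."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                total += os.path.getsize(path)
            except OSError:
                pass  # file vanished mid-walk (shuffle files come and go)
    return total

if __name__ == "__main__":
    # Hypothetical location -- substitute your spark.local.dir here.
    root = "/tmp"
    print(f"{root}: {dir_size_bytes(root) / 1e9:.2f} GB")
```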
stages in the job UI. They are periodically cleaned up based on available
space of the configured spark.local.dirs paths.
From: Thomas Gerber
Date: Monday, June 29, 2015 at 10:12 PM
To: user
Subject: Shuffle files lifecycle
Hello,
It is my understanding that shuffles are written on disk and that they act
as checkpoints. I wonder if this is true only within a job, or across
jobs. Please note that I use the words job and stage carefully here.
1. can a shuffle created during JobN be used to skip many stages from
JobN+1? Or is the lifecycle of the shuffle files bound to the job that
created them?
2. when are shuffle files actually deleted? Is it TTL based or is it
cleaned when the job is over?
3. we have a very long batch application, and as it goes on, the number of
total tasks for each job gets larger
Hi TD,
That little experiment helped a bit. This time we did not see any
exceptions for about 16 hours but eventually it did throw the same
exceptions as before. The cleaning of the shuffle files also stopped much
before these exceptions happened - about 7-1/2 hours after startup.
I am not quite
Thanks for the response, Conor. I tried with those settings and for a
while it seemed like it was cleaning up shuffle files after itself.
However, after exactly 5 hours it started throwing exceptions and
eventually stopped working again. A sample stack trace is below. What is
curious about 5 hours is that I set the cleaner TTL to 5 hours after
changing the max window size to 1 hour (down from 6
We already do have a cron job in place to clean just the shuffle files.
However, what I would really like to know is whether there is a proper
way of telling Spark to clean up these files once it's done with them.
Thanks
NB
On Mon, Apr 20, 2015 at 10:47 AM, Jeetendra Gangele gangele...@gmail.com
Hi all,
I had posed this query as part of a different thread but did not get a
response there. So creating a new thread hoping to catch someone's
attention.
We are experiencing this issue of shuffle files being left behind and not
being cleaned up by Spark. Since this is a Spark streaming
-*-*-* -prune -exec rm -rf {} \+
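The truncated find command above appears to prune old spark-local-* directories by age. The same idea as a Python sketch (the base path, directory pattern, and age threshold are assumptions; note that deleting files out from under a live application causes FetchFailedExceptions, so only target dirs left behind by finished apps):

```python
import glob
import os
import shutil
import time

MAX_AGE_SECONDS = 6 * 3600  # assumed threshold: anything untouched for 6h

def clean_old_spark_dirs(base="/tmp", pattern="spark-local-*", now=None):
    """Remove spark-local-* directories not modified within MAX_AGE_SECONDS.
    Returns the list of paths removed."""
    now = time.time() if now is None else now
    removed = []
    for path in glob.glob(os.path.join(base, pattern)):
        if os.path.isdir(path) and now - os.path.getmtime(path) > MAX_AGE_SECONDS:
            shutil.rmtree(path, ignore_errors=True)
            removed.append(path)
    return removed
```

Run from cron this is roughly equivalent to the quoted `find ... -prune -exec rm -rf {} \+` approach.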
On 20 April 2015 at 23:12, N B nb.nos...@gmail.com wrote:
didn't seem to help the case where I am coalescing. Also, I don't see any
differences between 'disk only' and 'memory and disk' storage levels;
both of them are having the same problems. I notice large shuffle files
(30-40gb) that only seem to spill a few hundred mb.
On Mon, Feb 23, 2015 at 4:28 PM, Anders Arpteg arp...@spotify.com wrote:
I'm looking @ my yarn container logs for some of the executors which appear
to be failing (with the missing shuffle files). I see exceptions that say
client.TransportClientFactory: Found inactive connection to host/ip:port,
closing it.
Right after that I see shuffle.RetryingBlockFetcher: Exception while
beginning fetch of 1 outstanding blocks. java.io.IOException: Failed
...
On Mon, Feb 23, 2015 at 9:54 PM, Corey Nolet cjno...@gmail.com wrote:
On Mon, Feb 23, 2015 at 4:28 PM, Anders Arpteg arp...@spotify.com wrote:
Sounds very similar to what I experienced, Corey.
For large jobs, the following error message is shown that seems to indicate
that shuffle files for some reason are missing. It's a rather large job
with many partitions. If the data size is reduced, the problem disappears.
I'm running a build from Spark master post 1.2 (build at 2015-01-16
that a single executor was getting a single or a couple large partitions,
but shouldn't the disk persistence kick in at that point?
On Sat, Feb 21, 2015 at 11:20 AM, Anders Arpteg arp...@spotify.com wrote:
Thank you,
Lucio
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/FileNotFoundException-in-appcache-shuffle-files-tp17605p21077.html
/1609
[5] https://github.com/apache/spark/pull/1609#issuecomment-54393908
Hi Ryan,
This is an issue with sort-based shuffle, not consolidated hash-based
shuffle. I guess this issue mostly occurs when the Spark cluster is in an
abnormal situation, maybe a long GC pause or something else; you can check
the system status or whether there are any other exceptions besides this.
Thanks
Jerry
From: nobigdealst...@gmail.com [mailto:nobigdealst...@gmail.com] On Behalf Of
Ryan Williams
Sent: Wednesday, October 29, 2014 1:31 PM
To: user
Subject: FileNotFoundException in appcache shuffle files
My job is failing with the following error:
14/10/29 02:59:14 WARN
Re: Shuffle files
- We set ulimit to 50. But I still get the same too many open files
warning.
- I tried setting consolidateFiles to True, but that did not help either.
I am using a Mesos cluster. Does Mesos have any limit on the number of
open files?
thanks
Cc: Sunny Khatri; Lisonbee, Todd; u...@spark.incubator.apache.org
Subject: Re: Shuffle files
My observation is the opposite. When my job runs under the default
spark.shuffle.manager, I don't see this exception. However, when it runs
with the SORT-based one, I start seeing this error. How would that be
possible?
Is it possible to store Spark shuffle files on Tachyon?
Thanks,
Todd
-Original Message-
From: SK [mailto:skrishna...@gmail.com]
Sent: Tuesday, October 7, 2014 2:12 PM
To: u...@spark.incubator.apache.org
Subject: Re: Shuffle files
- We set ulimit to 50. But I still get the same too many open files
warning.
- I tried setting consolidateFiles to True, but that did not help either.
I am using a Mesos cluster. Does Mesos have any limit on the number of
open files?
thanks
/tmp/spark-local-20140925215712-0319/12/shuffle_0_99_93138 (Too many open
files)
basically I think a lot of shuffle files are being created.
basically I think a lot of shuffle files are being created.
1) The tasks eventually fail and the job just hangs (after taking very long,
more than an hour). If I read these 30 files in a for loop, the same job
Hi SK,
For the problem with lots of shuffle files and the too many open files
exception there are a couple options:
1. The linux kernel has a limit on the number of open files at once. This
is set with ulimit -n, and can be set permanently in /etc/sysctl.conf or
/etc/sysctl.d/. Try increasing
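Besides raising ulimit -n in the shell, a process can inspect and raise its own open-file limit up to the hard cap. A best-effort sketch using Python's stdlib resource module (Unix only; this is an illustration, not something Spark does for you):

```python
import resource

def raise_nofile_limit():
    """Best-effort: raise the soft open-files limit toward the hard limit.
    Returns the resulting (soft, hard) pair."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY and soft < hard:
        try:
            # Unprivileged processes may raise soft up to hard, never beyond.
            resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
        except (ValueError, OSError):
            pass  # platform refused; keep the old limits
    return resource.getrlimit(resource.RLIMIT_NOFILE)

soft, hard = raise_nofile_limit()
print(f"open-file limit: soft={soft} hard={hard}")
```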
is spark.cleaner.referenceTracking), and it is enabled by default.
Thanks
Saisai
From: Michael Chang [mailto:m...@tellapart.com]
Sent: Friday, June 13, 2014 10:15 AM
To: user@spark.apache.org
Subject: Re: Spilled shuffle files not being cleared
Bump
On Mon, Jun 9, 2014 at 3:22 PM, Michael Chang m...@tellapart.com wrote:
Hi all,
I'm seeing exceptions that look like the below in Spark 0.9.1. It looks
like I'm running out of inodes on my machines (I have around 300k each in a
12 machine cluster). I took a quick look and I'm seeing some shuffle spill
files that are still around, even 12 minutes after they are
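Running out of inodes rather than bytes is easy to miss because plain df looks fine; `df -i` (or `os.statvfs`) shows it. A quick sketch to report inode headroom for the filesystem holding a shuffle directory (the path is an assumption — use your spark.local.dir):

```python
import os

def inode_usage(path):
    """Return (total_inodes, free_inodes) for the filesystem holding `path`,
    the same numbers `df -i` reports. Unix only."""
    st = os.statvfs(path)
    return st.f_files, st.f_ffree

total, free = inode_usage("/tmp")  # point at your spark.local.dir
print(f"inodes: {free}/{total} free")
```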
Where on the filesystem does spark write the shuffle files?
--
...:::Aniket:::... Quetzalco@tl