Just wanted to point out: raising the memory overhead (as I saw in the logs)
was the fix for this issue, and I have not seen dying executors since this
value was increased.
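For anyone hitting the same thing, here is a minimal sketch of raising that
overhead per application. The app name, executor memory, and 1024 MB figure
are illustrative; in Spark 1.x on YARN the setting is
spark.yarn.executor.memoryOverhead, specified in megabytes:

    import org.apache.spark.{SparkConf, SparkContext}

    // Hypothetical values; the overhead gives each executor extra off-heap
    // headroom beyond spark.executor.memory so YARN does not kill the container.
    val conf = new SparkConf()
      .setAppName("large-job")
      .set("spark.executor.memory", "8g")
      .set("spark.yarn.executor.memoryOverhead", "1024") // ~1 GB, in MB
    val sc = new SparkContext(conf)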
On Tue, Feb 24, 2015 at 3:52 AM, Anders Arpteg wrote:
> If you're thinking of the YARN memory overhead, then yes, I have increased
>
If you're thinking of the YARN memory overhead, then yes, I have increased
that as well. However, I'm glad to say that my job finally finished
successfully. Besides the timeout and memory settings, performing
repartitioning (with shuffling) at the right time seems to be the key to
making this large job work.
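To make that concrete, a small sketch of forcing an explicit repartition
(which always shuffles) right before the heavy stages. The input path,
partition count, and app name below are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    object RepartitionSketch {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("repartition-sketch"))
        // Hypothetical input path and partition count, purely illustrative.
        val lines = sc.textFile("hdfs:///tmp/large-input")
        val repartitioned = lines.repartition(2000) // forces a shuffle into 2000 partitions
        // ...the wide, memory-hungry transformations would follow here...
        println(repartitioned.count())
        sc.stop()
      }
    }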
I *think* this may have been related to the default memory overhead setting
being too low. I raised the value to 1G and tried my job again, but I had
to leave the office before it finished. It did get further, but I'm not
exactly sure if that's just because I raised the memory. I'll see tomorrow.
I've got the opposite problem with regard to partitioning. I've got over
6000 partitions for some of these RDDs, which immediately blows the heap
somehow; I'm still not exactly sure how. If I coalesce them down to about
600-800 partitions, I get the problems where the executors are dying
without an
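For reference, a quick sketch of the two ways that shrink can be done; the
RDD name and path are hypothetical, and the shuffle flag is the knob that
changes the behavior:

    // Assumes an existing SparkContext `sc` (e.g. in spark-shell); the path is hypothetical.
    val wideRdd = sc.textFile("hdfs:///tmp/wide-input")        // imagine ~6000 partitions here

    val narrowed   = wideRdd.coalesce(800)                     // no shuffle; merges partitions in place
    val reshuffled = wideRdd.coalesce(800, shuffle = true)     // full shuffle; same as repartition(800)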
Sounds very similar to what I experienced, Corey. Something that seems to at
least help with my problems is to have more partitions. I am already fighting
between ending up with too many partitions in the end and having too few in
the beginning. By coalescing as late as possible and avoiding too few i
I'm looking at my YARN container logs for some of the executors which appear
to be failing (with the missing shuffle files). I see exceptions that say
"client.TransportClientFactory: Found inactive connection to host/ip:port,
closing it."
Right after that I see "shuffle.RetryingBlockFetcher: Excepti
No, unfortunately we're not making use of dynamic allocation or the
external shuffle service. We're hoping we can reconfigure our cluster to
make use of it, but since it requires changes to the cluster itself (and
not just the Spark app), it could take some time.
Unsure if task 450 was acting as
Do you guys have dynamic allocation turned on for YARN?
Anders, was Task 450 in your job acting like a Reducer and fetching the Map
spill output data from a different node?
If a Reducer task can't read the remote data it needs, that could cause the
stage to fail. Sometimes this forces the previou
Could you try to turn on the external shuffle service?
spark.shuffle.service.enabled=true
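A minimal sketch of the application-side settings, with illustrative values;
note that the external shuffle service itself also has to be registered as a
NodeManager auxiliary service on the cluster, which is the part that needs a
cluster-level change:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-service-sketch")
      .set("spark.shuffle.service.enabled", "true")
      // Dynamic allocation is optional but depends on the shuffle service;
      // the executor bounds below are hypothetical.
      .set("spark.dynamicAllocation.enabled", "true")
      .set("spark.dynamicAllocation.minExecutors", "2")
      .set("spark.dynamicAllocation.maxExecutors", "50")
    val sc = new SparkContext(conf)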
On 21.2.2015. 17:50, Corey Nolet wrote:
I'm experiencing the same issue. Upon closer inspection I'm noticing
that executors are being lost as well. Thing is, I can't figure out
how they are dying. I'm u
I'm experiencing the same issue. Upon closer inspection I'm noticing that
executors are being lost as well. Thing is, I can't figure out how they are
dying. I'm using MEMORY_AND_DISK_SER and I've got over 1.3TB of memory
allocated for the application. I was thinking perhaps it was possible that
a s
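For context, a small sketch of that persistence level in use; the
SparkContext `sc`, input path, and transformation are hypothetical:

    import org.apache.spark.storage.StorageLevel

    // Assumes an existing SparkContext `sc` (e.g. in spark-shell).
    val cached = sc.textFile("hdfs:///tmp/events")       // hypothetical input
      .map(_.split(","))
      .persist(StorageLevel.MEMORY_AND_DISK_SER)         // serialized in memory, spilling to disk
    cached.count()                                        // materializes the cache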