That's right.

On Tue, Sep 9, 2014 at 2:04 PM, Debasish Das <debasish.da...@gmail.com> wrote:
> Last time it did not show up on the Environment tab, but I will give it
> another shot... Expected behavior is that this env variable will show up,
> right?
>
> On Tue, Sep 9, 2014 at 12:15 PM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>
>> I would expect 2 GB would be enough or more than enough for 16 GB
>> executors (unless ALS is using a bunch of off-heap memory?). You mentioned
>> earlier in this thread that the property wasn't showing up in the
>> Environment tab. Are you sure it's making it in?
>>
>> -Sandy
>>
>> On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>
>>> Hmm... I did try increasing it to a few GB but have not gotten a
>>> successful run yet...
>>>
>>> Any idea, if I am using say 40 executors, each running 16 GB, what a
>>> typical spark.yarn.executor.memoryOverhead would be for, say, 100M x 10M
>>> large matrices with a few billion ratings?
>>>
>>> On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza <sandy.r...@cloudera.com> wrote:
>>>
>>>> Hi Deb,
>>>>
>>>> The current state of the art is to increase
>>>> spark.yarn.executor.memoryOverhead until the job stops failing. We do
>>>> have plans to try to automatically scale this based on the amount of
>>>> memory requested, but it will still just be a heuristic.
>>>>
>>>> -Sandy
>>>>
>>>> On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>
>>>>> Hi Sandy,
>>>>>
>>>>> Any resolution for the YARN failures? It's a blocker for running Spark
>>>>> on top of YARN.
>>>>>
>>>>> Thanks.
>>>>> Deb
>>>>>
>>>>> On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng <men...@gmail.com> wrote:
>>>>>
>>>>>> Hi Deb,
>>>>>>
>>>>>> I think this may be the same issue as described in
>>>>>> https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
>>>>>> container got killed by YARN because it used much more memory than it
>>>>>> requested. But we haven't figured out the root cause yet.
>>>>>>
>>>>>> +Sandy
>>>>>>
>>>>>> Best,
>>>>>> Xiangrui
>>>>>>
>>>>>> On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das <debasish.da...@gmail.com> wrote:
>>>>>> > Hi,
>>>>>> >
>>>>>> > During the 4th ALS iteration, I am noticing that one of the
>>>>>> > executors gets disconnected:
>>>>>> >
>>>>>> > 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
>>>>>> > SendingConnectionManagerId not found
>>>>>> >
>>>>>> > 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
>>>>>> > disconnected, so removing it
>>>>>> >
>>>>>> > 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost
>>>>>> > executor 5 on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client
>>>>>> > disassociated
>>>>>> >
>>>>>> > 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5
>>>>>> > (epoch 12)
>>>>>> >
>>>>>> > Any idea if this is a bug related to Akka on YARN?
>>>>>> >
>>>>>> > I am using master.
>>>>>> >
>>>>>> > Thanks.
>>>>>> > Deb
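
For reference, spark.yarn.executor.memoryOverhead is set like any other Spark
configuration property at submit time. A minimal sketch; the job class, jar
name, and the 2048 MB value are illustrative assumptions, not recommendations
from this thread:

```shell
# Sketch: request extra off-heap headroom for each YARN container.
# The overhead value is in MB; the advice above is to keep increasing it
# until YARN stops killing containers for exceeding their memory limit.
spark-submit \
  --master yarn-client \
  --conf spark.executor.memory=16g \
  --conf spark.yarn.executor.memoryOverhead=2048 \
  --class com.example.ALSJob \
  als-job.jar
```

Whether the property actually took effect can be confirmed on the Environment
tab of the Spark UI, as discussed above.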