I would expect 2 GB to be enough, or more than enough, for 16 GB executors
(unless ALS is using a bunch of off-heap memory?).  You mentioned earlier
in this thread that the property wasn't showing up in the Environment tab.
 Are you sure it's making it in?
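As a sanity check, it can help to set the overhead explicitly when building the
SparkConf and then read it back from the driver. A minimal sketch along those
lines (the 2048 MB value and the app name are just illustrative):

    // Set the overhead when building the conf, then read it back at runtime
    // to confirm the setting actually reached the SparkConf.
    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("als-overhead-check")                      // illustrative name
      .set("spark.yarn.executor.memoryOverhead", "2048")     // MB; illustrative value

    val sc = new SparkContext(conf)

    // If this prints "<not set>", the property never made it into the conf,
    // which would also explain why it doesn't show up in the Environment tab.
    println(sc.getConf.get("spark.yarn.executor.memoryOverhead", "<not set>"))

If you're passing it via spark-submit instead, make sure the --conf flag comes
before the application jar; anything after the jar is treated as an application
argument and silently ignored.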

-Sandy

On Tue, Sep 9, 2014 at 11:58 AM, Debasish Das <debasish.da...@gmail.com>
wrote:

> Hmm... I did try increasing it by a few GB, but I have not gotten a
> successful run yet...
>
> Any idea what a typical spark.yarn.executor.memoryOverhead would be if I am
> using, say, 40 executors of 16 GB each, for roughly 100M x 10M matrices with
> a few billion ratings?
>
> On Tue, Sep 9, 2014 at 10:49 AM, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
>
>> Hi Deb,
>>
>> The current state of the art is to increase
>> spark.yarn.executor.memoryOverhead until the job stops failing.  We do have
>> plans to try to automatically scale this based on the amount of memory
>> requested, but it will still just be a heuristic.
>>
>> -Sandy
>>
>> On Tue, Sep 9, 2014 at 7:32 AM, Debasish Das <debasish.da...@gmail.com>
>> wrote:
>>
>>> Hi Sandy,
>>>
>>> Any resolution for the YARN failures? It's a blocker for running Spark on
>>> top of YARN.
>>>
>>> Thanks.
>>> Deb
>>>
>>> On Tue, Aug 19, 2014 at 11:29 PM, Xiangrui Meng <men...@gmail.com>
>>> wrote:
>>>
>>>> Hi Deb,
>>>>
>>>> I think this may be the same issue as described in
>>>> https://issues.apache.org/jira/browse/SPARK-2121 . We know that the
>>>> container got killed by YARN because it used much more memory than it
>>>> requested. But we haven't figured out the root cause yet.
>>>>
>>>> +Sandy
>>>>
>>>> Best,
>>>> Xiangrui
>>>>
>>>> On Tue, Aug 19, 2014 at 8:51 PM, Debasish Das <debasish.da...@gmail.com>
>>>> wrote:
>>>> > Hi,
>>>> >
>>>> > During the 4th ALS iteration, I am noticing that one of the executors
>>>> > gets disconnected:
>>>> >
>>>> > 14/08/19 23:40:00 ERROR network.ConnectionManager: Corresponding
>>>> > SendingConnectionManagerId not found
>>>> >
>>>> > 14/08/19 23:40:00 INFO cluster.YarnClientSchedulerBackend: Executor 5
>>>> > disconnected, so removing it
>>>> >
>>>> > 14/08/19 23:40:00 ERROR cluster.YarnClientClusterScheduler: Lost executor 5
>>>> > on tblpmidn42adv-hdp.tdc.vzwcorp.com: remote Akka client disassociated
>>>> >
>>>> > 14/08/19 23:40:00 INFO scheduler.DAGScheduler: Executor lost: 5 (epoch 12)
>>>> >
>>>> > Any idea if this is a bug related to Akka on YARN?
>>>> >
>>>> > I am using the master branch.
>>>> >
>>>> > Thanks.
>>>> > Deb
>>>>
>>>
>>>
>>
>