Re: MapStatus too large for driver
I tried using org.apache.spark.util.collection.BitSet instead of RoaringBitMap; it saves about 20% of the memory but runs much slower. For the 200K-task job:

RoaringBitMap uses 3 Long[1024] and 1 Short[3392] = 3*64*1024 + 16*3392 = 250880 (bits)
BitSet uses 1 Long[3125] = 3125*64 = 200000 (bits)
Memory saved = (250880 - 200000) / 250880 ≈ 20%

-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MapStatus-too-large-for-drvier-tp14704p14723.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
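The arithmetic above can be checked with a tiny sketch (the numbers are the ones quoted in this thread; `MapStatusMemory` is a made-up name for illustration, not a Spark class):

```scala
// Back-of-the-envelope check of the bitmap sizes discussed above (all in bits).
object MapStatusMemory {
  // RoaringBitMap for ~200K reduce partitions: 3 Long[1024] + 1 Short[3392]
  val roaringBits: Int = 3 * 64 * 1024 + 16 * 3392 // = 250880
  // Plain BitSet: one Long[3125] holds exactly one bit per partition
  val bitSetBits: Int = 3125 * 64                  // = 200000
  val savedFraction: Double = (roaringBits - bitSetBits).toDouble / roaringBits

  def main(args: Array[String]): Unit =
    println(f"Roaring: $roaringBits bits, BitSet: $bitSetBits bits, saved ~${savedFraction * 100}%.0f%%")
}
```

So the BitSet wins on memory by roughly a fifth here, which matches the reported 20%, while the thread notes it is much slower at runtime.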
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
well, it was -08, and ssh stopped working (according to the alerts) just as i was logging in to kill off any errant processes. i've taken that worker offline in jenkins and will be rebooting it asap. on a positive note, i was able to clear out -07 before anything horrible happened to that one. On Tue, Oct 20, 2015 at 3:46 PM, shane knapp wrote: > amp-jenkins-worker-06 is back up. > > my next bets are on -07 and -08... :\ > > https://amplab.cs.berkeley.edu/jenkins/computer/ > > On Tue, Oct 20, 2015 at 3:39 PM, shane knapp wrote: >> here's the related stack trace from dmesg... UID 500 is jenkins. >> >> Out of memory: Kill process 142764 (java) score 40 or sacrifice child >> Killed process 142764, UID 500, (java) total-vm:24685036kB, >> anon-rss:5730824kB, file-rss:64kB >> Uhhuh. NMI received for unknown reason 21 on CPU 0. >> Do you have a strange power saving mode enabled? >> Dazed and confused, but trying to continue >> java: page allocation failure. order:2, mode:0xd0 >> Pid: 142764, comm: java Not tainted 2.6.32-573.3.1.el6.x86_64 #1 >> Call Trace: >> [] ? __alloc_pages_nodemask+0x7dc/0x950 >> [] ? copy_process+0x168/0x1530 >> [] ? do_fork+0x96/0x4c0 >> [] ? sys_futex+0x7b/0x170 >> [] ? sys_clone+0x28/0x30 >> [] ? stub_clone+0x13/0x20 >> [] ? system_call_fastpath+0x16/0x1b >> >> On Tue, Oct 20, 2015 at 3:35 PM, shane knapp wrote: >>> -06 just kinda came back... >>> >>> [root@amp-jenkins-worker-06 ~]# uptime >>> 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, >>> 1635.89 >>> >>> the builds that, from looking at the process table, seem to be at >>> fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly >>> a Spark-Master-SBT matrix build. look at the build history here: >>> https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds >>> >>> the load is dropping significantly and quickly, but swap is borked and >>> i'm still going to reboot. 
>>> >>> On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: starting this saturday (oct 17) we started getting alerts on the jenkins workers that various processes were dying (specifically ssh). since then, we've had half of our workers OOM due to java processes and have had now to reboot two of them (-05 and -06). if we look at the current machine that's wedged (amp-jenkins-worker-06), we see: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ have there been any changes to any of these builds that might have caused this? anyone have any ideas? sadly, even though i saw that -06 was about to OOM and got a shell opened before SSH died, my command prompt is completely unresponsive. :( shane
Set numExecutors by SparkLauncher
Hi all, I want to launch a Spark job on YARN from Java, but it seems that there is no way to set numExecutors in the class SparkLauncher. Is there any way to set numExecutors? Thanks qinggangwa...@gmail.com
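One possible workaround, sketched under the assumption of the Spark 1.5-era launcher API: SparkLauncher has no dedicated numExecutors setter, but the standard YARN conf key can be passed through setConf. The jar path and class name below are placeholders.

```scala
import org.apache.spark.launcher.SparkLauncher

// Sketch only: app resource and main class are hypothetical placeholders.
val spark = new SparkLauncher()
  .setAppResource("/path/to/your-app.jar")
  .setMainClass("com.example.YourApp")
  .setMaster("yarn-cluster")
  .setConf("spark.executor.instances", "4") // the conf key behind --num-executors on YARN
  .launch()
spark.waitFor() // launch() returns a java.lang.Process
```

This relies on `spark.executor.instances` being the YARN equivalent of the `--num-executors` command-line flag; whether that suits your deployment is worth verifying against your Spark version's configuration docs.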
Fwd: If you use Spark 1.5 and disabled Tungsten mode ...
With Jerry's permission, sending this back to the dev list to close the loop. -- Forwarded message -- From: Jerry Lam Date: Tue, Oct 20, 2015 at 3:54 PM Subject: Re: If you use Spark 1.5 and disabled Tungsten mode ... To: Reynold Xin Yup, coarse grained mode works just fine. :) The difference is that by default, coarse grained mode uses 1 core per task. If I constrain 20 cores in total, there can be only 20 tasks running at the same time. However, with fine grained, I cannot set the total number of cores and therefore, there could be 200+ tasks running at the same time (it is dynamic). So it might be that the calculation of how much memory to acquire fails when the number of cores cannot be known ahead of time, because you cannot assume that X tasks are running in an executor? Just my guess... On Tue, Oct 20, 2015 at 6:24 PM, Reynold Xin wrote: > Can you try coarse-grained mode and see if it is the same? > > > On Tue, Oct 20, 2015 at 3:20 PM, Jerry Lam wrote: > >> Hi Reynold, >> >> Yes, I'm using 1.5.1. I see them quite often. Sometimes it recovers but >> sometimes it does not. For one particular job, it failed all the time with >> the acquire-memory issue. I'm using spark on mesos with fine grained mode. >> Does it make a difference? >> >> Best Regards, >> >> Jerry >> >> On Tue, Oct 20, 2015 at 5:27 PM, Reynold Xin wrote: >> >>> Jerry - I think that's been fixed in 1.5.1. Do you still see it? >>> >>> On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam wrote: I disabled it because of the "Could not acquire 65536 bytes of memory". It happens to fail the job. So for now, I'm not touching it. On Tue, Oct 20, 2015 at 4:48 PM, charmee wrote: > We had disabled tungsten after we found few performance issues, but > had to > enable it back because we found that when we had large number of group > by > fields, if tungsten is disabled the shuffle keeps failing. > > Here is an excerpt from one of our engineers with his analysis. 
> > With Tungsten Enabled (default in spark 1.5): > ~90 files of 0.5G each: > > Ingest (after applying broadcast lookups) : 54 min > Aggregation (~30 fields in group by and another 40 in aggregation) : > 18 min > > With Tungsten Disabled: > > Ingest : 30 min > Aggregation : Erroring out > > On smaller tests we found that joins are slow with tungsten enabled. > With > GROUP BY, disabling tungsten is not working in the first place. > > Hope this helps. > > -Charmee > > > > -- > View this message in context: > http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > > >>> >> >
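A hedged sketch of the Mesos configuration difference Jerry describes (Spark ~1.5 era; the core cap of 20 is illustrative): coarse-grained mode plus a total-core cap bounds how many tasks run concurrently, which fine-grained mode cannot do.

```scala
import org.apache.spark.SparkConf

// Sketch: bound concurrency on Mesos via coarse-grained mode.
// In fine-grained mode there is no total-core limit, so the number of
// simultaneously running tasks is dynamic and unbounded by config.
val conf = new SparkConf()
  .set("spark.mesos.coarse", "true") // coarse-grained: long-lived executors
  .set("spark.cores.max", "20")      // at 1 core/task => at most 20 concurrent tasks
```

This matches Jerry's observation that with 20 cores capped there can be only 20 tasks at once, while fine-grained mode could run 200+.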
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
ok, based on the timing, i *think* this might be the culprit: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/console On Tue, Oct 20, 2015 at 3:35 PM, shane knapp wrote: > -06 just kinda came back... > > [root@amp-jenkins-worker-06 ~]# uptime > 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, > 1635.89 > > the builds that, from looking at the process table, seem to be at > fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly > a Spark-Master-SBT matrix build. look at the build history here: > https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds > > the load is dropping significantly and quickly, but swap is borked and > i'm still going to reboot. > > On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: >> starting this saturday (oct 17) we started getting alerts on the >> jenkins workers that various processes were dying (specifically ssh). >> >> since then, we've had half of our workers OOM due to java processes >> and have had now to reboot two of them (-05 and -06). >> >> if we look at the current machine that's wedged (amp-jenkins-worker-06), we >> see: >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ >> >> have there been any changes to any of these builds that might have >> caused this? anyone have any ideas? 
>> >> sadly, even though i saw that -06 was about to OOM and got a shell >> opened before SSH died, my command prompt is completely unresponsive. >> :( >> >> shane
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
amp-jenkins-worker-06 is back up. my next bets are on -07 and -08... :\ https://amplab.cs.berkeley.edu/jenkins/computer/ On Tue, Oct 20, 2015 at 3:39 PM, shane knapp wrote: > here's the related stack trace from dmesg... UID 500 is jenkins. > > Out of memory: Kill process 142764 (java) score 40 or sacrifice child > Killed process 142764, UID 500, (java) total-vm:24685036kB, > anon-rss:5730824kB, file-rss:64kB > Uhhuh. NMI received for unknown reason 21 on CPU 0. > Do you have a strange power saving mode enabled? > Dazed and confused, but trying to continue > java: page allocation failure. order:2, mode:0xd0 > Pid: 142764, comm: java Not tainted 2.6.32-573.3.1.el6.x86_64 #1 > Call Trace: > [] ? __alloc_pages_nodemask+0x7dc/0x950 > [] ? copy_process+0x168/0x1530 > [] ? do_fork+0x96/0x4c0 > [] ? sys_futex+0x7b/0x170 > [] ? sys_clone+0x28/0x30 > [] ? stub_clone+0x13/0x20 > [] ? system_call_fastpath+0x16/0x1b > > On Tue, Oct 20, 2015 at 3:35 PM, shane knapp wrote: >> -06 just kinda came back... >> >> [root@amp-jenkins-worker-06 ~]# uptime >> 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, >> 1635.89 >> >> the builds that, from looking at the process table, seem to be at >> fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly >> a Spark-Master-SBT matrix build. look at the build history here: >> https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds >> >> the load is dropping significantly and quickly, but swap is borked and >> i'm still going to reboot. >> >> On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: >>> starting this saturday (oct 17) we started getting alerts on the >>> jenkins workers that various processes were dying (specifically ssh). >>> >>> since then, we've had half of our workers OOM due to java processes >>> and have had now to reboot two of them (-05 and -06). 
>>> >>> if we look at the current machine that's wedged (amp-jenkins-worker-06), we >>> see: >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ >>> >>> have there been any changes to any of these builds that might have >>> caused this? anyone have any ideas? >>> >>> sadly, even though i saw that -06 was about to OOM and got a shell >>> opened before SSH died, my command prompt is completely unresponsive. >>> :( >>> >>> shane
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
here's the related stack trace from dmesg... UID 500 is jenkins. Out of memory: Kill process 142764 (java) score 40 or sacrifice child Killed process 142764, UID 500, (java) total-vm:24685036kB, anon-rss:5730824kB, file-rss:64kB Uhhuh. NMI received for unknown reason 21 on CPU 0. Do you have a strange power saving mode enabled? Dazed and confused, but trying to continue java: page allocation failure. order:2, mode:0xd0 Pid: 142764, comm: java Not tainted 2.6.32-573.3.1.el6.x86_64 #1 Call Trace: [] ? __alloc_pages_nodemask+0x7dc/0x950 [] ? copy_process+0x168/0x1530 [] ? do_fork+0x96/0x4c0 [] ? sys_futex+0x7b/0x170 [] ? sys_clone+0x28/0x30 [] ? stub_clone+0x13/0x20 [] ? system_call_fastpath+0x16/0x1b On Tue, Oct 20, 2015 at 3:35 PM, shane knapp wrote: > -06 just kinda came back... > > [root@amp-jenkins-worker-06 ~]# uptime > 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, > 1635.89 > > the builds that, from looking at the process table, seem to be at > fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly > a Spark-Master-SBT matrix build. look at the build history here: > https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds > > the load is dropping significantly and quickly, but swap is borked and > i'm still going to reboot. > > On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: >> starting this saturday (oct 17) we started getting alerts on the >> jenkins workers that various processes were dying (specifically ssh). >> >> since then, we've had half of our workers OOM due to java processes >> and have had now to reboot two of them (-05 and -06). 
>> >> if we look at the current machine that's wedged (amp-jenkins-worker-06), we >> see: >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ >> >> have there been any changes to any of these builds that might have >> caused this? anyone have any ideas? >> >> sadly, even though i saw that -06 was about to OOM and got a shell >> opened before SSH died, my command prompt is completely unresponsive. >> :( >> >> shane
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
-06 just kinda came back... [root@amp-jenkins-worker-06 ~]# uptime 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, 1635.89 the builds that, from looking at the process table, seem to be at fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly a Spark-Master-SBT matrix build. look at the build history here: https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds the load is dropping significantly and quickly, but swap is borked and i'm still going to reboot. On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: > starting this saturday (oct 17) we started getting alerts on the > jenkins workers that various processes were dying (specifically ssh). > > since then, we've had half of our workers OOM due to java processes > and have had now to reboot two of them (-05 and -06). > > if we look at the current machine that's wedged (amp-jenkins-worker-06), we > see: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ > > have there been any changes to any of these builds that might have > caused this? anyone have any ideas? > > sadly, even though i saw that -06 was about to OOM and got a shell > opened before SSH died, my command prompt is completely unresponsive. > :( > > shane
BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
starting this saturday (oct 17) we started getting alerts on the jenkins workers that various processes were dying (specifically ssh). since then, we've had half of our workers OOM due to java processes and have had now to reboot two of them (-05 and -06). if we look at the current machine that's wedged (amp-jenkins-worker-06), we see: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ have there been any changes to any of these builds that might have caused this? anyone have any ideas? sadly, even though i saw that -06 was about to OOM and got a shell opened before SSH died, my command prompt is completely unresponsive. :( shane
Re: If you use Spark 1.5 and disabled Tungsten mode ...
Hi Reynold, Yes, I'm using 1.5.1. I see them quite often. Sometimes it recovers but sometimes it does not. For one particular job, it failed all the time with the acquire-memory issue. I'm using spark on mesos with fine grained mode. Does it make a difference? Best Regards, Jerry On Tue, Oct 20, 2015 at 5:27 PM, Reynold Xin wrote: > Jerry - I think that's been fixed in 1.5.1. Do you still see it? > > On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam wrote: > >> I disabled it because of the "Could not acquire 65536 bytes of memory". >> It happens to fail the job. So for now, I'm not touching it. >> >> On Tue, Oct 20, 2015 at 4:48 PM, charmee wrote: >> >>> We had disabled tungsten after we found few performance issues, but had >>> to >>> enable it back because we found that when we had large number of group by >>> fields, if tungsten is disabled the shuffle keeps failing. >>> >>> Here is an excerpt from one of our engineers with his analysis. >>> >>> With Tungsten Enabled (default in spark 1.5): >>> ~90 files of 0.5G each: >>> >>> Ingest (after applying broadcast lookups) : 54 min >>> Aggregation (~30 fields in group by and another 40 in aggregation) : 18 >>> min >>> >>> With Tungsten Disabled: >>> >>> Ingest : 30 min >>> Aggregation : Erroring out >>> >>> On smaller tests we found that joins are slow with tungsten enabled. With >>> GROUP BY, disabling tungsten is not working in the first place. >>> >>> Hope this helps. >>> >>> -Charmee >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html >>> Sent from the Apache Spark Developers List mailing list archive at >>> Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>> For additional commands, e-mail: dev-h...@spark.apache.org >>> >>> >> >
Re: If you use Spark 1.5 and disabled Tungsten mode ...
Jerry - I think that's been fixed in 1.5.1. Do you still see it? On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam wrote: > I disabled it because of the "Could not acquire 65536 bytes of memory". It > happens to fail the job. So for now, I'm not touching it. > > On Tue, Oct 20, 2015 at 4:48 PM, charmee wrote: > >> We had disabled tungsten after we found few performance issues, but had to >> enable it back because we found that when we had large number of group by >> fields, if tungsten is disabled the shuffle keeps failing. >> >> Here is an excerpt from one of our engineers with his analysis. >> >> With Tungsten Enabled (default in spark 1.5): >> ~90 files of 0.5G each: >> >> Ingest (after applying broadcast lookups) : 54 min >> Aggregation (~30 fields in group by and another 40 in aggregation) : 18 >> min >> >> With Tungsten Disabled: >> >> Ingest : 30 min >> Aggregation : Erroring out >> >> On smaller tests we found that joins are slow with tungsten enabled. With >> GROUP BY, disabling tungsten is not working in the first place. >> >> Hope this helps. >> >> -Charmee >> >> >> >> -- >> View this message in context: >> http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html >> Sent from the Apache Spark Developers List mailing list archive at >> Nabble.com. >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> >
Re: If you use Spark 1.5 and disabled Tungsten mode ...
I disabled it because of the "Could not acquire 65536 bytes of memory". It happens to fail the job. So for now, I'm not touching it. On Tue, Oct 20, 2015 at 4:48 PM, charmee wrote: > We had disabled tungsten after we found few performance issues, but had to > enable it back because we found that when we had large number of group by > fields, if tungsten is disabled the shuffle keeps failing. > > Here is an excerpt from one of our engineers with his analysis. > > With Tungsten Enabled (default in spark 1.5): > ~90 files of 0.5G each: > > Ingest (after applying broadcast lookups) : 54 min > Aggregation (~30 fields in group by and another 40 in aggregation) : 18 min > > With Tungsten Disabled: > > Ingest : 30 min > Aggregation : Erroring out > > On smaller tests we found that joins are slow with tungsten enabled. With > GROUP BY, disabling tungsten is not working in the first place. > > Hope this helps. > > -Charmee > > > > -- > View this message in context: > http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: If you use Spark 1.5 and disabled Tungsten mode ...
We had disabled tungsten after we found a few performance issues, but had to enable it back because we found that when we had a large number of group by fields, if tungsten is disabled the shuffle keeps failing. Here is an excerpt from one of our engineers with his analysis. With Tungsten Enabled (default in spark 1.5): ~90 files of 0.5G each: Ingest (after applying broadcast lookups) : 54 min Aggregation (~30 fields in group by and another 40 in aggregation) : 18 min With Tungsten Disabled: Ingest : 30 min Aggregation : Erroring out On smaller tests we found that joins are slow with tungsten enabled. With GROUP BY, disabling tungsten is not working in the first place. Hope this helps. -Charmee -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
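For anyone reproducing the comparison above, a minimal sketch of the toggle being discussed, assuming the Spark 1.5 SQL conf key:

```scala
import org.apache.spark.SparkConf

// Sketch: disable Tungsten in Spark 1.5 (it is enabled by default).
// Note the thread above reports GROUP BY shuffles failing with it disabled,
// so this is for benchmarking, not a recommended production setting.
val conf = new SparkConf()
  .set("spark.sql.tungsten.enabled", "false")
```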
Re: Spark driver reducing total executors count even when Dynamic Allocation is disabled.
Hi Prakhar, Now I understand your problem: you expected that the executor killed by the heartbeat mechanism would be launched again, but it is not. I think this problem is fixed in Spark 1.5; you could check this JIRA: https://issues.apache.org/jira/browse/SPARK-8119 Thanks Saisai On Tuesday, October 20, 2015, prakhar jauhari wrote: > Thanks sai for the input, > > So the problem is : i start my job with some fixed number of executors, > but when a host running my executors goes unreachable, driver reduces the > total number of executors. And never increases it. > > I have a repro for the issue, attaching logs: > Running spark job is configured for 2 executors, dynamic allocation > not enabled !!! > > AM starts requesting the 2 executors: > 15/10/19 12:25:58 INFO yarn.YarnRMClient: Registering the ApplicationMaster > 15/10/19 12:25:59 INFO yarn.YarnAllocator: Will request 2 executor > containers, each with 1 cores and 1408 MB memory including 384 MB overhead > 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, > capability: ) > 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, > capability: ) > 15/10/19 12:25:59 INFO yarn.ApplicationMaster: Started progress reporter > thread - sleep time : 5000 > > Executors launched: > 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : > DN-2:58739 > 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : > DN-1:44591 > 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container > container_1444841612643_0014_01_02 for on host DN-2 > 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. > driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, > executorHostname: DN-2 > 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container > container_1444841612643_0014_01_03 for on host DN-1 > 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. 
> driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, > executorHostname: DN-1 > > Now my AM and executor 1 are running on DN-2, DN-1 has executor 2 running > on it. To reproduce this issue I removed IP from DN-1, until it was timed > out by spark. > 15/10/19 13:03:30 INFO yarn.YarnAllocator: Driver requested a total number > of 1 executor(s). > 15/10/19 13:03:30 INFO yarn.ApplicationMaster: Driver requested to kill > executor(s) 2. > > > So the driver has reduced the total number of executor to : 1 > And now even when the DN comes up and rejoins the cluster, this count is > not increased. > If I had executor 1 running on a separate DN (not the same as AM's DN), > and that DN went unreachable, driver would reduce total number of executor > to : 0 and the job hangs forever. And this is when i have not enabled > Dynamic allocation. My cluster has other DN's available, AM should request > the killed executors from yarn, and get it on some other DN's. > > Regards, > Prakhar > > > On Mon, Oct 19, 2015 at 2:47 PM, Saisai Shao > wrote: > >> This is a deliberate killing request by heartbeat mechanism, have nothing >> to do with dynamic allocation. Here because you're running on yarn mode, so >> "supportDynamicAllocation" will be true, but actually there's no >> relation to dynamic allocation. >> >> From my understanding "doRequestTotalExecutors" is to sync the current >> total executor number with AM, AM will try to cancel some pending container >> requests when current expected executor number is less. The actual >> container killing command is issued by "doRequestTotalExecutors". >> >> Not sure what is your actual problem? is it unexpected? >> >> Thanks >> Saisai >> >> >> On Mon, Oct 19, 2015 at 3:51 PM, prakhar jauhari > > wrote: >> >>> Hey all, >>> >>> Thanks in advance. I ran into a situation where spark driver reduced the >>> total executors count for my job even with dynamic allocation disabled, >>> and >>> caused the job to hang for ever. 
>>> >>> Setup: >>> Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. >>> All servers in cluster running Linux version 2.6.32. >>> Job in yarn-client mode. >>> >>> Scenario: >>> 1. Application running with required number of executors. >>> 2. One of the DN's losses connectivity and is timed out. >>> 2. Spark issues a killExecutor for the executor on the DN which was timed >>> out. >>> 3. Even with dynamic allocation off, spark's driver reduces the >>> "targetNumExecutors". >>> >>> On analysing the code (Spark 1.3.1): >>> >>> When my DN goes unreachable: >>> >>> Spark core's HeartbeatReceiver invokes expireDeadHosts(): which checks if >>> Dynamic Allocation is supported and then invokes "sc.killExecutor()" >>> >>> /if (sc.supportDynamicAllocation) { >>> sc.killExecutor(executorId) >>> }/ >>> >>> Surprisingly supportDynamicAllocation in sparkContext.scala is defined >>> as, >>> resulting "True" if dynamicAllocationTesting flag is enabled or spark is >>> running over "yarn". >>> >>> /private[spark] def supportDynamicAl
Re: Spark driver reducing total executors count even when Dynamic Allocation is disabled.
Thanks sai for the input, So the problem is : i start my job with some fixed number of executors, but when a host running my executors goes unreachable, driver reduces the total number of executors. And never increases it. I have a repro for the issue, attaching logs: Running spark job is configured for 2 executors, dynamic allocation not enabled !!! AM starts requesting the 2 executors: 15/10/19 12:25:58 INFO yarn.YarnRMClient: Registering the ApplicationMaster 15/10/19 12:25:59 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, capability: ) 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, capability: ) 15/10/19 12:25:59 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000 Executors launched: 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : DN-2:58739 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : DN-1:44591 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container container_1444841612643_0014_01_02 for on host DN-2 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, executorHostname: DN-2 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container container_1444841612643_0014_01_03 for on host DN-1 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, executorHostname: DN-1 Now my AM and executor 1 are running on DN-2, DN-1 has executor 2 running on it. To reproduce this issue I removed IP from DN-1, until it was timed out by spark. 15/10/19 13:03:30 INFO yarn.YarnAllocator: Driver requested a total number of 1 executor(s). 15/10/19 13:03:30 INFO yarn.ApplicationMaster: Driver requested to kill executor(s) 2. 
So the driver has reduced the total number of executors to : 1 And now even when the DN comes up and rejoins the cluster, this count is not increased. If I had executor 1 running on a separate DN (not the same as AM's DN), and that DN went unreachable, driver would reduce total number of executors to : 0 and the job hangs forever. And this is when i have not enabled Dynamic allocation. My cluster has other DN's available, AM should request the killed executors from yarn, and get it on some other DN's. Regards, Prakhar On Mon, Oct 19, 2015 at 2:47 PM, Saisai Shao wrote: > This is a deliberate killing request by heartbeat mechanism, have nothing > to do with dynamic allocation. Here because you're running on yarn mode, so > "supportDynamicAllocation" will be true, but actually there's no relation > to dynamic allocation. > > From my understanding "doRequestTotalExecutors" is to sync the current > total executor number with AM, AM will try to cancel some pending container > requests when current expected executor number is less. The actual > container killing command is issued by "doRequestTotalExecutors". > > Not sure what is your actual problem? is it unexpected? > > Thanks > Saisai > > > On Mon, Oct 19, 2015 at 3:51 PM, prakhar jauhari > wrote: > >> Hey all, >> >> Thanks in advance. I ran into a situation where spark driver reduced the >> total executors count for my job even with dynamic allocation disabled, >> and >> caused the job to hang forever. >> >> Setup: >> Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. >> All servers in cluster running Linux version 2.6.32. >> Job in yarn-client mode. >> >> Scenario: >> 1. Application running with required number of executors. >> 2. One of the DNs loses connectivity and is timed out. >> 3. Spark issues a killExecutor for the executor on the DN which was timed >> out. >> 4. Even with dynamic allocation off, spark's driver reduces the >> "targetNumExecutors". 
>>
>> On analysing the code (Spark 1.3.1):
>>
>> When my DN goes unreachable, Spark core's HeartbeatReceiver invokes
>> expireDeadHosts(), which checks whether dynamic allocation is supported
>> and then invokes sc.killExecutor():
>>
>> if (sc.supportDynamicAllocation) {
>>   sc.killExecutor(executorId)
>> }
>>
>> Surprisingly, supportDynamicAllocation in SparkContext.scala is defined
>> to return true if the dynamicAllocationTesting flag is enabled or Spark
>> is running over "yarn":
>>
>> private[spark] def supportDynamicAllocation =
>>   master.contains("yarn") || dynamicAllocationTesting
>>
>> sc.killExecutor() dispatches to the configured schedulerBackend
>> (CoarseGrainedSchedulerBackend in this case) and invokes
>> killExecutors(executorIds).
>>
>> CoarseGrainedSchedulerBackend calculates a "newTotal" for the total
>> number of executors required and sends an update to the application
>> master by invoking doRequestTotalExecutors(newTotal).
>>
>> CoarseGrainedSchedulerBackend then invokes a >
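The guard quoted above can be exercised in isolation. This standalone sketch lifts just that predicate out of the quoted Spark 1.3.1 snippet (the parameterization is mine, for testability) to show that on any YARN master string it passes even when dynamic allocation is disabled:

```scala
// The predicate quoted from SparkContext.scala (Spark 1.3.1), lifted out
// as a standalone function so it can be tested without a SparkContext.
def supportDynamicAllocation(master: String, dynamicAllocationTesting: Boolean): Boolean =
  master.contains("yarn") || dynamicAllocationTesting

// On YARN the guard is true even with dynamic allocation disabled,
// so expireDeadHosts() proceeds to sc.killExecutor().
assert(supportDynamicAllocation("yarn-client", dynamicAllocationTesting = false))
assert(supportDynamicAllocation("yarn-cluster", dynamicAllocationTesting = false))
// On a standalone master, the dead-host path would not kill the executor.
assert(!supportDynamicAllocation("spark://master:7077", dynamicAllocationTesting = false))
```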
Re: MapStatus too large for driver
In our case, we are dealing with 20 TB of text data that is split into about 200k map tasks and 200k reduce tasks, and our driver's memory is 15 GB.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MapStatus-too-large-for-drvier-tp14704p14707.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
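For a sense of scale, here is a rough back-of-envelope estimate (my own, assuming the driver tracks one bit per reduce partition per map task, stored uncompressed) showing why 200k x 200k is dangerous even for a 15 GB heap:

```scala
// Assumption: one bitmap per map task marking which of the reduce-side
// blocks are empty, at one bit per block, held uncompressed on the driver.
val mapTasks    = 200000L
val reduceTasks = 200000L
val bitsPerMapStatus  = reduceTasks
val bytesPerMapStatus = bitsPerMapStatus / 8          // 25,000 bytes each
val totalBytes        = mapTasks * bytesPerMapStatus

assert(bytesPerMapStatus == 25000L)
assert(totalBytes == 5000000000L)                     // ~5 GB of bitmaps alone
```

RoaringBitmap compresses well below this worst case when the empty-block sets are sparse or dense runs, which is why the real figure depends heavily on the data.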
Re: MapStatus too large for driver
How big is your driver heap size? And is there any reason why you'd need 200k map and 200k reduce tasks?

On Mon, Oct 19, 2015 at 11:59 PM, yaoqin wrote:
> Hi everyone,
>
> When I run a spark job that contains quite a lot of tasks (in my case,
> 200,000 * 200,000), the driver hits OOM, mainly caused by the object
> MapStatus.
>
> As shown in the picture below, the RoaringBitmap used to mark which
> blocks are empty seems to use too much memory.
>
> Is there any data structure that can replace RoaringBitmap to fix my
> problem?
>
> Thank you!
>
> Qin.
Ability to offer initial coefficients in ml.LogisticRegression
Hi all,

I noticed that in ml.classification.LogisticRegression, users are not allowed to set initial coefficients, while this is supported in mllib.classification.LogisticRegressionWithSGD. Sometimes we know that specific coefficients are close to the final optimum; e.g., we usually pick yesterday's output model as the initial coefficients, since the data distribution between two days' training samples shouldn't change much. Is there any concern behind not supporting this feature?

--
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China
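To make the warm-start motivation concrete, here is a toy, self-contained gradient-descent logistic regression (not the Spark API; every name here is made up for illustration) showing that resuming from a previously fitted coefficient reaches a lower loss than a cold start in the same number of iterations:

```scala
// Toy 1-D logistic regression, illustrating warm starts. Illustrative only.
val xs = Array(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)
val ys = Array(0.0, 0.0, 0.0, 1.0, 1.0, 1.0)

def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def loss(w: Double): Double =                       // mean log-loss
  xs.zip(ys).map { case (x, y) =>
    val p = sigmoid(w * x)
    -(y * math.log(p) + (1 - y) * math.log(1 - p))
  }.sum / xs.length

def step(w: Double, lr: Double = 0.5): Double = {   // one gradient-descent step
  val grad = xs.zip(ys).map { case (x, y) => (sigmoid(w * x) - y) * x }.sum / xs.length
  w - lr * grad
}

def descend(w0: Double, iters: Int): Double =
  (1 to iters).foldLeft(w0)((w, _) => step(w))

val coldLoss = loss(descend(0.0, 5))     // 5 iterations from scratch
val warm     = descend(0.0, 50)          // "yesterday's" fitted coefficient
val warmLoss = loss(descend(warm, 5))    // 5 iterations resumed from it
assert(warmLoss < coldLoss)              // warm start is further along
```

In mllib this warm start is what passing initial weights into the training run enables; the ml pipeline API as discussed above offers no equivalent knob.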
MapStatus too large for driver
Hi everyone,

When I run a spark job that contains quite a lot of tasks (in my case, 200,000 * 200,000), the driver hits OOM, mainly caused by the object MapStatus.

As shown in the picture below, the RoaringBitmap used to mark which blocks are empty seems to use too much memory.

Is there any data structure that can replace RoaringBitmap to fix my problem?

Thank you!

Qin.

(attached image not included in the archive)
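One candidate replacement worth measuring (a sketch, not a recommendation; whether it wins depends on how sparse the empty-block sets are) is a dense JDK java.util.BitSet, which costs a flat reduceTasks/8 bytes per MapStatus regardless of how many blocks are empty:

```scala
import java.util.BitSet

// For 200,000 reduce partitions, a dense bitset allocates 3,125 64-bit words.
val reduceTasks = 200000
val empties = new BitSet(reduceTasks)
empties.set(0, reduceTasks / 2)            // mark half the blocks empty (arbitrary)

assert(empties.size() == 200000)           // 3125 words * 64 bits
assert(empties.size() / 8 == 25000)        // 25 KB per MapStatus, sparse or dense
assert(empties.get(0) && !empties.get(reduceTasks - 1))
```

RoaringBitmap beats this flat cost when the set is very sparse or made of long runs, but as noted earlier in the thread, a plain bitset can come close in memory while trading off in speed, so it is worth benchmarking both on real shuffle data.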