Re: MapStatus too large for driver
I tried using org.apache.spark.util.collection.BitSet instead of RoaringBitMap; it saves about 20% of the memory but runs much slower. For the 200K-task job:

RoaringBitMap uses 3 Long[1024] and 1 Short[3392] = 3*64*1024 + 16*3392 = 250880 (bits)
BitSet uses 1 Long[3125] = 3125*64 = 200000 (bits)
Memory saved = (250880 - 200000) / 250880 ≈ 20%

-- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MapStatus-too-large-for-drvier-tp14704p14723.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com. - To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org For additional commands, e-mail: dev-h...@spark.apache.org
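The arithmetic above can be checked with a tiny sketch (the numbers are the ones quoted in this thread; `MapStatusMemory` is a made-up name for illustration, not a Spark class):

```scala
// Back-of-the-envelope check of the bitmap sizes discussed above (all in bits).
object MapStatusMemory {
  // RoaringBitMap for ~200K reduce partitions: 3 Long[1024] + 1 Short[3392]
  val roaringBits: Int = 3 * 64 * 1024 + 16 * 3392 // = 250880
  // Plain BitSet: one Long[3125] holds exactly one bit per partition
  val bitSetBits: Int = 3125 * 64                  // = 200000
  val savedFraction: Double = (roaringBits - bitSetBits).toDouble / roaringBits

  def main(args: Array[String]): Unit =
    println(f"Roaring: $roaringBits bits, BitSet: $bitSetBits bits, saved ~${savedFraction * 100}%.0f%%")
}
```

So the BitSet wins on memory by roughly a fifth here, which matches the reported 20%, while the thread notes it is much slower at runtime.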
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
well, it was -08, and ssh stopped working (according to the alerts) just as i was logging in to kill off any errant processes. i've taken that worker offline in jenkins and will be rebooting it asap. on a positive note, i was able to clear out -07 before anything horrible happened to that one. On Tue, Oct 20, 2015 at 3:46 PM, shane knapp wrote: > amp-jenkins-worker-06 is back up. > > my next bets are on -07 and -08... :\ > > https://amplab.cs.berkeley.edu/jenkins/computer/ > > On Tue, Oct 20, 2015 at 3:39 PM, shane knapp wrote: >> here's the related stack trace from dmesg... UID 500 is jenkins. >> >> Out of memory: Kill process 142764 (java) score 40 or sacrifice child >> Killed process 142764, UID 500, (java) total-vm:24685036kB, >> anon-rss:5730824kB, file-rss:64kB >> Uhhuh. NMI received for unknown reason 21 on CPU 0. >> Do you have a strange power saving mode enabled? >> Dazed and confused, but trying to continue >> java: page allocation failure. order:2, mode:0xd0 >> Pid: 142764, comm: java Not tainted 2.6.32-573.3.1.el6.x86_64 #1 >> Call Trace: >> [] ? __alloc_pages_nodemask+0x7dc/0x950 >> [] ? copy_process+0x168/0x1530 >> [] ? do_fork+0x96/0x4c0 >> [] ? sys_futex+0x7b/0x170 >> [] ? sys_clone+0x28/0x30 >> [] ? stub_clone+0x13/0x20 >> [] ? system_call_fastpath+0x16/0x1b >> >> On Tue, Oct 20, 2015 at 3:35 PM, shane knapp wrote: >>> -06 just kinda came back... >>> >>> [root@amp-jenkins-worker-06 ~]# uptime >>> 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, >>> 1635.89 >>> >>> the builds that, from looking at the process table, seem to be at >>> fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly >>> a Spark-Master-SBT matrix build. look at the build history here: >>> https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds >>> >>> the load is dropping significantly and quickly, but swap is borked and >>> i'm still going to reboot. 
>>> >>> On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: starting this saturday (oct 17) we started getting alerts on the jenkins workers that various processes were dying (specifically ssh). since then, we've had half of our workers OOM due to java processes and have had now to reboot two of them (-05 and -06). if we look at the current machine that's wedged (amp-jenkins-worker-06), we see: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ have there been any changes to any of these builds that might have caused this? anyone have any ideas? sadly, even though i saw that -06 was about to OOM and got a shell opened before SSH died, my command prompt is completely unresponsive. :( shane
Set numExecutors by SparkLauncher
Hi all, I want to launch a Spark job on YARN from Java, but it seems that there is no way to set numExecutors in the class SparkLauncher. Is there any way to set numExecutors? Thanks qinggangwa...@gmail.com
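One possible workaround, sketched under the assumption of the Spark 1.5-era launcher API: SparkLauncher has no dedicated numExecutors setter, but the standard YARN conf key can be passed through setConf. The jar path and class name below are placeholders.

```scala
import org.apache.spark.launcher.SparkLauncher

// Sketch only: app resource and main class are hypothetical placeholders.
val spark = new SparkLauncher()
  .setAppResource("/path/to/your-app.jar")
  .setMainClass("com.example.YourApp")
  .setMaster("yarn-cluster")
  .setConf("spark.executor.instances", "4") // the conf key behind --num-executors on YARN
  .launch()
spark.waitFor() // launch() returns a java.lang.Process
```

This relies on `spark.executor.instances` being the YARN equivalent of the `--num-executors` command-line flag; whether that suits your deployment is worth verifying against your Spark version's configuration docs.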
Fwd: If you use Spark 1.5 and disabled Tungsten mode ...
With Jerry's permission, sending this back to the dev list to close the loop. -- Forwarded message -- From: Jerry Lam Date: Tue, Oct 20, 2015 at 3:54 PM Subject: Re: If you use Spark 1.5 and disabled Tungsten mode ... To: Reynold Xin Yup, coarse grained mode works just fine. :) The difference is that by default, coarse grained mode uses 1 core per task. If I constrain 20 cores in total, there can be only 20 tasks running at the same time. However, with fine grained, I cannot set the total number of cores and therefore, there could be 200+ tasks running at the same time (it is dynamic). So it might be that the calculation of how much memory to acquire fails when the number of cores cannot be known ahead of time, because you cannot assume that X tasks are running in an executor? Just my guess... On Tue, Oct 20, 2015 at 6:24 PM, Reynold Xin wrote: > Can you try coarse-grained mode and see if it is the same? > > > On Tue, Oct 20, 2015 at 3:20 PM, Jerry Lam wrote: > >> Hi Reynold, >> >> Yes, I'm using 1.5.1. I see them quite often. Sometimes it recovers but >> sometimes it does not. For one particular job, it failed all the time with >> the acquire-memory issue. I'm using spark on mesos with fine grained mode. >> Does it make a difference? >> >> Best Regards, >> >> Jerry >> >> On Tue, Oct 20, 2015 at 5:27 PM, Reynold Xin wrote: >> >>> Jerry - I think that's been fixed in 1.5.1. Do you still see it? >>> >>> On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam wrote: I disabled it because of the "Could not acquire 65536 bytes of memory". It happens to fail the job. So for now, I'm not touching it. On Tue, Oct 20, 2015 at 4:48 PM, charmee wrote: > We had disabled tungsten after we found few performance issues, but > had to > enable it back because we found that when we had large number of group > by > fields, if tungsten is disabled the shuffle keeps failing. > > Here is an excerpt from one of our engineers with his analysis. 
> > With Tungsten Enabled (default in spark 1.5): > ~90 files of 0.5G each: > > Ingest (after applying broadcast lookups) : 54 min > Aggregation (~30 fields in group by and another 40 in aggregation) : > 18 min > > With Tungsten Disabled: > > Ingest : 30 min > Aggregation : Erroring out > > On smaller tests we found that joins are slow with tungsten enabled. > With > GROUP BY, disabling tungsten is not working in the first place. > > Hope this helps. > > -Charmee > > > > -- > View this message in context: > http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > > >>> >> >
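A hedged sketch of the Mesos configuration difference Jerry describes (Spark ~1.5 era; the core cap of 20 is illustrative): coarse-grained mode plus a total-core cap bounds how many tasks run concurrently, which fine-grained mode cannot do.

```scala
import org.apache.spark.SparkConf

// Sketch: bound concurrency on Mesos via coarse-grained mode.
// In fine-grained mode there is no total-core limit, so the number of
// simultaneously running tasks is dynamic and unbounded by config.
val conf = new SparkConf()
  .set("spark.mesos.coarse", "true") // coarse-grained: long-lived executors
  .set("spark.cores.max", "20")      // at 1 core/task => at most 20 concurrent tasks
```

This matches Jerry's observation that with 20 cores capped there can be only 20 tasks at once, while fine-grained mode could run 200+.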
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
ok, based on the timing, i *think* this might be the culprit: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/console On Tue, Oct 20, 2015 at 3:35 PM, shane knapp wrote: > -06 just kinda came back... > > [root@amp-jenkins-worker-06 ~]# uptime > 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, > 1635.89 > > the builds that, from looking at the process table, seem to be at > fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly > a Spark-Master-SBT matrix build. look at the build history here: > https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds > > the load is dropping significantly and quickly, but swap is borked and > i'm still going to reboot. > > On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: >> starting this saturday (oct 17) we started getting alerts on the >> jenkins workers that various processes were dying (specifically ssh). >> >> since then, we've had half of our workers OOM due to java processes >> and have had now to reboot two of them (-05 and -06). >> >> if we look at the current machine that's wedged (amp-jenkins-worker-06), we >> see: >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ >> >> have there been any changes to any of these builds that might have >> caused this? anyone have any ideas? 
>> >> sadly, even though i saw that -06 was about to OOM and got a shell >> opened before SSH died, my command prompt is completely unresponsive. >> :( >> >> shane
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
amp-jenkins-worker-06 is back up. my next bets are on -07 and -08... :\ https://amplab.cs.berkeley.edu/jenkins/computer/ On Tue, Oct 20, 2015 at 3:39 PM, shane knapp wrote: > here's the related stack trace from dmesg... UID 500 is jenkins. > > Out of memory: Kill process 142764 (java) score 40 or sacrifice child > Killed process 142764, UID 500, (java) total-vm:24685036kB, > anon-rss:5730824kB, file-rss:64kB > Uhhuh. NMI received for unknown reason 21 on CPU 0. > Do you have a strange power saving mode enabled? > Dazed and confused, but trying to continue > java: page allocation failure. order:2, mode:0xd0 > Pid: 142764, comm: java Not tainted 2.6.32-573.3.1.el6.x86_64 #1 > Call Trace: > [] ? __alloc_pages_nodemask+0x7dc/0x950 > [] ? copy_process+0x168/0x1530 > [] ? do_fork+0x96/0x4c0 > [] ? sys_futex+0x7b/0x170 > [] ? sys_clone+0x28/0x30 > [] ? stub_clone+0x13/0x20 > [] ? system_call_fastpath+0x16/0x1b > > On Tue, Oct 20, 2015 at 3:35 PM, shane knapp wrote: >> -06 just kinda came back... >> >> [root@amp-jenkins-worker-06 ~]# uptime >> 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, >> 1635.89 >> >> the builds that, from looking at the process table, seem to be at >> fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly >> a Spark-Master-SBT matrix build. look at the build history here: >> https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds >> >> the load is dropping significantly and quickly, but swap is borked and >> i'm still going to reboot. >> >> On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: >>> starting this saturday (oct 17) we started getting alerts on the >>> jenkins workers that various processes were dying (specifically ssh). >>> >>> since then, we've had half of our workers OOM due to java processes >>> and have had now to reboot two of them (-05 and -06). 
>>> >>> if we look at the current machine that's wedged (amp-jenkins-worker-06), we >>> see: >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ >>> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ >>> >>> have there been any changes to any of these builds that might have >>> caused this? anyone have any ideas? >>> >>> sadly, even though i saw that -06 was about to OOM and got a shell >>> opened before SSH died, my command prompt is completely unresponsive. >>> :( >>> >>> shane
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
here's the related stack trace from dmesg... UID 500 is jenkins. Out of memory: Kill process 142764 (java) score 40 or sacrifice child Killed process 142764, UID 500, (java) total-vm:24685036kB, anon-rss:5730824kB, file-rss:64kB Uhhuh. NMI received for unknown reason 21 on CPU 0. Do you have a strange power saving mode enabled? Dazed and confused, but trying to continue java: page allocation failure. order:2, mode:0xd0 Pid: 142764, comm: java Not tainted 2.6.32-573.3.1.el6.x86_64 #1 Call Trace: [] ? __alloc_pages_nodemask+0x7dc/0x950 [] ? copy_process+0x168/0x1530 [] ? do_fork+0x96/0x4c0 [] ? sys_futex+0x7b/0x170 [] ? sys_clone+0x28/0x30 [] ? stub_clone+0x13/0x20 [] ? system_call_fastpath+0x16/0x1b On Tue, Oct 20, 2015 at 3:35 PM, shane knapp wrote: > -06 just kinda came back... > > [root@amp-jenkins-worker-06 ~]# uptime > 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, > 1635.89 > > the builds that, from looking at the process table, seem to be at > fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly > a Spark-Master-SBT matrix build. look at the build history here: > https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds > > the load is dropping significantly and quickly, but swap is borked and > i'm still going to reboot. > > On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: >> starting this saturday (oct 17) we started getting alerts on the >> jenkins workers that various processes were dying (specifically ssh). >> >> since then, we've had half of our workers OOM due to java processes >> and have had now to reboot two of them (-05 and -06). 
>> >> if we look at the current machine that's wedged (amp-jenkins-worker-06), we >> see: >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ >> https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ >> >> have there been any changes to any of these builds that might have >> caused this? anyone have any ideas? >> >> sadly, even though i saw that -06 was about to OOM and got a shell >> opened before SSH died, my command prompt is completely unresponsive. >> :( >> >> shane
Re: BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
-06 just kinda came back... [root@amp-jenkins-worker-06 ~]# uptime 15:29:07 up 26 days, 7:34, 2 users, load average: 1137.91, 1485.69, 1635.89 the builds that, from looking at the process table, seem to be at fault are the Spark-Master-Maven-pre-yarn matrix builds, and possibly a Spark-Master-SBT matrix build. look at the build history here: https://amplab.cs.berkeley.edu/jenkins/computer/amp-jenkins-worker-06/builds the load is dropping significantly and quickly, but swap is borked and i'm still going to reboot. On Tue, Oct 20, 2015 at 3:24 PM, shane knapp wrote: > starting this saturday (oct 17) we started getting alerts on the > jenkins workers that various processes were dying (specifically ssh). > > since then, we've had half of our workers OOM due to java processes > and have had now to reboot two of them (-05 and -06). > > if we look at the current machine that's wedged (amp-jenkins-worker-06), we > see: > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ > https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ > > have there been any changes to any of these builds that might have > caused this? anyone have any ideas? > > sadly, even though i saw that -06 was about to OOM and got a shell > opened before SSH died, my command prompt is completely unresponsive. > :( > > shane
BUILD SYSTEM: builds are OOMing the jenkins workers, investigating. also need to reboot amp-jenkins-worker-06
starting this saturday (oct 17) we started getting alerts on the jenkins workers that various processes were dying (specifically ssh). since then, we've had half of our workers OOM due to java processes and have had now to reboot two of them (-05 and -06). if we look at the current machine that's wedged (amp-jenkins-worker-06), we see: https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=spark-test/3814/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=2.0.0-mr1-cdh4.1.2,label=spark-test/4508/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/HADOOP_VERSION=1.2.1,label=spark-test/4508/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-with-YARN/3868/ https://amplab.cs.berkeley.edu/jenkins/job/Spark-Compile-Master-Maven-with-YARN/4510/ have there been any changes to any of these builds that might have caused this? anyone have any ideas? sadly, even though i saw that -06 was about to OOM and got a shell opened before SSH died, my command prompt is completely unresponsive. :( shane
Re: If you use Spark 1.5 and disabled Tungsten mode ...
Hi Reynold, Yes, I'm using 1.5.1. I see them quite often. Sometimes it recovers but sometimes it does not. For one particular job, it failed all the time with the acquire-memory issue. I'm using spark on mesos with fine grained mode. Does it make a difference? Best Regards, Jerry On Tue, Oct 20, 2015 at 5:27 PM, Reynold Xin wrote: > Jerry - I think that's been fixed in 1.5.1. Do you still see it? > > On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam wrote: > >> I disabled it because of the "Could not acquire 65536 bytes of memory". >> It happens to fail the job. So for now, I'm not touching it. >> >> On Tue, Oct 20, 2015 at 4:48 PM, charmee wrote: >> >>> We had disabled tungsten after we found few performance issues, but had >>> to >>> enable it back because we found that when we had large number of group by >>> fields, if tungsten is disabled the shuffle keeps failing. >>> >>> Here is an excerpt from one of our engineers with his analysis. >>> >>> With Tungsten Enabled (default in spark 1.5): >>> ~90 files of 0.5G each: >>> >>> Ingest (after applying broadcast lookups) : 54 min >>> Aggregation (~30 fields in group by and another 40 in aggregation) : 18 >>> min >>> >>> With Tungsten Disabled: >>> >>> Ingest : 30 min >>> Aggregation : Erroring out >>> >>> On smaller tests we found that joins are slow with tungsten enabled. With >>> GROUP BY, disabling tungsten is not working in the first place. >>> >>> Hope this helps. >>> >>> -Charmee >>> >>> >>> >>> -- >>> View this message in context: >>> http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html >>> Sent from the Apache Spark Developers List mailing list archive at >>> Nabble.com. >>> >>> - >>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >>> For additional commands, e-mail: dev-h...@spark.apache.org >>> >>> >> >
Re: If you use Spark 1.5 and disabled Tungsten mode ...
Jerry - I think that's been fixed in 1.5.1. Do you still see it? On Tue, Oct 20, 2015 at 2:11 PM, Jerry Lam wrote: > I disabled it because of the "Could not acquire 65536 bytes of memory". It > happens to fail the job. So for now, I'm not touching it. > > On Tue, Oct 20, 2015 at 4:48 PM, charmee wrote: > >> We had disabled tungsten after we found few performance issues, but had to >> enable it back because we found that when we had large number of group by >> fields, if tungsten is disabled the shuffle keeps failing. >> >> Here is an excerpt from one of our engineers with his analysis. >> >> With Tungsten Enabled (default in spark 1.5): >> ~90 files of 0.5G each: >> >> Ingest (after applying broadcast lookups) : 54 min >> Aggregation (~30 fields in group by and another 40 in aggregation) : 18 >> min >> >> With Tungsten Disabled: >> >> Ingest : 30 min >> Aggregation : Erroring out >> >> On smaller tests we found that joins are slow with tungsten enabled. With >> GROUP BY, disabling tungsten is not working in the first place. >> >> Hope this helps. >> >> -Charmee >> >> >> >> -- >> View this message in context: >> http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html >> Sent from the Apache Spark Developers List mailing list archive at >> Nabble.com. >> >> - >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org >> For additional commands, e-mail: dev-h...@spark.apache.org >> >> >
Re: If you use Spark 1.5 and disabled Tungsten mode ...
I disabled it because of the "Could not acquire 65536 bytes of memory". It happens to fail the job. So for now, I'm not touching it. On Tue, Oct 20, 2015 at 4:48 PM, charmee wrote: > We had disabled tungsten after we found few performance issues, but had to > enable it back because we found that when we had large number of group by > fields, if tungsten is disabled the shuffle keeps failing. > > Here is an excerpt from one of our engineers with his analysis. > > With Tungsten Enabled (default in spark 1.5): > ~90 files of 0.5G each: > > Ingest (after applying broadcast lookups) : 54 min > Aggregation (~30 fields in group by and another 40 in aggregation) : 18 min > > With Tungsten Disabled: > > Ingest : 30 min > Aggregation : Erroring out > > On smaller tests we found that joins are slow with tungsten enabled. With > GROUP BY, disabling tungsten is not working in the first place. > > Hope this helps. > > -Charmee > > > > -- > View this message in context: > http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html > Sent from the Apache Spark Developers List mailing list archive at > Nabble.com. > > - > To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org > For additional commands, e-mail: dev-h...@spark.apache.org > >
Re: If you use Spark 1.5 and disabled Tungsten mode ...
We had disabled tungsten after we found a few performance issues, but had to enable it back because we found that when we had a large number of group by fields, if tungsten is disabled the shuffle keeps failing. Here is an excerpt from one of our engineers with his analysis. With Tungsten Enabled (default in spark 1.5): ~90 files of 0.5G each: Ingest (after applying broadcast lookups) : 54 min Aggregation (~30 fields in group by and another 40 in aggregation) : 18 min With Tungsten Disabled: Ingest : 30 min Aggregation : Erroring out On smaller tests we found that joins are slow with tungsten enabled. With GROUP BY, disabling tungsten is not working in the first place. Hope this helps. -Charmee -- View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/If-you-use-Spark-1-5-and-disabled-Tungsten-mode-tp14604p14711.html Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
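For anyone reproducing the comparison above, a minimal sketch of the toggle being discussed, assuming the Spark 1.5 SQL conf key:

```scala
import org.apache.spark.SparkConf

// Sketch: disable Tungsten in Spark 1.5 (it is enabled by default).
// Note the thread above reports GROUP BY shuffles failing with it disabled,
// so this is for benchmarking, not a recommended production setting.
val conf = new SparkConf()
  .set("spark.sql.tungsten.enabled", "false")
```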
Re: Spark driver reducing total executors count even when Dynamic Allocation is disabled.
Hi Prakhar, Now I understand your problem: you expected that the executor killed by the heartbeat mechanism would be launched again, but it is not. I think this problem is fixed in Spark 1.5; you could check this JIRA: https://issues.apache.org/jira/browse/SPARK-8119 Thanks Saisai On Tuesday, October 20, 2015, prakhar jauhari wrote: > Thanks sai for the input, > > So the problem is : i start my job with some fixed number of executors, > but when a host running my executors goes unreachable, driver reduces the > total number of executors. And never increases it. > > I have a repro for the issue, attaching logs: > Running spark job is configured for 2 executors, dynamic allocation > not enabled !!! > > AM starts requesting the 2 executors: > 15/10/19 12:25:58 INFO yarn.YarnRMClient: Registering the ApplicationMaster > 15/10/19 12:25:59 INFO yarn.YarnAllocator: Will request 2 executor > containers, each with 1 cores and 1408 MB memory including 384 MB overhead > 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, > capability: ) > 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, > capability: ) > 15/10/19 12:25:59 INFO yarn.ApplicationMaster: Started progress reporter > thread - sleep time : 5000 > > Executors launched: > 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : > DN-2:58739 > 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : > DN-1:44591 > 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container > container_1444841612643_0014_01_02 for on host DN-2 > 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. > driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, > executorHostname: DN-2 > 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container > container_1444841612643_0014_01_03 for on host DN-1 > 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. 
> driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, > executorHostname: DN-1 > > Now my AM and executor 1 are running on DN-2, DN-1 has executor 2 running > on it. To reproduce this issue I removed IP from DN-1, until it was timed > out by spark. > 15/10/19 13:03:30 INFO yarn.YarnAllocator: Driver requested a total number > of 1 executor(s). > 15/10/19 13:03:30 INFO yarn.ApplicationMaster: Driver requested to kill > executor(s) 2. > > > So the driver has reduced the total number of executor to : 1 > And now even when the DN comes up and rejoins the cluster, this count is > not increased. > If I had executor 1 running on a separate DN (not the same as AM's DN), > and that DN went unreachable, driver would reduce total number of executor > to : 0 and the job hangs forever. And this is when i have not enabled > Dynamic allocation. My cluster has other DN's available, AM should request > the killed executors from yarn, and get it on some other DN's. > > Regards, > Prakhar > > > On Mon, Oct 19, 2015 at 2:47 PM, Saisai Shao > wrote: > >> This is a deliberate killing request by heartbeat mechanism, have nothing >> to do with dynamic allocation. Here because you're running on yarn mode, so >> "supportDynamicAllocation" will be true, but actually there's no >> relation to dynamic allocation. >> >> From my understanding "doRequestTotalExecutors" is to sync the current >> total executor number with AM, AM will try to cancel some pending container >> requests when current expected executor number is less. The actual >> container killing command is issued by "doRequestTotalExecutors". >> >> Not sure what is your actual problem? is it unexpected? >> >> Thanks >> Saisai >> >> >> On Mon, Oct 19, 2015 at 3:51 PM, prakhar jauhari > > wrote: >> >>> Hey all, >>> >>> Thanks in advance. I ran into a situation where spark driver reduced the >>> total executors count for my job even with dynamic allocation disabled, >>> and >>> caused the job to hang for ever. 
>>> >>> Setup: >>> Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. >>> All servers in cluster running Linux version 2.6.32. >>> Job in yarn-client mode. >>> >>> Scenario: >>> 1. Application running with required number of executors. >>> 2. One of the DN's losses connectivity and is timed out. >>> 2. Spark issues a killExecutor for the executor on the DN which was timed >>> out. >>> 3. Even with dynamic allocation off, spark's driver reduces the >>> "targetNumExecutors". >>> >>> On analysing the code (Spark 1.3.1): >>> >>> When my DN goes unreachable: >>> >>> Spark core's HeartbeatReceiver invokes expireDeadHosts(): which checks if >>> Dynamic Allocation is supported and then invokes "sc.killExecutor()" >>> >>> /if (sc.supportDynamicAllocation) { >>> sc.killExecutor(executorId) >>> }/ >>> >>> Surprisingly supportDynamicAllocation in sparkContext.scala is defined >>> as, >>> resulting "True" if dynamicAllocationTesting flag is enabled or spark is >>> running over "yarn". >>> >>> /private[spark] def supportDynamicAl
Re: Spark driver reducing total executors count even when Dynamic Allocation is disabled.
Thanks sai for the input, So the problem is : i start my job with some fixed number of executors, but when a host running my executors goes unreachable, driver reduces the total number of executors. And never increases it. I have a repro for the issue, attaching logs: Running spark job is configured for 2 executors, dynamic allocation not enabled !!! AM starts requesting the 2 executors: 15/10/19 12:25:58 INFO yarn.YarnRMClient: Registering the ApplicationMaster 15/10/19 12:25:59 INFO yarn.YarnAllocator: Will request 2 executor containers, each with 1 cores and 1408 MB memory including 384 MB overhead 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, capability: ) 15/10/19 12:25:59 INFO yarn.YarnAllocator: Container request (host: Any, capability: ) 15/10/19 12:25:59 INFO yarn.ApplicationMaster: Started progress reporter thread - sleep time : 5000 Executors launched: 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : DN-2:58739 15/10/19 12:26:04 INFO impl.AMRMClientImpl: Received new token for : DN-1:44591 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container container_1444841612643_0014_01_02 for on host DN-2 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, executorHostname: DN-2 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching container container_1444841612643_0014_01_03 for on host DN-1 15/10/19 12:26:04 INFO yarn.YarnAllocator: Launching ExecutorRunnable. driverUrl: akka.tcp://sparkDriver@NN-1:35115/user/CoarseGrainedScheduler, executorHostname: DN-1 Now my AM and executor 1 are running on DN-2, DN-1 has executor 2 running on it. To reproduce this issue I removed IP from DN-1, until it was timed out by spark. 15/10/19 13:03:30 INFO yarn.YarnAllocator: Driver requested a total number of 1 executor(s). 15/10/19 13:03:30 INFO yarn.ApplicationMaster: Driver requested to kill executor(s) 2. 
So the driver has reduced the total number of executors to : 1 And now even when the DN comes up and rejoins the cluster, this count is not increased. If I had executor 1 running on a separate DN (not the same as AM's DN), and that DN went unreachable, driver would reduce total number of executors to : 0 and the job hangs forever. And this is when i have not enabled Dynamic allocation. My cluster has other DN's available, AM should request the killed executors from yarn, and get it on some other DN's. Regards, Prakhar On Mon, Oct 19, 2015 at 2:47 PM, Saisai Shao wrote: > This is a deliberate killing request by heartbeat mechanism, have nothing > to do with dynamic allocation. Here because you're running on yarn mode, so > "supportDynamicAllocation" will be true, but actually there's no relation > to dynamic allocation. > > From my understanding "doRequestTotalExecutors" is to sync the current > total executor number with AM, AM will try to cancel some pending container > requests when current expected executor number is less. The actual > container killing command is issued by "doRequestTotalExecutors". > > Not sure what is your actual problem? is it unexpected? > > Thanks > Saisai > > > On Mon, Oct 19, 2015 at 3:51 PM, prakhar jauhari > wrote: > >> Hey all, >> >> Thanks in advance. I ran into a situation where spark driver reduced the >> total executors count for my job even with dynamic allocation disabled, >> and >> caused the job to hang forever. >> >> Setup: >> Spark-1.3.1 on hadoop-yarn-2.4.0 cluster. >> All servers in cluster running Linux version 2.6.32. >> Job in yarn-client mode. >> >> Scenario: >> 1. Application running with required number of executors. >> 2. One of the DNs loses connectivity and is timed out. >> 3. Spark issues a killExecutor for the executor on the DN which was timed >> out. >> 4. Even with dynamic allocation off, spark's driver reduces the >> "targetNumExecutors". 
>>
>> On analysing the code (Spark 1.3.1):
>>
>> When my DN goes unreachable, Spark core's HeartbeatReceiver invokes
>> expireDeadHosts(), which checks whether dynamic allocation is supported
>> and then invokes sc.killExecutor():
>>
>> if (sc.supportDynamicAllocation) {
>>   sc.killExecutor(executorId)
>> }
>>
>> Surprisingly, supportDynamicAllocation in SparkContext.scala is defined
>> to return true if the dynamicAllocationTesting flag is enabled or Spark
>> is running over "yarn":
>>
>> private[spark] def supportDynamicAllocation =
>>   master.contains("yarn") || dynamicAllocationTesting
>>
>> sc.killExecutor() dispatches to the configured schedulerBackend
>> (CoarseGrainedSchedulerBackend in this case) and invokes
>> killExecutors(executorIds).
>>
>> CoarseGrainedSchedulerBackend calculates a "newTotal" for the total
>> number of executors required and sends an update to the application
>> master by invoking doRequestTotalExecutors(newTotal).
>>
>> CoarseGrainedSchedulerBackend then invokes a >
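The guard quoted above can be exercised in isolation. This standalone sketch lifts just that predicate out of the quoted Spark 1.3.1 snippet (the parameterization is mine, for testability) to show that on any YARN master string it passes even when dynamic allocation is disabled:

```scala
// The predicate quoted from SparkContext.scala (Spark 1.3.1), lifted out
// as a standalone function so it can be tested without a SparkContext.
def supportDynamicAllocation(master: String, dynamicAllocationTesting: Boolean): Boolean =
  master.contains("yarn") || dynamicAllocationTesting

// On YARN the guard is true even with dynamic allocation disabled,
// so expireDeadHosts() proceeds to sc.killExecutor().
assert(supportDynamicAllocation("yarn-client", dynamicAllocationTesting = false))
assert(supportDynamicAllocation("yarn-cluster", dynamicAllocationTesting = false))
// On a standalone master, the dead-host path would not kill the executor.
assert(!supportDynamicAllocation("spark://master:7077", dynamicAllocationTesting = false))
```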
Re: MapStatus too large for driver
In our case, we are dealing with 20 TB of text data that is split into about 200k map tasks and 200k reduce tasks, and our driver's memory is 15 GB.

--
View this message in context: http://apache-spark-developers-list.1001551.n3.nabble.com/MapStatus-too-large-for-drvier-tp14704p14707.html
Sent from the Apache Spark Developers List mailing list archive at Nabble.com.

- To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org
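For a sense of scale, here is a rough back-of-envelope estimate (my own, assuming the driver tracks one bit per reduce partition per map task, stored uncompressed) showing why 200k x 200k is dangerous even for a 15 GB heap:

```scala
// Assumption: one bitmap per map task marking which of the reduce-side
// blocks are empty, at one bit per block, held uncompressed on the driver.
val mapTasks    = 200000L
val reduceTasks = 200000L
val bitsPerMapStatus  = reduceTasks
val bytesPerMapStatus = bitsPerMapStatus / 8          // 25,000 bytes each
val totalBytes        = mapTasks * bytesPerMapStatus

assert(bytesPerMapStatus == 25000L)
assert(totalBytes == 5000000000L)                     // ~5 GB of bitmaps alone
```

RoaringBitmap compresses well below this worst case when the empty-block sets are sparse or dense runs, which is why the real figure depends heavily on the data.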
Re: MapStatus too large for driver
How big is your driver heap size? And is there any reason why you'd need 200k map and 200k reduce tasks?

On Mon, Oct 19, 2015 at 11:59 PM, yaoqin wrote:
> Hi everyone,
>
> When I run a spark job that contains quite a lot of tasks (in my case,
> 200,000 * 200,000), the driver hits OOM, mainly caused by the object
> MapStatus.
>
> As shown in the picture below, the RoaringBitmap used to mark which
> blocks are empty seems to use too much memory.
>
> Is there any data structure that can replace RoaringBitmap to fix my
> problem?
>
> Thank you!
>
> Qin.
Ability to offer initial coefficients in ml.LogisticRegression
Hi all,

I noticed that in ml.classification.LogisticRegression, users are not allowed to set initial coefficients, while this is supported in mllib.classification.LogisticRegressionWithSGD. Sometimes we know that specific coefficients are close to the final optimum; e.g., we usually pick yesterday's output model as the initial coefficients, since the data distribution between two days' training samples shouldn't change much. Is there any concern behind not supporting this feature?

--
Yizhi Liu
Senior Software Engineer / Data Mining
www.mvad.com, Shanghai, China
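To make the warm-start motivation concrete, here is a toy, self-contained gradient-descent logistic regression (not the Spark API; every name here is made up for illustration) showing that resuming from a previously fitted coefficient reaches a lower loss than a cold start in the same number of iterations:

```scala
// Toy 1-D logistic regression, illustrating warm starts. Illustrative only.
val xs = Array(-2.0, -1.0, -0.5, 0.5, 1.0, 2.0)
val ys = Array(0.0, 0.0, 0.0, 1.0, 1.0, 1.0)

def sigmoid(z: Double): Double = 1.0 / (1.0 + math.exp(-z))

def loss(w: Double): Double =                       // mean log-loss
  xs.zip(ys).map { case (x, y) =>
    val p = sigmoid(w * x)
    -(y * math.log(p) + (1 - y) * math.log(1 - p))
  }.sum / xs.length

def step(w: Double, lr: Double = 0.5): Double = {   // one gradient-descent step
  val grad = xs.zip(ys).map { case (x, y) => (sigmoid(w * x) - y) * x }.sum / xs.length
  w - lr * grad
}

def descend(w0: Double, iters: Int): Double =
  (1 to iters).foldLeft(w0)((w, _) => step(w))

val coldLoss = loss(descend(0.0, 5))     // 5 iterations from scratch
val warm     = descend(0.0, 50)          // "yesterday's" fitted coefficient
val warmLoss = loss(descend(warm, 5))    // 5 iterations resumed from it
assert(warmLoss < coldLoss)              // warm start is further along
```

In mllib this warm start is what passing initial weights into the training run enables; the ml pipeline API as discussed above offers no equivalent knob.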
MapStatus too large for driver
Hi everyone,

When I run a spark job that contains quite a lot of tasks (in my case, 200,000 * 200,000), the driver hits OOM, mainly caused by the object MapStatus.

As shown in the picture below, the RoaringBitmap used to mark which blocks are empty seems to use too much memory.

Is there any data structure that can replace RoaringBitmap to fix my problem?

Thank you!

Qin.

(attached image not included in the archive)
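One candidate replacement worth measuring (a sketch, not a recommendation; whether it wins depends on how sparse the empty-block sets are) is a dense JDK java.util.BitSet, which costs a flat reduceTasks/8 bytes per MapStatus regardless of how many blocks are empty:

```scala
import java.util.BitSet

// For 200,000 reduce partitions, a dense bitset allocates 3,125 64-bit words.
val reduceTasks = 200000
val empties = new BitSet(reduceTasks)
empties.set(0, reduceTasks / 2)            // mark half the blocks empty (arbitrary)

assert(empties.size() == 200000)           // 3125 words * 64 bits
assert(empties.size() / 8 == 25000)        // 25 KB per MapStatus, sparse or dense
assert(empties.get(0) && !empties.get(reduceTasks - 1))
```

RoaringBitmap beats this flat cost when the set is very sparse or made of long runs, but as noted earlier in the thread, a plain bitset can come close in memory while trading off in speed, so it is worth benchmarking both on real shuffle data.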