Re: Quick one... AWS SDK version?

2017-10-08 Thread Jonathan Kelly
Tushar,

Yes, the hadoop-aws jar installed on an emr-5.8.0 cluster was built with
AWS Java SDK 1.11.160, if that’s what you mean.

~ Jonathan
On Sun, Oct 8, 2017 at 8:42 AM Tushar Sudake <etusha...@gmail.com> wrote:

> Hi Jonathan,
>
> Does that mean Hadoop-AWS 2.7.3 too is built against AWS SDK 1.11.160 and
> not 1.7.4?
>
> Thanks.
>
>
> On Oct 7, 2017 3:50 PM, "Jean Georges Perrin" <j...@jgp.net> wrote:
>
>
> Hey Marco,
>
> I am actually reading from S3 and I use 2.7.3, but I inherited the project
> and they use some AWS API from the Amazon SDK, whose version is like from
> yesterday :) so it’s confusing, and AMZ is changing its versions like crazy, so
> it’s a little difficult to follow. Right now I went back to 2.7.3 and SDK
> 1.7.4...
>
> jg
>
>
> On Oct 7, 2017, at 15:34, Marco Mistroni <mmistr...@gmail.com> wrote:
>
> Hi JG
>  out of curiosity, what's your use case? Are you writing to S3? You could use
> Spark to do that, e.g. using the Hadoop package
> org.apache.hadoop:hadoop-aws:2.7.1 ... that will download the AWS client
> which is in line with Hadoop 2.7.1.
>
> hth
>  marco
>
> On Fri, Oct 6, 2017 at 10:58 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> Note: EMR builds Hadoop, Spark, et al, from source against specific
>> versions of certain packages like the AWS Java SDK, httpclient/core,
>> Jackson, etc., sometimes requiring some patches in these applications in
>> order to work with versions of these dependencies that differ from what the
>> applications may support upstream.
>>
>> For emr-5.8.0, we have built Hadoop and Spark (the Spark Kinesis
>> connector, that is, since that's the only part of Spark that actually
>> depends upon the AWS Java SDK directly) against AWS Java SDK 1.11.160
>> instead of the much older version that vanilla Hadoop 2.7.3 would otherwise
>> depend upon.
>>
>> ~ Jonathan
>>
>> On Wed, Oct 4, 2017 at 7:17 AM Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>> On 3 Oct 2017, at 21:37, JG Perrin <jper...@lumeris.com> wrote:
>>>
>>> Sorry Steve – I may not have been very clear: thinking about
>>> aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled
>>> with Spark.
>>>
>>>
>>>
>>> I know, but if you are talking to s3 via the s3a client, you will need
>>> the SDK version to match the hadoop-aws JAR of the same version of Hadoop
>>> your JARs have. Similarly, if you were using spark-kinesis, it needs to be
>>> in sync there.
>>>
>>>
>>> *From:* Steve Loughran [mailto:ste...@hortonworks.com
>>> <ste...@hortonworks.com>]
>>> *Sent:* Tuesday, October 03, 2017 2:20 PM
>>> *To:* JG Perrin <jper...@lumeris.com>
>>> *Cc:* user@spark.apache.org
>>> *Subject:* Re: Quick one... AWS SDK version?
>>>
>>>
>>>
>>> On 3 Oct 2017, at 02:28, JG Perrin <jper...@lumeris.com> wrote:
>>>
>>> Hey Sparkians,
>>>
>>> What version of AWS Java SDK do you use with Spark 2.2? Do you stick
>>> with the Hadoop 2.7.3 libs?
>>>
>>>
>>> You generally have to stick with the version which Hadoop was built
>>> with, I'm afraid... very brittle dependency.
>>>
>>>
>
>
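
For reference, a minimal sketch of the version pinning discussed in this thread, for a build that uses vanilla Hadoop 2.7.3 outside EMR (the class and JAR names are placeholders; hadoop-aws 2.7.x pairs with AWS Java SDK 1.7.4, as noted above):

spark-submit \
  --packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4 \
  --class com.example.MyApp \
  my-app.jar

On an EMR cluster the installed hadoop-aws already matches the SDK the cluster ships (1.11.160 on emr-5.8.0), so no extra packages should be needed there.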


Re: Quick one... AWS SDK version?

2017-10-06 Thread Jonathan Kelly
Note: EMR builds Hadoop, Spark, et al, from source against specific
versions of certain packages like the AWS Java SDK, httpclient/core,
Jackson, etc., sometimes requiring some patches in these applications in
order to work with versions of these dependencies that differ from what the
applications may support upstream.

For emr-5.8.0, we have built Hadoop and Spark (the Spark Kinesis connector,
that is, since that's the only part of Spark that actually depends upon the
AWS Java SDK directly) against AWS Java SDK 1.11.160 instead of the much
older version that vanilla Hadoop 2.7.3 would otherwise depend upon.

~ Jonathan

On Wed, Oct 4, 2017 at 7:17 AM Steve Loughran 
wrote:

> On 3 Oct 2017, at 21:37, JG Perrin  wrote:
>
> Sorry Steve – I may not have been very clear: thinking about
> aws-java-sdk-z.yy.xxx.jar. To the best of my knowledge, none is bundled
> with Spark.
>
>
>
> I know, but if you are talking to s3 via the s3a client, you will need the
> SDK version to match the hadoop-aws JAR of the same version of Hadoop your
> JARs have. Similarly, if you were using spark-kinesis, it needs to be in
> sync there.
>
>
> *From:* Steve Loughran [mailto:ste...@hortonworks.com
> ]
> *Sent:* Tuesday, October 03, 2017 2:20 PM
> *To:* JG Perrin 
> *Cc:* user@spark.apache.org
> *Subject:* Re: Quick one... AWS SDK version?
>
>
>
> On 3 Oct 2017, at 02:28, JG Perrin  wrote:
>
> Hey Sparkians,
>
> What version of AWS Java SDK do you use with Spark 2.2? Do you stick with
> the Hadoop 2.7.3 libs?
>
>
> You generally have to stick with the version which Hadoop was built
> with, I'm afraid... very brittle dependency.
>
>


Re: RDD blocks on Spark Driver

2017-02-28 Thread Jonathan Kelly
Prithish,

It would be helpful for you to share the spark-submit command you are
running.

~ Jonathan

On Sun, Feb 26, 2017 at 8:29 AM Prithish  wrote:

> Thanks for the responses, I am running this on Amazon EMR which runs the
> Yarn cluster manager.
>
> On Sat, Feb 25, 2017 at 4:45 PM, liangyhg...@gmail.com <
> liangyhg...@gmail.com> wrote:
>
> Hi,
>
>  I think you are using the local mode of Spark. There are mainly four
> modes, which are local, standalone, YARN and Mesos. Also, "blocks" is
> relative to HDFS, and "partitions" is relative to Spark.
>
> liangyihuai
>
> ---Original---
> *From:* "Jacek Laskowski "
> *Date:* 2017/2/25 02:45:20
> *To:* "prithish";
> *Cc:* "user";
> *Subject:* Re: RDD blocks on Spark Driver
>
> Hi,
>
> Guess you're using local mode, which has only one executor, called the driver.
> Is my guess correct?
>
> Jacek
>
> On 23 Feb 2017 2:03 a.m.,  wrote:
>
> Hello,
>
> Had a question. When I look at the executors tab in Spark UI, I notice
> that some RDD blocks are assigned to the driver as well. Can someone please
> tell me why?
>
> Thanks for the help.
>
>
>
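
For reference, a quick way to list the block managers the driver knows about is shown below; the driver itself appears as one entry, which is why it can show up in the executors tab at all. A sketch assuming a spark-shell session:

scala> sc.getExecutorMemoryStatus.keys.foreach(println)
// prints one host:port entry per executor, plus one for the driver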


Re: Custom log4j.properties on AWS EMR

2017-02-28 Thread Jonathan Kelly
Prithish,

I saw you posted this on SO, so I responded there just now. See
http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr/42516161#42516161

In short, an hdfs:// path can't be used to configure log4j because log4j
knows nothing about hdfs. Instead, since you are using EMR, you should use
the Configuration API when creating your cluster to configure the
spark-log4j configuration classification. See
http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-configure-apps.html for
more info.

~ Jonathan
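
For reference, a sketch of the spark-log4j classification described above (the log4j keys shown are ordinary log4j.properties settings, used here only as placeholders):

[
  {
    "classification": "spark-log4j",
    "properties": {
      "log4j.rootCategory": "WARN, console",
      "log4j.logger.org.apache.spark": "INFO"
    }
  }
]

Supplying this through the Configuration API at cluster creation rewrites Spark's log4j configuration on every node, which is the effect the hdfs:// option was trying to achieve.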

On Sun, Feb 26, 2017 at 8:17 PM Prithish  wrote:

> Steve, I tried that, but didn't work. Any other ideas?
>
> On Mon, Feb 27, 2017 at 1:42 AM, Steve Loughran 
> wrote:
>
> try giving a resource of a file in the JAR, e.g add a file
> "log4j-debugging.properties into the jar, and give a config option of
> -Dlog4j.configuration=/log4j-debugging.properties   (maybe also try without
> the "/")
>
>
> On 26 Feb 2017, at 16:31, Prithish  wrote:
>
> Hoping someone can answer this.
>
> I am unable to override and use a Custom log4j.properties on Amazon EMR. I
> am running Spark on EMR (Yarn) and have tried all the below combinations in
> the Spark-Submit to try and use the custom log4j.
>
> In Client mode
> --driver-java-options "-Dlog4j.configuration=
> hdfs://host:port/user/hadoop/log4j.properties"
>
> In Cluster mode
> --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=
> hdfs://host:port/user/hadoop/log4j.properties"
>
> I have also tried picking from local filesystem using file: instead
> of hdfs. None of this seem to work. However, I can get this working when
> running on my local Yarn setup.
>
> Any ideas?
>
> I have also posted on Stackoverflow (link below)
>
> http://stackoverflow.com/questions/42452622/custom-log4j-properties-on-aws-emr
>
>
>
>


Re: [Erorr:]vieiwng Web UI on EMR cluster

2016-09-13 Thread Jonathan Kelly
Yes, Spark on EMR runs on YARN, so there is only a Spark UI when a Spark
app is running. To expand on what Natu says, the best way to view the Spark
UI for both running and completed Spark apps is to start from the YARN
ResourceManager UI (port 8088) and to click the "Application Master" link
(for running apps) or "History" link (for completed apps).

~ Jonathan

On Tue, Sep 13, 2016 at 2:30 AM Natu Lauchande <nlaucha...@gmail.com> wrote:

> Hi,
>
> I think the Spark UI will be accessible whenever you launch a Spark app in
> the cluster; it should be the Application Tracker link.
>
>
> Regards,
> Natu
>
> On Tue, Sep 13, 2016 at 9:37 AM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Hi ,
>> Thank you all..
>> Hurray ... I am able to view the Hadoop web UI now @ 8088, and even the Spark
>> History Server web UI @ 18080.
>> But I am unable to figure out the Spark UI web port ...
>> Tried with 4044, 4040 ...
>> getting below error
>> This site can’t be reached
>> How can I find out the Spark port ?
>>
>> Would really appreciate the help.
>>
>> Thanks,
>> Divya
>>
>>
>> On 13 September 2016 at 15:09, Divya Gehlot <divya.htco...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> Thanks all for your prompt response.
>>> I followed the instruction in the docs EMR SSH tunnel
>>> <https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-ssh-tunnel.html>
>>> shared by Jonathan.
>>> I am on a Mac and set up FoxyProxy in my Chrome browser.
>>>
>>> Divyas-MacBook-Pro:.ssh divyag$ ssh  -N -D 8157
>>> had...@ec2-xx-xxx-xxx-xx.ap-southeast-1.compute.amazonaws.com
>>>
>>> channel 3: open failed: connect failed: Connection refused
>>>
>>> channel 3: open failed: connect failed: Connection refused
>>>
>>> channel 4: open failed: connect failed: Connection refused
>>>
>>> channel 3: open failed: connect failed: Connection refused
>>>
>>> channel 4: open failed: connect failed: Connection refused
>>>
>>> channel 3: open failed: connect failed: Connection refused
>>>
>>> channel 3: open failed: connect failed: Connection refused
>>>
>>> channel 4: open failed: connect failed: Connection refused
>>>
>>> channel 5: open failed: connect failed: Connection refused
>>>
>>> channel 22: open failed: connect failed: Connection refused
>>>
>>> channel 23: open failed: connect failed: Connection refused
>>>
>>> channel 22: open failed: connect failed: Connection refused
>>>
>>> channel 23: open failed: connect failed: Connection refused
>>>
>>> channel 22: open failed: connect failed: Connection refused
>>>
>>> channel 8: open failed: administratively prohibited: open failed
>>>
>>>
>>> What am I missing now ?
>>>
>>>
>>> Thanks,
>>>
>>> Divya
>>>
>>> On 13 September 2016 at 14:23, Jonathan Kelly <jonathaka...@gmail.com>
>>> wrote:
>>>
>>>> I would not recommend opening port 50070 on your cluster, as that would
>>>> give the entire world access to your data on HDFS. Instead, you should
>>>> follow the instructions found here to create a secure tunnel to the
>>>> cluster, through which you can proxy requests to the UIs using a browser
>>>> plugin like FoxyProxy:
>>>> https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-ssh-tunnel.html
>>>>
>>>> ~ Jonathan
>>>>
>>>> On Mon, Sep 12, 2016 at 10:40 PM Mohammad Tariq <donta...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Divya,
>>>>>
>>>>> Do you have inbound rules enabled on port 50070 of your NN machine?
>>>>> Also, it's a good idea to have the public DNS in your /etc/hosts for
>>>>> proper name resolution.
>>>>>
>>>>>
>>>>> Tariq, Mohammad
>>>>> about.me/mti
>>>>>
>>>>> On Tue, Sep 13, 2016 at 9:28 AM, Divya Gehlot <divya.htco...@gmail.com
>>>>> > wrote:
>>>>>
>>>>>> Hi,
>>>>>> I am on EMR 4.7 with Spark 1.6.1   and Hadoop 2.7.2
>>>>>> When I am trying to view any of the web UIs of the cluster, either
>>>>>> Hadoop or Spark, I am getting the below error
>>>>>> "
>>>>>> This site can’t be reached
>>>>>>
>>>>>> "
>>>>>> Has anybody using EMR been able to view the web UI?
>>>>>> Could you please share the steps.
>>>>>>
>>>>>> Would really appreciate the help.
>>>>>>
>>>>>> Thanks,
>>>>>> Divya
>>>>>>
>>>>>
>>>>>
>>>
>>
>


Re: [Erorr:]vieiwng Web UI on EMR cluster

2016-09-13 Thread Jonathan Kelly
I would not recommend opening port 50070 on your cluster, as that would
give the entire world access to your data on HDFS. Instead, you should
follow the instructions found here to create a secure tunnel to the
cluster, through which you can proxy requests to the UIs using a browser
plugin like FoxyProxy:
https://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-ssh-tunnel.html

~ Jonathan

On Mon, Sep 12, 2016 at 10:40 PM Mohammad Tariq  wrote:

> Hi Divya,
>
> Do you have inbound rules enabled on port 50070 of your NN machine? Also,
> it's a good idea to have the public DNS in your /etc/hosts for proper name
> resolution.
>
>
> Tariq, Mohammad
> about.me/mti
>
> On Tue, Sep 13, 2016 at 9:28 AM, Divya Gehlot 
> wrote:
>
>> Hi,
>> I am on EMR 4.7 with Spark 1.6.1   and Hadoop 2.7.2
>> When I am trying to view any of the web UIs of the cluster, either Hadoop
>> or Spark, I am getting the below error
>> "
>> This site can’t be reached
>>
>> "
>> Has anybody using EMR been able to view the web UI?
>> Could you please share the steps.
>>
>> Would really appreciate the help.
>>
>> Thanks,
>> Divya
>>
>
>


Re: Unsubscribe - 3rd time

2016-06-29 Thread Jonathan Kelly
If at first you don't succeed, try, try again. But please don't. :)

See the "unsubscribe" link here: http://spark.apache.org/community.html

I'm not sure I've ever come across an email list that allows you to
unsubscribe by responding to the list with "unsubscribe". At least, all of
the Apache ones have a separate address to which you send
subscribe/unsubscribe messages. And yet people try to send "unsubscribe"
messages to the actual list almost every day.

On Wed, Jun 29, 2016 at 9:03 AM Mich Talebzadeh 
wrote:

> LOL. Bravely said Joaquin.
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 29 June 2016 at 16:54, Joaquin Alzola 
> wrote:
>
>> And 3rd time is not enough to know that unsubscribe is done through
>> user-unsubscr...@spark.apache.org
>>
>>
>>
>> *From:* Steve Florence [mailto:sflore...@ypm.com]
>> *Sent:* 29 June 2016 16:47
>> *To:* user@spark.apache.org
>> *Subject:* Unsubscribe - 3rd time
>>
>>
>>
>>
>> This email is confidential and may be subject to privilege. If you are
>> not the intended recipient, please do not copy or disclose its content but
>> contact the sender immediately upon receipt.
>>
>
>


Re: Logging trait in Spark 2.0

2016-06-24 Thread Jonathan Kelly
Ted, how is that thread related to Paolo's question?

On Fri, Jun 24, 2016 at 1:50 PM Ted Yu  wrote:

> See this related thread:
>
>
> http://search-hadoop.com/m/q3RTtEor1vYWbsW=RE+Configuring+Log4J+Spark+1+5+on+EMR+4+1+
>
> On Fri, Jun 24, 2016 at 6:07 AM, Paolo Patierno 
> wrote:
>
>> Hi,
>>
>> While developing a Spark Streaming custom receiver, I noticed that the Logging
>> trait isn't accessible anymore in Spark 2.0.
>>
>> trait Logging in package internal cannot be accessed in package
>> org.apache.spark.internal
>>
>> For developing a custom receiver, what is the preferred way to do logging?
>> Just using a log4j dependency as in any other Java/Scala library/application?
>>
>> Thanks,
>> Paolo
>>
>> *Paolo Patierno*
>>
>> *Senior Software Engineer (IoT) @ Red Hat*
>> *Microsoft MVP on Windows Embedded & IoT*
>> *Microsoft Azure Advisor*
>>
>> Twitter : @ppatierno 
>> Linkedin : paolopatierno 
>> Blog : DevExperience 
>>
>
>
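
For what it's worth, a minimal sketch of the plain-log4j/SLF4J route Paolo mentions; this is only one option, not an official Spark recommendation, and the class name and messages are made up:

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import org.slf4j.LoggerFactory

class MyReceiver extends Receiver[String](StorageLevel.MEMORY_ONLY) {
  // SLF4J logger; @transient + lazy so the receiver stays serializable
  @transient private lazy val log = LoggerFactory.getLogger(getClass)

  override def onStart(): Unit = {
    log.info("Starting custom receiver")
    // start the threads that call store(...) here
  }

  override def onStop(): Unit = log.info("Stopping custom receiver")
}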


Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
OK, JIRA created: https://issues.apache.org/jira/browse/SPARK-16080

Also, after looking at the code a bit I think I see the reason. If I'm
correct, it may actually be a very easy fix.

On Mon, Jun 20, 2016 at 1:21 PM Marcelo Vanzin <van...@cloudera.com> wrote:

> It doesn't hurt to have a bug tracking it, in case anyone else has
> time to look at it before I do.
>
> On Mon, Jun 20, 2016 at 1:20 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
> > Thanks for the confirmation! Shall I cut a JIRA issue?
> >
> > On Mon, Jun 20, 2016 at 10:42 AM Marcelo Vanzin <van...@cloudera.com>
> wrote:
> >>
> >> I just tried this locally and can see the wrong behavior you mention.
> >> I'm running a somewhat old build of 2.0, but I'll take a look.
> >>
> >> On Mon, Jun 20, 2016 at 7:04 AM, Jonathan Kelly <jonathaka...@gmail.com
> >
> >> wrote:
> >> > Does anybody have any thoughts on this?
> >> >
> >> > On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly <
> jonathaka...@gmail.com>
> >> > wrote:
> >> >>
> >> >> I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT (commit
> >> >> bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
> >> >> log4j.properties is
> >> >> not getting picked up in the executor classpath (and driver classpath
> >> >> for
> >> >> yarn-cluster mode), so Hadoop's log4j.properties file is taking
> >> >> precedence
> >> >> in the YARN containers.
> >> >>
> >> >> Spark's log4j.properties file is correctly being bundled into the
> >> >> __spark_conf__.zip file and getting added to the DistributedCache,
> but
> >> >> it is
> >> >> not in the classpath of the executor, as evidenced by the following
> >> >> command,
> >> >> which I ran in spark-shell:
> >> >>
> >> >> scala> sc.parallelize(Seq(1)).map(_ =>
> >> >> getClass().getResource("/log4j.properties")).first
> >> >> res3: java.net.URL = file:/etc/hadoop/conf.empty/log4j.properties
> >> >>
> >> >> I then ran the following in spark-shell to verify the classpath of
> the
> >> >> executors:
> >> >>
> >> >> scala> sc.parallelize(Seq(1)).map(_ =>
> >> >> System.getProperty("java.class.path")).flatMap(_.split(':')).filter(e
> >> >> =>
> >> >> !e.endsWith(".jar") && !e.endsWith("*")).collect.foreach(println)
> >> >> ...
> >> >>
> >> >>
> >> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> >> >>
> >> >>
> >> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03/__spark_conf__
> >> >> /etc/hadoop/conf
> >> >> ...
> >> >>
> >> >> So the JVM has this nonexistent __spark_conf__ directory in the
> >> >> classpath
> >> >> when it should really be __spark_conf__.zip (which is actually a
> >> >> symlink to
> >> >> a directory, despite the .zip filename).
> >> >>
> >> >> % sudo ls -l
> >> >>
> >> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> >> >> total 20
> >> >> -rw-r--r-- 1 yarn yarn   88 Jun 18 01:26 container_tokens
> >> >> -rwx-- 1 yarn yarn  594 Jun 18 01:26
> >> >> default_container_executor_session.sh
> >> >> -rwx-- 1 yarn yarn  648 Jun 18 01:26
> default_container_executor.sh
> >> >> -rwx-- 1 yarn yarn 4419 Jun 18 01:26 launch_container.sh
> >> >> lrwxrwxrwx 1 yarn yarn   59 Jun 18 01:26 __spark_conf__.zip ->
> >> >> /mnt1/yarn/usercache/hadoop/filecache/17/__spark_conf__.zip
> >> >> lrwxrwxrwx 1 yarn yarn   77 Jun 18 01:26 __spark_libs__ ->
> >> >>
> >> >>
> /mnt/yarn/usercache/hadoop/filecache/16/__spark_libs__4490748779530764463.zip
> >> >> drwx--x--- 2 yarn yarn   46 Jun 18 01:26 tmp
> >> >>
> >> >> Does anybody know why this is happening? Is this a bug in Spark, or
> is
> >> >> it
> >> >> the JVM doing this (possibly because the extension is .zip)?
> >> >>
> >> >> Thanks,
> >> >> Jonathan
> >>
> >>
> >>
> >> --
> >> Marcelo
>
>
>
> --
> Marcelo
>


Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
Thanks for the confirmation! Shall I cut a JIRA issue?

On Mon, Jun 20, 2016 at 10:42 AM Marcelo Vanzin <van...@cloudera.com> wrote:

> I just tried this locally and can see the wrong behavior you mention.
> I'm running a somewhat old build of 2.0, but I'll take a look.
>
> On Mon, Jun 20, 2016 at 7:04 AM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
> > Does anybody have any thoughts on this?
> >
> > On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly <jonathaka...@gmail.com>
> > wrote:
> >>
> >> I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT (commit
> >> bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
> log4j.properties is
> >> not getting picked up in the executor classpath (and driver classpath
> for
> >> yarn-cluster mode), so Hadoop's log4j.properties file is taking
> precedence
> >> in the YARN containers.
> >>
> >> Spark's log4j.properties file is correctly being bundled into the
> >> __spark_conf__.zip file and getting added to the DistributedCache, but
> it is
> >> not in the classpath of the executor, as evidenced by the following
> command,
> >> which I ran in spark-shell:
> >>
> >> scala> sc.parallelize(Seq(1)).map(_ =>
> >> getClass().getResource("/log4j.properties")).first
> >> res3: java.net.URL = file:/etc/hadoop/conf.empty/log4j.properties
> >>
> >> I then ran the following in spark-shell to verify the classpath of the
> >> executors:
> >>
> >> scala> sc.parallelize(Seq(1)).map(_ =>
> >> System.getProperty("java.class.path")).flatMap(_.split(':')).filter(e =>
> >> !e.endsWith(".jar") && !e.endsWith("*")).collect.foreach(println)
> >> ...
> >>
> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> >>
> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03/__spark_conf__
> >> /etc/hadoop/conf
> >> ...
> >>
> >> So the JVM has this nonexistent __spark_conf__ directory in the
> classpath
> >> when it should really be __spark_conf__.zip (which is actually a
> symlink to
> >> a directory, despite the .zip filename).
> >>
> >> % sudo ls -l
> >>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> >> total 20
> >> -rw-r--r-- 1 yarn yarn   88 Jun 18 01:26 container_tokens
> >> -rwx-- 1 yarn yarn  594 Jun 18 01:26
> >> default_container_executor_session.sh
> >> -rwx-- 1 yarn yarn  648 Jun 18 01:26 default_container_executor.sh
> >> -rwx-- 1 yarn yarn 4419 Jun 18 01:26 launch_container.sh
> >> lrwxrwxrwx 1 yarn yarn   59 Jun 18 01:26 __spark_conf__.zip ->
> >> /mnt1/yarn/usercache/hadoop/filecache/17/__spark_conf__.zip
> >> lrwxrwxrwx 1 yarn yarn   77 Jun 18 01:26 __spark_libs__ ->
> >>
> /mnt/yarn/usercache/hadoop/filecache/16/__spark_libs__4490748779530764463.zip
> >> drwx--x--- 2 yarn yarn   46 Jun 18 01:26 tmp
> >>
> >> Does anybody know why this is happening? Is this a bug in Spark, or is
> it
> >> the JVM doing this (possibly because the extension is .zip)?
> >>
> >> Thanks,
> >> Jonathan
>
>
>
> --
> Marcelo
>


Re: Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-20 Thread Jonathan Kelly
Does anybody have any thoughts on this?
On Fri, Jun 17, 2016 at 6:36 PM Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT
> (commit bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
> log4j.properties is not getting picked up in the executor classpath (and
> driver classpath for yarn-cluster mode), so Hadoop's log4j.properties file
> is taking precedence in the YARN containers.
>
> Spark's log4j.properties file is correctly being bundled into the
> __spark_conf__.zip file and getting added to the DistributedCache, but it
> is not in the classpath of the executor, as evidenced by the following
> command, which I ran in spark-shell:
>
> scala> sc.parallelize(Seq(1)).map(_ =>
> getClass().getResource("/log4j.properties")).first
> res3: java.net.URL = file:/etc/hadoop/conf.empty/log4j.properties
>
> I then ran the following in spark-shell to verify the classpath of the
> executors:
>
> scala> sc.parallelize(Seq(1)).map(_ =>
> System.getProperty("java.class.path")).flatMap(_.split(':')).filter(e =>
> !e.endsWith(".jar") && !e.endsWith("*")).collect.foreach(println)
> ...
>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
>
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03/__spark_conf__
> /etc/hadoop/conf
> ...
>
> So the JVM has this nonexistent __spark_conf__ directory in the classpath
> when it should really be __spark_conf__.zip (which is actually a symlink
> to a directory, despite the .zip filename).
>
> % sudo ls -l
> /mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
> total 20
> -rw-r--r-- 1 yarn yarn   88 Jun 18 01:26 container_tokens
> -rwx-- 1 yarn yarn  594 Jun 18 01:26
> default_container_executor_session.sh
> -rwx-- 1 yarn yarn  648 Jun 18 01:26 default_container_executor.sh
> -rwx-- 1 yarn yarn 4419 Jun 18 01:26 launch_container.sh
> lrwxrwxrwx 1 yarn yarn   59 Jun 18 01:26 __spark_conf__.zip ->
> /mnt1/yarn/usercache/hadoop/filecache/17/__spark_conf__.zip
> lrwxrwxrwx 1 yarn yarn   77 Jun 18 01:26 __spark_libs__ ->
> /mnt/yarn/usercache/hadoop/filecache/16/__spark_libs__4490748779530764463.zip
> drwx--x--- 2 yarn yarn   46 Jun 18 01:26 tmp
>
> Does anybody know why this is happening? Is this a bug in Spark, or is it
> the JVM doing this (possibly because the extension is .zip)?
>
> Thanks,
> Jonathan
>


Re: Running Spark in local mode

2016-06-19 Thread Jonathan Kelly
Mich, what Jacek is saying is not that you implied that YARN relies on two
masters. He's just clarifying that yarn-client and yarn-cluster modes are
really both using the same (type of) master (simply "yarn"). In fact, if
you specify "--master yarn-client" or "--master yarn-cluster", spark-submit
will translate that into using a master URL of "yarn" and a deploy-mode of
"client" or "cluster".

And thanks, Jacek, for the tips on the "less-common master URLs". I had no
idea that was an option!

~ Jonathan
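
For example, the translation described above means these invocations are equivalent (the class and JAR names are placeholders):

spark-submit --master yarn-client --class com.example.MyApp my-app.jar
spark-submit --master yarn --deploy-mode client --class com.example.MyApp my-app.jar

and cluster mode is simply:

spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar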

On Sun, Jun 19, 2016 at 4:13 AM Mich Talebzadeh 
wrote:

> Good points but I am an experimentalist
>
> In Local mode I have this
>
> In local mode with:
>
> --master local
>
>
>
> This will start with one thread, equivalent to –master local[1]. You can
> also start with more than one thread by specifying the number of threads *k*
> in –master local[k]. You can also start using all available threads with
> –master local[*], which in my case would be local[12].
>
> The important thing about local mode is that the number of JVMs spawned is
> controlled by you, and you can start as many spark-submit processes as you
> wish within the constraints of what you have.
>
> ${SPARK_HOME}/bin/spark-submit \
>
> --packages com.databricks:spark-csv_2.11:1.3.0 \
>
> --driver-memory 2G \
>
> --num-executors 1 \
>
> --executor-memory 2G \
>
> --master local \
>
> --executor-cores 2 \
>
> --conf "spark.scheduler.mode=FIFO" \
>
> --conf
> "spark.executor.extraJavaOptions=-XX:+PrintGCDetails
> -XX:+PrintGCTimeStamps" \
>
> --jars
> /home/hduser/jars/spark-streaming-kafka-assembly_2.10-1.6.1.jar \
>
> --class "${FILE_NAME}" \
>
> --conf "spark.ui.port=4040” \
>
> ${JAR_FILE} \
>
> >> ${LOG_FILE}
>
> Now that does work fine, although some of those parameters are implicit
> (for example scheduler.mode = FIFO or FAIR), and I can start different Spark
> jobs in local mode. Great for testing.
>
> With regard to your comments on Standalone
>
> Spark Standalone – a simple cluster manager included with Spark that
> makes it easy to set up a cluster.
>
> s/simple/built-in
> What is stated as "included" implies that, i.e. it comes as part of
> running Spark in standalone mode.
>
> Your other points on YARN cluster mode and YARN client mode
>
> I'd say there's only one YARN master, i.e. --master yarn. You could
> however say where the driver runs, be it on your local machine where
> you executed spark-submit or on one node in a YARN cluster.
>
>
> Yes, that is what I believe the text implied. I would be very surprised if
> YARN as a resource manager relies on two masters :)
>
>
> HTH
>
>
>
>
>
>
>
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 19 June 2016 at 11:46, Jacek Laskowski  wrote:
>
>> On Sun, Jun 19, 2016 at 12:30 PM, Mich Talebzadeh
>>  wrote:
>>
>> > Spark Local - Spark runs on the local host. This is the simplest set up
>> and
>> > best suited for learners who want to understand different concepts of
>> Spark
>> > and those performing unit testing.
>>
>> There are also the less-common master URLs:
>>
>> * local[n, maxRetries] or local[*, maxRetries] — local mode with n
>> threads and maxRetries number of failures.
>> * local-cluster[n, cores, memory] for simulating a Spark local cluster
>> with n workers, # cores per worker, and # memory per worker.
>>
>> As of Spark 2.0.0, you could also have your own scheduling system -
>> see https://issues.apache.org/jira/browse/SPARK-13904 - with the only
>> known implementation of the ExternalClusterManager contract in Spark
>> being YarnClusterManager, i.e. whenever you call Spark with --master
>> yarn.
>>
>> > Spark Standalone – a simple cluster manager included with Spark that
>> makes
>> > it easy to set up a cluster.
>>
>> s/simple/built-in
>>
>> > YARN Cluster Mode, the Spark driver runs inside an application master
>> > process which is managed by YARN on the cluster, and the client can go
>> away
>> > after initiating the application. This is invoked with –master yarn and
>> > --deploy-mode cluster
>> >
>> > YARN Client Mode, the driver runs in the client process, and the
>> application
>> > master is only used for requesting resources from YARN. Unlike Spark
>> > standalone mode, in which the master’s address is specified in the
>> --master
>> > parameter, in YARN mode the ResourceManager’s address is picked up from
>> the
>> > Hadoop configuration. Thus, the --master parameter is yarn. This is
>> invoked
>> > with --deploy-mode client
>>
>> I'd say there's only one YARN master, i.e. --master yarn. You could
>> however say where the driver runs, be it on your local machine where
>> you executed spark-submit or on one node in a YARN cluster.

Spark 2.0 on YARN - Files in config archive not ending up on executor classpath

2016-06-17 Thread Jonathan Kelly
I'm trying to debug a problem in Spark 2.0.0-SNAPSHOT
(commit bdf5fe4143e5a1a393d97d0030e76d35791ee248) where Spark's
log4j.properties is not getting picked up in the executor classpath (and
driver classpath for yarn-cluster mode), so Hadoop's log4j.properties file
is taking precedence in the YARN containers.

Spark's log4j.properties file is correctly being bundled into the
__spark_conf__.zip file and getting added to the DistributedCache, but it
is not in the classpath of the executor, as evidenced by the following
command, which I ran in spark-shell:

scala> sc.parallelize(Seq(1)).map(_ =>
getClass().getResource("/log4j.properties")).first
res3: java.net.URL = file:/etc/hadoop/conf.empty/log4j.properties

I then ran the following in spark-shell to verify the classpath of the
executors:

scala> sc.parallelize(Seq(1)).map(_ =>
System.getProperty("java.class.path")).flatMap(_.split(':')).filter(e =>
!e.endsWith(".jar") && !e.endsWith("*")).collect.foreach(println)
...
/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03/__spark_conf__
/etc/hadoop/conf
...

So the JVM has this nonexistent __spark_conf__ directory in the classpath
when it should really be __spark_conf__.zip (which is actually a symlink to
a directory, despite the .zip filename).

% sudo ls -l
/mnt/yarn/usercache/hadoop/appcache/application_1466208403287_0003/container_1466208403287_0003_01_03
total 20
-rw-r--r-- 1 yarn yarn   88 Jun 18 01:26 container_tokens
-rwx-- 1 yarn yarn  594 Jun 18 01:26
default_container_executor_session.sh
-rwx-- 1 yarn yarn  648 Jun 18 01:26 default_container_executor.sh
-rwx-- 1 yarn yarn 4419 Jun 18 01:26 launch_container.sh
lrwxrwxrwx 1 yarn yarn   59 Jun 18 01:26 __spark_conf__.zip ->
/mnt1/yarn/usercache/hadoop/filecache/17/__spark_conf__.zip
lrwxrwxrwx 1 yarn yarn   77 Jun 18 01:26 __spark_libs__ ->
/mnt/yarn/usercache/hadoop/filecache/16/__spark_libs__4490748779530764463.zip
drwx--x--- 2 yarn yarn   46 Jun 18 01:26 tmp

Does anybody know why this is happening? Is this a bug in Spark, or is it
the JVM doing this (possibly because the extension is .zip)?

Thanks,
Jonathan


Re: Configure Spark Resource on AWS CLI Not Working

2016-03-01 Thread Jonathan Kelly
Weiwei,

Please see this documentation for configuring Spark and other apps on EMR
4.x:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html
This documentation about what has changed between 3.x and 4.x should also
be helpful:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-release-differences.html

~ Jonathan
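
For reference, a hedged sketch of that approach with the CLI (instance type, count, and file name are placeholders; the Spark property values are taken from the original command):

aws emr create-cluster \
  --release-label emr-4.0.0 \
  --applications Name=Hadoop Name=Spark \
  --instance-type m3.xlarge --instance-count 3 \
  --configurations file://./spark-config.json

with spark-config.json containing something like:

[
  {
    "classification": "spark-defaults",
    "properties": {
      "spark.executor.instances": "4",
      "spark.executor.memory": "3000M",
      "spark.driver.memory": "4000M"
    }
  }
]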

On Fri, Feb 26, 2016 at 6:38 PM Weiwei Zhang 
wrote:

> Hi there,
>
> I am trying to configure memory for spark using AWS CLI. However, I got
> the following message:
>
> *A client error (ValidationException) occurred when calling the RunJobFlow
> operation: Cannot specify args for application 'Spark' when release label
> is used.*
>
> In the aws 'create-cluster' command, I have '--release-label emr-4.0.0
> --applications Name=Hadoop
> Name=Spark,Args=[-d,num-executors=4,spark.executor.memory=3000M,spark.driver.memory=4000M]'
> and it seems like I cannot specify args when there is '--release-label'.
> How do I get around this?
>
> I also tried using a JSON configuration file saved in a S3 bucket and add
> "--configurations http://path/bucket/config.json; to the command but it
> gave me an 403 error (access denied). But when I did "aws s3 ls
> (s3://bucket)" I could see that bucket and the config.json in the bucket.
>
> Please advise. Thank you very much.
>
> Best Regards,
> Vivian
>


Re: scikit learn on EMR PySpark

2016-03-01 Thread Jonathan Kelly
Hi, Myles,

We do not install scikit-learn or spark-sklearn on EMR clusters by default,
but you may install them yourself by just doing "sudo pip install
scikit-learn spark-sklearn" (either by ssh'ing to the master instance and
running this manually, or by running it as an EMR Step).

~ Jonathan

On Tue, Mar 1, 2016 at 3:20 PM Gartland, Myles 
wrote:

> New to Spark and MLlib. Coming from scikit-learn.
>
> I am launching my Spark 1.6 instance through AWS EMR and pyspark. All the
> examples using Mllib work fine.
>
> But I have seen a couple of examples where you can combine scikit-learn
> packages and syntax with MLlib.
>
> Like in this example-
> https://databricks.com/blog/2016/02/08/auto-scaling-scikit-learn-with-spark.html
>
> However, it does not seem that Pyspark on AWS EMR comes with scikit (or
> other standard pydata packages) loaded.
>
> Is this something you can/should load on pyspark and how would you do it?
>
> Thanks for assisting.
>
>
> Myles
>


Re: Spark-avro issue in 1.5.2

2016-02-24 Thread Jonathan Kelly
This error is likely due to EMR including some Hadoop lib dirs in
spark.{driver,executor}.extraClassPath. (Hadoop bundles an older version of
Avro than what Spark uses, so you are probably getting bitten by this Avro
mismatch.)

We determined that these Hadoop dirs are not actually necessary to include
in the Spark classpath and in fact seem to be *causing* several problems
such as this one, so we have removed these directories from the
extraClassPath settings for the next EMR release.

For now, you may do the same yourself by using a configuration like the
following when creating your cluster:

[
  {
"classification":"spark-defaults",
"properties": {
  "spark.executor.extraClassPath":
"/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*",
  "spark.driver.extraClassPath":
"/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*"
}
  }
]

(For reference, the removed dirs are /usr/lib/hadoop/*,
/usr/lib/hadoop-hdfs/* and /usr/lib/hadoop-yarn/*.)

Hope this helps!
~ Jonathan

On Wed, Feb 24, 2016 at 1:14 PM  wrote:

> Hadoop 2.6.0 included?
> spark-assembly-1.5.2-hadoop2.6.0.jar
>
> On Feb 24, 2016, at 4:08 PM, Koert Kuipers  wrote:
>
> does your spark version come with batteries (hadoop included) or is it
> build with hadoop provided and you are adding hadoop binaries to classpath
>
> On Wed, Feb 24, 2016 at 3:08 PM,  wrote:
>
>> I’m trying to save a data frame in Avro format but am getting the
>> following error:
>>
>>
>> java.lang.NoSuchMethodError: 
>> org.apache.avro.generic.GenericData.createDatumWriter(Lorg/apache/avro/Schema;)Lorg/apache/avro/io/DatumWriter;
>>
>> I found the following workaround
>> https://github.com/databricks/spark-avro/issues/91
>> which seems to say that this is from a mismatch in Avro versions. I have
>> tried following both solutions detailed to no avail:
>>  - Manually downloading avro-1.7.7.jar and including it in
>> /usr/lib/hadoop-mapreduce/
>>  - Adding avro-1.7.7.jar to spark.driver.extraClassPath and
>> spark.executor.extraClassPath
>>  - The same with avro-1.6.6
>>
>> I am still getting the same error, and now I am just stabbing in the
>> dark. Anyone else still running into this issue?
>>
>>
>> I am using Pyspark 1.5.2 on EMR.
>>
>
>
>


Re: Error :Type mismatch error when passing hdfs file path to spark-csv load method

2016-02-21 Thread Jonathan Kelly
On the line preceding the one that the compiler is complaining about (which
doesn't actually have a problem in itself), you declare df as
"df"+fileName, making it a string. Then you try to assign a DataFrame to
df, but it's already a string. I don't quite understand your intent with
that previous line, but I'm guessing you didn't mean to assign a string to
df.

~ Jonathan
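
For illustration, one way to keep a DataFrame per file is to build a map keyed by file name rather than reusing a single variable; a sketch reusing the path and spark-csv options from the original snippet:

val dfByFile: Map[String, org.apache.spark.sql.DataFrame] =
  hdfsConn.listStatus(new org.apache.hadoop.fs.Path("/TestDivya/Spark/ParentDir/"))
    .map { fileStatus =>
      val filePathName = fileStatus.getPath().toString()
      val fileName = fileStatus.getPath().getName().toLowerCase()
      // key each DataFrame by the file it was loaded from
      fileName -> sqlContext.read.format("com.databricks.spark.csv")
        .option("header", "true")
        .option("inferSchema", "true")
        .load(filePathName)
    }.toMap
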
On Sun, Feb 21, 2016 at 8:45 PM Divya Gehlot 
wrote:

> Hi,
> I am trying to dynamically create Dataframe by reading subdirectories
> under parent directory
>
> My code looks like
>
>> import org.apache.spark._
>> import org.apache.spark.sql._
>> val hadoopConf = new org.apache.hadoop.conf.Configuration()
>> val hdfsConn = org.apache.hadoop.fs.FileSystem.get(new
>> java.net.URI("hdfs://xxx.xx.xx.xxx:8020"), hadoopConf)
>> hdfsConn.listStatus(new
>> org.apache.hadoop.fs.Path("/TestDivya/Spark/ParentDir/")).foreach{
>> fileStatus =>
>>val filePathName = fileStatus.getPath().toString()
>>val fileName = fileStatus.getPath().getName().toLowerCase()
>>var df =  "df"+fileName
>>df =
>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>> "true").option("inferSchema", "true").load(filePathName)
>> }
>
>
> getting below error
>
>> <console>:35: error: type mismatch;
>>  found   : org.apache.spark.sql.DataFrame
>>  required: String
>>  df =
>> sqlContext.read.format("com.databricks.spark.csv").option("header",
>> "true").option("inferSchema", "true").load(filePathName)
>
>
> Am I missing something ?
>
> Would really appreciate the help .
>
>
> Thanks,
> Divya
>
>


Re: Memory issues on spark

2016-02-17 Thread Jonathan Kelly
(I'm not 100% sure, but...) I think the SPARK_EXECUTOR_* environment
variables are intended to be used with Spark Standalone. Even if not, I'd
recommend setting the corresponding properties in spark-defaults.conf
rather than in spark-env.sh.

For example, you may use the following Configuration object to set the
properties to the values you provided in your initial message:

[{
  "classification": "spark-defaults",
  "properties": {
"spark.executor.instances": "16",
"spark.executor.cores": "16",
"spark.executor.memory": "15G",
"spark.driver.memory": "12G",
"spark.kryoserializer.buffer.max": "1024m"
  }
}]

However, these values won't quite work for a 3 node m3.2xlarge cluster.
m3.2xlarge instances have 23040m available to YARN, so you'd only be able
to fit one 15G executor per instance. Also, assuming that when you say "3
node" cluster you mean "1 master instance and 2 core instances", this means
that you only have two instances running executors (the master instance
doesn't run a YARN NodeManager and thus does not run a Spark executor), and
each can only run one executor. In other words, this could very well be why
you are only seeing two executors.

~ Jonathan
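
(As rough arithmetic, assuming Spark's default spark.yarn.executor.memoryOverhead of max(384 MB, 10% of executor memory): a 15G executor requests about 15 GB + 1.5 GB = 16.5 GB per YARN container, and with 23040 MB available per m3.2xlarge node only one such container fits on each node, which matches the two executors observed on a cluster with two core instances.)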

On Wed, Feb 17, 2016 at 5:02 PM  wrote:

> Hi All,
>
> I have been facing memory issues in Spark. I'm using spark-sql on AWS EMR.
> I have around a 50GB file in AWS S3. I want to read this file in a BI tool
> connected to spark-sql on the thrift-server over ODBC. I'm executing select *
> from table in the BI tool (QlikView, Tableau).
> I run into an OOM error sometimes and sometimes a LOST_EXECUTOR. I'm really
> confused.
> Spark runs fine for a smaller data set.
>
> I have 3 node EMR cluster with m3.2xlarge.
>
> I have set below conf on spark.
>
> export SPARK_EXECUTOR_INSTANCES=16
> export SPARK_EXECUTOR_CORES=16
> export SPARK_EXECUTOR_MEMORY=15G
> export SPARK_DRIVER_MEMORY=12G
> spark.kryoserializer.buffer.max 1024m
>
> Even after setting SPARK_EXECUTOR_INSTANCES as 16, only 2 executors come
> up.
>
> This has been a roadblock for a long time. Any help would be appreciated.
>
> Thanks
> Arun.
>
> This e-mail and any files transmitted with it are for the sole use of the
> intended recipient(s) and may contain confidential and privileged
> information. If you are not the intended recipient(s), please reply to the
> sender and destroy all copies of the original message. Any unauthorized
> review, use, disclosure, dissemination, forwarding, printing or copying of
> this email, and/or any action taken in reliance on the contents of this
> e-mail is strictly prohibited and may be unlawful. Where permitted by
> applicable law, this e-mail and other e-mail communications sent to and
> from Cognizant e-mail addresses may be monitored.
>


Re: AM creation in yarn-client mode

2016-02-09 Thread Jonathan Kelly
In yarn-client mode, the driver is separate from the AM. The AM is created
in YARN, and YARN controls where it goes (though you can somewhat control
it using YARN node labels--I just learned earlier today in a different
thread on this list that this can be controlled by
spark.yarn.am.labelExpression). Then what I understand is that the driver
talks to the AM in order to request additional YARN containers in which to
run executors.

In yarn-cluster mode, the SparkSubmit process outside of the cluster
creates the AM in YARN, and then what I understand is that the AM *becomes*
the driver (by invoking the driver's main method), and then it requests the
executor containers.

So yes, one difference between yarn-client and yarn-cluster mode is that in
yarn-client mode the driver and AM are separate, whereas they are the same
in yarn-cluster.

~ Jonathan
On Tue, Feb 9, 2016 at 9:57 PM praveen S  wrote:

> Can you explain what happens in yarn client mode?
>
> Regards,
> Praveen
> On 10 Feb 2016 10:55, "ayan guha"  wrote:
>
>> It depends on yarn-cluster and yarn-client mode.
>>
>> On Wed, Feb 10, 2016 at 3:42 PM, praveen S  wrote:
>>
>>> Hi,
>>>
>>> I have 2 questions when running the spark jobs on yarn in client mode :
>>>
>>> 1) Where is the AM(application master) created :
>>>
>>> A) is it created on the client where the job was submitted? i.e driver
>>> and AM on the same client?
>>> Or
>>> B) yarn decides where the the AM should be created?
>>>
>>> 2) Driver and AM run in different processes : is my assumption correct?
>>>
>>> Regards,
>>> Praveen
>>>
>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>


Re: Dataframe, Spark SQL - Drops First 8 Characters of String on Amazon EMR

2016-01-28 Thread Jonathan Kelly
Just FYI, Spark 1.6 was released on emr-4.3.0 a couple days ago:
https://aws.amazon.com/blogs/aws/emr-4-3-0-new-updated-applications-command-line-export/
On Thu, Jan 28, 2016 at 7:30 PM Andrew Zurn  wrote:

> Hey Daniel,
>
> Thanks for the response.
>
> After playing around for a bit, it looks like it's probably the something
> similar to the first situation you mentioned, with the Parquet format
> causing issues. Both programmatically created dataset and a dataset pulled
> off the internet (rather than out of S3 and put into HDFS/Hive) acted with
> DataFrames as one would expect (printed out everything, grouped properly,
> etc.)
>
> It looks like there is more than likely an outstanding bug that causes
> issues with data coming from S3 and converted into the Parquet format
> (found an article here highlighting it was around in 1.4, and I guess it
> wouldn't be out of the realm of things for it still to exist). Link to the
> article:
> https://www.appsflyer.com/blog/the-bleeding-edge-spark-parquet-and-s3/
>
> Hopefully a little more stability will come out with the upcoming Spark
> 1.6 release on EMR (I think that is happening sometime soon).
>
> Thanks again for the advice on where to dig further into. Much appreciated.
>
> Andrew
>
> On Tue, Jan 26, 2016 at 9:18 AM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
>> Have you tried setting spark.emr.dropCharacters to a lower value? (It
>> defaults to 8.)
>>
>> :) Just joking, sorry! Fantastic bug.
>>
>> What data source do you have for this DataFrame? I could imagine for
>> example that it's a Parquet file and on EMR you are running with two wrong
>> version of the Parquet library and it messes up strings. It should be easy
>> enough to try a different data format. You could also try what happens if
>> you just create the DataFrame programmatically, e.g.
>> sc.parallelize(Seq("asdfasdfasdf")).toDF.
>>
>> To understand better at which point the characters are lost you could try
>> grouping by a string attribute. I see "education" ends up either as ""
>> (empty string) or "y" in the printed output. But are the characters already
>> lost when you try grouping by the attribute? Will there be a single ""
>> category, or will you have separate categories for "primary" and "tertiary"?
>>
>> I think the correct output through the RDD suggests that the issue
>> happens at the very end. So it will probably happen also with different
>> data sources, and grouping will create separate groups for "primary" and
>> "tertiary" even though they are printed as the same string at the end. You
>> should also check the data from "take(10)" to rule out any issues with
>> printing. You could try the same "groupBy" trick after "take(10)". Or you
>> could print the lengths of the strings.
>>
>> Good luck!
>>
>> On Tue, Jan 26, 2016 at 3:53 AM, awzurn  wrote:
>>
>>> Sorry for the bump, but wondering if anyone else has seen this before.
>>> We're
>>> hoping to either resolve this soon, or move on with further steps to move
>>> this into an issue.
>>>
>>> Thanks in advance,
>>>
>>> Andrew Zurn
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/Dataframe-Spark-SQL-Drops-First-8-Characters-of-String-on-Amazon-EMR-tp26022p26065.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Terminating Spark Steps in AWS

2016-01-26 Thread Jonathan Kelly
Daniel,

The "hadoop job -list" command is a deprecated form of "mapred job -list",
which is only for Hadoop MapReduce jobs. For Spark jobs, which run on YARN,
you instead want "yarn application -list".

Hope this helps,
Jonathan (from the EMR team)
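
For example (the application ID shown is a placeholder):

yarn application -list
yarn application -kill application_1453800000000_0001

This lists the running YARN applications and kills the one backing the hung Spark step without touching the rest of the cluster.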

On Tue, Jan 26, 2016 at 10:05 AM Daniel Imberman 
wrote:

> Hi all,
>
> I want to set up a series of spark steps on an EMR spark cluster, and
> terminate the current step if it's taking too long. However, when I ssh
> into
> the master node and run hadoop jobs -list, the master node seems to believe
> that there is no jobs running. I don't want to terminate the cluster,
> because doing so would force me to buy a whole new hour of whatever cluster
> I'm running. Does anyone have any suggestions that would allow me to
> terminate a spark-step in EMR without terminating the entire cluster?
>
> Thank you,
>
> Daniel
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Terminating-Spark-Steps-in-AWS-tp26076.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Read from AWS s3 with out having to hard-code sensitive keys

2016-01-11 Thread Jonathan Kelly
Yes, IAM roles are actually required now for EMR. If you use Spark on EMR
(vs. just EC2), you get S3 configuration for free (it goes by the name
EMRFS), and it will use your IAM role for communicating with S3. Here is
the corresponding documentation:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ManagementGuide/emr-fs.html
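
For example, with an instance role in place the keys disappear from the code entirely; a sketch with a placeholder bucket and path, run on an EMR cluster where EMRFS resolves credentials from the cluster's IAM role:

val lines = sc.textFile("s3://my-bucket/test/testdata")
lines.count()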

On Mon, Jan 11, 2016 at 11:37 AM Matei Zaharia 
wrote:

> In production, I'd recommend using IAM roles to avoid having keys
> altogether. Take a look at
> http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/iam-roles-for-amazon-ec2.html
> .
>
> Matei
>
> On Jan 11, 2016, at 11:32 AM, Sabarish Sasidharan <
> sabarish.sasidha...@manthan.com> wrote:
>
> If you are on EMR, these can go into your hdfs site config. And will work
> with Spark on YARN by default.
>
> Regards
> Sab
> On 11-Jan-2016 5:16 pm, "Krishna Rao"  wrote:
>
>> Hi all,
>>
>> Is there a method for reading from s3 without having to hard-code keys?
>> The only 2 ways I've found both require this:
>>
>> 1. Set conf in code e.g.:
>> sc.hadoopConfiguration().set("fs.s3.awsAccessKeyId", "")
>> sc.hadoopConfiguration().set("fs.s3.awsSecretAccessKey",
>> "")
>>
>> 2. Set keys in URL, e.g.:
>> sc.textFile("s3n://@/bucket/test/testdata")
>>
>>
>> Both if which I'm reluctant to do within production code!
>>
>>
>> Cheers
>>
>
>


Re: Discover SparkUI port for spark streaming job running in cluster mode

2015-12-14 Thread Jonathan Kelly
Oh, nice, I did not know about that property. Thanks!

On Mon, Dec 14, 2015 at 4:28 PM, Ted Yu <yuzhih...@gmail.com> wrote:

> w.r.t. getting application Id, please take a look at the following
> in SparkContext :
>
>   /**
>* A unique identifier for the Spark application.
>* Its format depends on the scheduler implementation.
>* (i.e.
>*  in case of local spark app something like 'local-1433865536131'
>*  in case of YARN something like 'application_1433865536131_34483'
>* )
>*/
>   def applicationId: String = _applicationId
>
> On Mon, Dec 14, 2015 at 2:33 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> Are you running Spark on YARN? If so, you can get to the Spark UI via the
>> YARN ResourceManager. Each running Spark application will have a link on
>> the YARN ResourceManager labeled "ApplicationMaster". If you click that, it
>> will take you to the Spark UI, even if it is running on a slave node in the
>> case of yarn-cluster mode. It does this by proxying the Spark UI through
>> the YARN Proxy Server on the master node.
>>
>> For completed applications, the link will be labeled "History" and will
>> take you to the Spark History Server (provided you have
>> set spark.yarn.historyServer.address in spark-defaults.conf).
>>
>> As for getting the URL programmatically, the URL using the YARN
>> ProxyServer is easy to determine. It's just http://> address>:/proxy/. (e.g.,
>> http://ip-10-150-65-11.ec2.internal:20888/proxy/application_1450128858020_0001/)
>> Then again, I'm not sure how easy it is to get the YARN application ID for
>> a Spark application without parsing the spark-submit logs. Or at least I
>> think I remember some other thread where that was mentioned.
>>
>> ~ Jonathan
>>
>> On Mon, Dec 14, 2015 at 1:57 PM, Ashish Nigam <ashnigamt...@gmail.com>
>> wrote:
>>
>>> Hi,
>>> I run spark streaming job in cluster mode. This means that driver can
>>> run in any data node. And Spark UI can run in any dynamic port.
>>> At present, I know about the port by looking at container logs that look
>>> something like this -
>>>
>>> server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:50571
>>> INFO util.Utils: Successfully started service 'SparkUI' on port 50571.
>>> INFO ui.SparkUI: Started SparkUI at http://xxx:50571
>>>
>>>
>>> Is there any way to know about the UI port automatically using some API?
>>>
>>> Thanks
>>> Ashish
>>>
>>
>>
>
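
Putting those two pieces together, a sketch of deriving the proxy URL programmatically; the ResourceManager host below is a placeholder and port 20888 is the EMR proxy port from the example above:

val rmHost = "ip-10-150-65-11.ec2.internal"   // placeholder: the YARN ResourceManager host
val appId  = sc.applicationId                 // e.g. application_1450128858020_0001 on YARN
val uiUrl  = s"http://$rmHost:20888/proxy/$appId/"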


Re: Discover SparkUI port for spark streaming job running in cluster mode

2015-12-14 Thread Jonathan Kelly
Are you running Spark on YARN? If so, you can get to the Spark UI via the
YARN ResourceManager. Each running Spark application will have a link on
the YARN ResourceManager labeled "ApplicationMaster". If you click that, it
will take you to the Spark UI, even if it is running on a slave node in the
case of yarn-cluster mode. It does this by proxying the Spark UI through
the YARN Proxy Server on the master node.

For completed applications, the link will be labeled "History" and will
take you to the Spark History Server (provided you have
set spark.yarn.historyServer.address in spark-defaults.conf).

As for getting the URL programmatically, the URL using the YARN ProxyServer
is easy to determine. It's just http://:/proxy/. (e.g.,
http://ip-10-150-65-11.ec2.internal:20888/proxy/application_1450128858020_0001/)
Then again, I'm not sure how easy it is to get the YARN application ID for
a Spark application without parsing the spark-submit logs. Or at least I
think I remember some other thread where that was mentioned.

~ Jonathan

On Mon, Dec 14, 2015 at 1:57 PM, Ashish Nigam 
wrote:

> Hi,
> I run spark streaming job in cluster mode. This means that driver can run
> in any data node. And Spark UI can run in any dynamic port.
> At present, I know about the port by looking at container logs that look
> something like this -
>
> server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:50571
> INFO util.Utils: Successfully started service 'SparkUI' on port 50571.
> INFO ui.SparkUI: Started SparkUI at http://xxx:50571
>
>
> Is there any way to know about the UI port automatically using some API?
>
> Thanks
> Ashish
>


Re: spark-ec2 vs. EMR

2015-12-04 Thread Jonathan Kelly
Sending this to the list again because I'm pretty sure it didn't work the
first time. A colleague just realized he was having the same problem with
the list not accepting his posts, but unsubscribing and re-subscribing
seemed to fix the issue for him. I've just unsubscribed and re-subscribed
too, so hopefully this works...

On Wednesday, December 2, 2015, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> EMR is currently running a private preview of an upcoming feature allowing
> EMR clusters to be launched in VPC private subnets. This will allow you to
> launch a cluster in a subnet without an Internet Gateway attached. Please
> contact jonfr...@amazon.com
> <javascript:_e(%7B%7D,'cvml','jonfr...@amazon.com');> if you would like
> more information.
>
> ~ Jonathan
>
> Note: jonfr...@amazon.com
> <javascript:_e(%7B%7D,'cvml','jonfr...@amazon.com');> is not me. I'm a
> different Jonathan. :)
>
> On Wed, Dec 2, 2015 at 10:21 AM, Jerry Lam <chiling...@gmail.com
> <javascript:_e(%7B%7D,'cvml','chiling...@gmail.com');>> wrote:
>
>> Hi Dana,
>>
>> Yes, we get VPC + EMR working but I'm not the person who deploys it. It
>> is related to subnet as Alex points out.
>>
>> Just to want to add another point, spark-ec2 is nice to keep and improve
>> because it allows users to any version of spark (nightly-build for
>> example). EMR does not allow you to do that without manual process.
>>
>> Best Regards,
>>
>> Jerry
>>
>> On Wed, Dec 2, 2015 at 1:02 PM, Alexander Pivovarov <apivova...@gmail.com
>> <javascript:_e(%7B%7D,'cvml','apivova...@gmail.com');>> wrote:
>>
>>> Do you think it's a security issue if EMR started in VPC with a subnet
>>> having Auto-assign Public IP: Yes
>>>
>>> you can remove all Inbound rules having 0.0.0.0/0 Source in master and
>>> slave Security Group
>>> So, master and slave boxes will be accessible only for users who are on
>>> VPN
>>>
>>>
>>>
>>>
>>> On Wed, Dec 2, 2015 at 9:44 AM, Dana Powers <dana.pow...@gmail.com> wrote:
>>>
>>>> EMR was a pain to configure on a private VPC last I tried. Has anyone
>>>> had success with that? I found spark-ec2 easier to use w private
>>>> networking, but also agree that I would use for prod.
>>>>
>>>> -Dana
>>>> On Dec 1, 2015 12:29 PM, "Alexander Pivovarov" <apivova...@gmail.com> wrote:
>>>>
>>>>> 1. Emr 4.2.0 has Zeppelin as an alternative to DataBricks Notebooks
>>>>>
>>>>> 2. Emr has Ganglia 3.6.0
>>>>>
>>>>> 3. Emr has hadoop fs settings to make s3 work fast
>>>>> (direct.EmrFileSystem)
>>>>>
>>>>> 4. EMR has s3 keys in hadoop configs
>>>>>
>>>>> 5. EMR allows you to resize the cluster on the fly.
>>>>>
>>>>> 6. EMR has aws sdk in spark classpath. Helps to reduce app assembly
>>>>> jar size
>>>>>
>>>>> 7. The ec2 script installs everything in /root; EMR has dedicated users:
>>>>> hadoop, zeppelin, etc. EMR is similar to Cloudera or Hortonworks.
>>>>>
>>>>> 8. There are at least 3 spark-ec2 projects (in apache/spark, in mesos, in
>>>>> amplab). The master branch in spark has an outdated ec2 script. The other
>>>>> projects have broken links in their readmes. WHAT A MESS!
>>>>>
>>>>> 9. The ec2 script has bad documentation and uninformative error messages,
>>>>> e.g. the readme does not say anything about the --private-ips option. If
>>>>> you do not add the flag, it will connect to an empty-string host (localhost)
>>>>> instead of the master. Fixed only last week; not sure if it is fixed in all
>>>>> branches.
>>>>>
>>>>> 10. I think Amazon will include spark-jobserver in EMR soon.
>>>>>
>>>>> 11. You do not need to be an AWS expert to start an EMR cluster. Users can
>>>>> use the EMR web UI to start a cluster to run some jobs or work in Zeppelin
>>>>> during the day.
>>>>>
>>>>> 12. An EMR cluster starts in about 8 minutes. The ec2 script takes longer,
>>>>> and you need to be online.
>>>>> On Dec 1, 2015 9:22 AM, "Jerry Lam" <chiling...@gmail.com> wrote:
>

Re: Spark Tasks on second node never return in Yarn when I have more than 1 task node

2015-11-19 Thread Jonathan Kelly
I don't know if this actually has anything to do with why your job is
hanging, but since you are using EMR you should probably not set those
fs.s3 properties but rather let it use EMRFS, EMR's optimized Hadoop
FileSystem implementation for interacting with S3. One benefit is that it
will automatically pick up your AWS credentials from your EC2 instance role
rather than you having to configure them manually (since doing so is
insecure because you have to get the secret access key onto your instance).
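
As a rough sketch of what that looks like (an assumption about the simplified
setup, not code taken from this thread), the context creation can drop all of
the fs.s3* settings shown in the quoted code below and rely on EMRFS plus the
instance role:

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaSparkContext;

    public class EmrContextFactory {
        // No hadoopConf.set("fs.s3.awsAccessKeyId", ...) calls needed: on EMR,
        // s3:// URIs go through EMRFS, which resolves credentials from the
        // EC2 instance profile.
        public static JavaSparkContext create(String appName) {
            return new JavaSparkContext(new SparkConf().setAppName(appName));
        }
    }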

If simply making that change does not fix the issue, a jstack of the hung
process would help you figure out what it is doing. You should also look at
the YARN container logs (which automatically get uploaded to your S3 logs
bucket if you have this enabled).

~ Jonathan

On Thu, Nov 19, 2015 at 1:32 PM, Shuai Zheng  wrote:

> Hi All,
>
>
>
> I am facing a very weird case. I have already simplified the scenario as much
> as possible so everyone can reproduce it.
>
>
>
> My env:
>
>
>
> AWS EMR 4.1.0, Spark1.5
>
>
>
> My code runs without any problem in local mode, and it also has no problem
> when it runs on an EMR cluster with one master and one task node.
>
>
>
> But when I try to run it on a multi-node cluster (more than 1 task node, i.e.
> a 3-node cluster), the tasks on one of the nodes never return.
>
>
>
> The log as below:
>
>
>
> 15/11/19 21:19:07 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID
> 0, ip-10-165-121-188.ec2.internal, PROCESS_LOCAL, 2241 bytes)
>
> 15/11/19 21:19:07 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID
> 1, ip-10-155-160-147.ec2.internal, PROCESS_LOCAL, 2241 bytes)
>
> 15/11/19 21:19:07 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID
> 2, ip-10-165-121-188.ec2.internal, PROCESS_LOCAL, 2241 bytes)
>
> 15/11/19 21:19:07 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID
> 3, ip-10-155-160-147.ec2.internal, PROCESS_LOCAL, 2241 bytes)
>
>
>
> So you can see the tasks are submitted alternately to two instances, one
> being ip-10-165-121-188 and the other ip-10-155-160-147.
>
> Later, only the tasks running on ip-10-165-121-188.ec2 finish; the ones on
> ip-10-155-160-147.ec2 just wait there and never return.
>
>
>
> The data and code have been tested in local mode and single-node Spark
> cluster mode, so it should not be an issue with the logic or the data.
>
>
>
> And I have attached my test case here (I believe it is simple enough, and
> no business logic is involved):
>
>
>
>    public void createSiteGridExposure2() {
>       JavaSparkContext ctx = this.createSparkContextTest("Test");
>       ctx.textFile(siteEncodeLocation).flatMapToPair(
>             new PairFlatMapFunction<String, String, String>() {
>                @Override
>                public Iterable<Tuple2<String, String>> call(String line) throws Exception {
>                   List<Tuple2<String, String>> res = new ArrayList<Tuple2<String, String>>();
>                   return res;
>                }
>             }).collectAsMap();
>       ctx.stop();
>    }
>
>    protected JavaSparkContext createSparkContextTest(String appName) {
>       SparkConf sparkConf = new SparkConf().setAppName(appName);
>       JavaSparkContext ctx = new JavaSparkContext(sparkConf);
>       Configuration hadoopConf = ctx.hadoopConfiguration();
>       if (awsAccessKeyId != null) {
>          hadoopConf.set("fs.s3.impl",
>                "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
>          hadoopConf.set("fs.s3.awsAccessKeyId", awsAccessKeyId);
>          hadoopConf.set("fs.s3.awsSecretAccessKey", awsSecretAccessKey);
>
>          hadoopConf.set("fs.s3n.impl",
>                "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
>          hadoopConf.set("fs.s3n.awsAccessKeyId", awsAccessKeyId);
>          hadoopConf.set("fs.s3n.awsSecretAccessKey", awsSecretAccessKey);
>       }
>       return ctx;
>    }
>
>
>
>
>
> Does anyone have any idea why this happens? I am a bit lost because the code
> works in local mode and on a 2-node (1 master, 1 task) cluster, but when it
> moves to a cluster with multiple task nodes, I have this issue. No error, no
> exception, not even a timeout (I waited more than an hour and there was no
> timeout either).
>
>
>
> Regards,
>
>
>
> Shuai
>


Re: spark-submit stuck and no output in console

2015-11-16 Thread Jonathan Kelly
He means for you to use jstack to obtain a stacktrace of all of the
threads. Or are you saying that the Java process never even starts?
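
For example (a hedged sketch; the PID and output path are placeholders):

    jps -lm                                       # find the PID of the SparkSubmit JVM
    jstack <pid> > /tmp/spark-submit-threads.txt  # dump stack traces of all of its threads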

On Mon, Nov 16, 2015 at 7:48 AM, Kayode Odeyemi  wrote:

> Spark 1.5.1
>
> The fact is that there's no stack trace. No output from that command at
> all to the console.
>
> This is all I get:
>
> hadoop-user@yks-hadoop-m01:/usr/local/spark/bin$ tail -1
> /tmp/spark-profile-job.log
> nohup: ignoring input
> /usr/local/spark/bin/spark-class: line 76: 29516 Killed
>  "$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@"
>
>
> On Mon, Nov 16, 2015 at 5:22 PM, Ted Yu  wrote:
>
>> Which release of Spark are you using ?
>>
>> Can you take stack trace and pastebin it ?
>>
>> Thanks
>>
>> On Mon, Nov 16, 2015 at 5:50 AM, Kayode Odeyemi 
>> wrote:
>>
>>> ./spark-submit --class com.migration.UpdateProfiles --executor-memory 8g
>>> ~/migration-profiles-0.1-SNAPSHOT.jar
>>>
>>> is stuck and outputs nothing to the console.
>>>
>>> What could be the cause of this? Current max heap size is 1.75g and it's
>>> only using 1g.
>>>
>>>
>>
>
>
> --
> Odeyemi 'Kayode O.
> http://ng.linkedin.com/in/kayodeodeyemi. t: @charyorde
>


Re: Spark EC2 script on Large clusters

2015-11-05 Thread Jonathan Kelly
Christian,

Is there anything preventing you from using EMR, which will manage your
cluster for you? Creating large clusters would take mins on EMR instead of
hours. Also, EMR supports growing your cluster easily and recently added
support for shrinking your cluster gracefully (even while jobs are running).

~ Jonathan

On Thu, Nov 5, 2015 at 9:48 AM, Nicholas Chammas  wrote:

> Yeah, as Shivaram mentioned, this issue is well-known. It's documented in
> SPARK-5189  and a bunch
> of related issues. Unfortunately, it's hard to resolve this issue in
> spark-ec2 without rewriting large parts of the project. But if you take a
> crack at it and succeed I'm sure a lot of people will be happy.
>
> I've started a separate project  --
> which Shivaram also mentioned -- which aims to solve the problem of long
> launch times and other issues
>  with spark-ec2. It's
> still very young and lacks several critical features, but we are making
> steady progress.
>
> Nick
>
> On Thu, Nov 5, 2015 at 12:30 PM Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> It is a known limitation that spark-ec2 is very slow for large
>> clusters and as you mention most of this is due to the use of rsync to
>> transfer things from the master to all the slaves.
>>
>> Nick cc'd has been working on an alternative approach at
>> https://github.com/nchammas/flintrock that is more scalable.
>>
>> Thanks
>> Shivaram
>>
>> On Thu, Nov 5, 2015 at 8:12 AM, Christian  wrote:
>> > For starters, thanks for the awesome product!
>> >
>> > When creating ec2-clusters of 20-40 nodes, things work great. When we
>> > create a cluster with the provided spark-ec2 script, it takes hours. When
>> > creating a 200 node cluster, it takes 2 1/2 hours, and for a 500 node
>> > cluster it takes over 5 hours. One other problem we are having is that
>> > some nodes don't come up when the other ones do; the process seems to
>> > just move on, skipping the rsync and any installs on those nodes.
>> >
>> > My guess as to why it takes so long to set up a large cluster is the use
>> > of rsync. What if, instead of using rsync, you synced to S3 and then did
>> > a pdsh to pull it down on all of the machines? This is a big deal for us,
>> > and if we can come up with a good plan, we might be able to help out with
>> > the required changes.
>> >
>> > Are there any suggestions on how to deal with some of the nodes not
>> > being ready when the process starts?
>> >
>> > Thanks for your time,
>> > Christian
>> >
>>
>


Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-14 Thread Jonathan Kelly
Ah, yes, it will use private IPs, so you may need to update your FoxyProxy
settings to include the private IPs in the regex as well as the public IPs.

Also, yes, for completed applications you may use the Spark History Server
on port 18080. The YARN ProxyServer will automatically redirect to the
Spark History Server once a job has completed, so you can still start from
the YARN ResourceManager. In the case of completed applications, the link
on the YARN ResourceManager will be "History" instead of
"ApplicationMaster".

~ Jonathan

On Wed, Oct 14, 2015 at 12:57 AM, Joshua Fox <jos...@twiggle.com> wrote:

> Thank you!
>
> It seems that the history server at port 18080 also gives access to
> the Spark GUI, as below.
>
> Following your tip, I see that the YARN ResourceManager GUI on 8088
> indeed has that ApplicationMaster link, though to a private rather than
> public IP; replacing IPs brings me to the same Spark GUI.
>
> Joshua
> [image: Inline image 3]
>
>
>
>
> On Tue, Oct 13, 2015 at 6:23 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> Joshua,
>>
>> Since Spark is configured to run on YARN in EMR, instead of viewing the
>> Spark application UI at port 4040, you should instead start from the YARN
>> ResourceManager (on port 8088), then click on the ApplicationMaster link
>> for the Spark application you are interested in. This will take you to the
>> YARN ProxyServer on port 20888, which will proxy you through to the Spark
>> UI for the application (which renders correctly when viewed this way). This
>> works even if the Spark UI is running on a port other than 4040 and even in
>> yarn-cluster mode when the Spark driver is running on a slave node.
>>
>> Hope this helps,
>> Jonathan
>>
>> On Tue, Oct 13, 2015 at 7:19 AM, Jean-Baptiste Onofré <j...@nanthrax.net>
>> wrote:
>>
>>> Thanks for the update Joshua.
>>>
>>> Let me try with Spark 1.4.1.
>>>
>>> I keep you posted.
>>>
>>> Regards
>>> JB
>>>
>>> On 10/13/2015 04:17 PM, Joshua Fox wrote:
>>>
>>>>   * Spark 1.4.1, part of EMR emr-4.0.0
>>>>   * Chrome Version 41.0.2272.118 (64-bit) on Ubuntu
>>>>
>>>>
>>>> On Tue, Oct 13, 2015 at 3:27 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>>>
>>>> Hi Joshua,
>>>>
>>>> What's the Spark version and what's your browser ?
>>>>
>>>> I just tried on Spark 1.6-SNAPSHOT with firefox and it works fine.
>>>>
>>>> Thanks
>>>> Regards
>>>> JB
>>>>
>>>> On 10/13/2015 02:17 PM, Joshua Fox wrote:
>>>>
>>>> I am accessing the Spark Jobs Web GUI, running on AWS EMR.
>>>>
>>>> I can access this webapp (port 4040 as per default), but it only
>>>> half-renders, producing "Uncaught SyntaxError: Unexpected token
>>>> <"
>>>>
>>>> Here is a screenshot <http://i.imgur.com/qP2rH46.png> including
>>>> Chrome
>>>> Developer Console.
>>>>
>>>> Screenshot <http://i.stack.imgur.com/cf8gp.png>
>>>>
>>>> Here are some of the error messages in my Chrome console.
>>>>
>>>> /Uncaught SyntaxError: Unexpected token <
>>>> (index):3 Resource interpreted as Script but transferred with
>>>> MIME type
>>>> text/html:
>>>> "
>>>> http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;.
>>>> (index):74 Uncaught ReferenceError: drawApplicationTimeline is
>>>> not defined
>>>> (index):12 Resource interpreted as Image but transferred with
>>>> MIME type
>>>> text/html:
>>>> "
>>>> http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;
>>>>
>>>> /
>>>> Note that the History GUI at port 18080 and the Hadoop GUI at
>>>> port 8088
>>>> work fine, and the Spark jobs GUI does partly render. So, it
>>>> seems that
>>>> my browser proxy is not the cause of this problem.
>>>>
>>>> Joshua
>>>>
>>>>
>>>> --
>>>> Jean-Baptiste Onofré
>>>> jbono...@apache.org
>>>> http://blog.nanthrax.net
>>>> Talend - http://www.talend.com
>>>>
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>>
>>> --
>>> Jean-Baptiste Onofré
>>> jbono...@apache.org
>>> http://blog.nanthrax.net
>>> Talend - http://www.talend.com
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>
>


Re: Why is the Spark Web GUI failing with JavaScript "Uncaught SyntaxError"?

2015-10-13 Thread Jonathan Kelly
Joshua,

Since Spark is configured to run on YARN in EMR, instead of viewing the
Spark application UI at port 4040, you should instead start from the YARN
ResourceManager (on port 8088), then click on the ApplicationMaster link
for the Spark application you are interested in. This will take you to the
YARN ProxyServer on port 20888, which will proxy you through to the Spark
UI for the application (which renders correctly when viewed this way). This
works even if the Spark UI is running on a port other than 4040 and even in
yarn-cluster mode when the Spark driver is running on a slave node.

Hope this helps,
Jonathan

On Tue, Oct 13, 2015 at 7:19 AM, Jean-Baptiste Onofré 
wrote:

> Thanks for the update Joshua.
>
> Let me try with Spark 1.4.1.
>
> I keep you posted.
>
> Regards
> JB
>
> On 10/13/2015 04:17 PM, Joshua Fox wrote:
>
>>   * Spark 1.4.1, part of EMR emr-4.0.0
>>   * Chrome Version 41.0.2272.118 (64-bit) on Ubuntu
>>
>>
>> On Tue, Oct 13, 2015 at 3:27 PM, Jean-Baptiste Onofré <j...@nanthrax.net> wrote:
>>
>> Hi Joshua,
>>
>> What's the Spark version and what's your browser ?
>>
>> I just tried on Spark 1.6-SNAPSHOT with firefox and it works fine.
>>
>> Thanks
>> Regards
>> JB
>>
>> On 10/13/2015 02:17 PM, Joshua Fox wrote:
>>
>> I am accessing the Spark Jobs Web GUI, running on AWS EMR.
>>
>> I can access this webapp (port 4040 as per default), but it only
>> half-renders, producing "Uncaught SyntaxError: Unexpected token <"
>>
>> Here is a screenshot  including
>> Chrome
>> Developer Console.
>>
>> Screenshot 
>>
>> Here are some of the error messages in my Chrome console.
>>
>> /Uncaught SyntaxError: Unexpected token <
>> (index):3 Resource interpreted as Script but transferred with
>> MIME type
>> text/html:
>> "
>> http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;.
>> (index):74 Uncaught ReferenceError: drawApplicationTimeline is
>> not defined
>> (index):12 Resource interpreted as Image but transferred with
>> MIME type
>> text/html:
>> "
>> http://ec2-52-89-59-167.us-west-2.compute.amazonaws.com:4040/jobs/;
>>
>> /
>> Note that the History GUI at port 18080 and the Hadoop GUI at
>> port 8088
>> work fine, and the Spark jobs GUI does partly render. So, it
>> seems that
>> my browser proxy is not the cause of this problem.
>>
>> Joshua
>>
>>
>> --
>> Jean-Baptiste Onofré
>> jbono...@apache.org 
>> http://blog.nanthrax.net
>> Talend - http://www.talend.com
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> 
>> For additional commands, e-mail: user-h...@spark.apache.org
>> 
>>
>>
>>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-24 Thread Jonathan Kelly
I cut https://issues.apache.org/jira/browse/SPARK-10790 for this issue.

On Wed, Sep 23, 2015 at 8:38 PM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> AHA! I figured it out, but it required some tedious remote debugging of
> the Spark ApplicationMaster. (But now I understand the Spark codebase a
> little better than before, so I guess I'm not too put out. =P)
>
> Here's what's happening...
>
> I am setting spark.dynamicAllocation.minExecutors=1 but am not setting
> spark.dynamicAllocation.initialExecutors, so it's remaining at the default
> of spark.dynamicAllocation.minExecutors. However, ExecutorAllocationManager
> doesn't actually request any executors while the application is still
> initializing (see comment here
> <https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L292>),
> but it still sets numExecutorsTarget to
> spark.dynamicAllocation.initialExecutors (i.e., 1).
>
> The JavaWordCount example I've been trying to run is only operating on a
> very small file, so its first stage only has a single task and thus should
> request a single executor once the polling loop comes along.
>
> Then on this line
> <https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L308>,
> it returns numExecutorsTarget (1) - oldNumExecutorsTarget (still 1, even
> though there aren't any executors running yet) = 0, for the number of
> executors it should request. Then the app hangs forever because it never
> requests any executors.
>
> I verified this further by setting
> spark.dynamicAllocation.minExecutors=100 and trying to run my SparkPi
> example I mentioned earlier (which runs 100 tasks in its first stage
> because that's the number I'm passing to the driver). Then it would hang in
> the same way as my JavaWordCount example. If I run it again, passing 101
> (so that it has 101 tasks), it works, and if I pass 99, it hangs again.
>
> So it seems that I have found a bug in that if you set
> spark.dynamicAllocation.minExecutors (or, presumably,
> spark.dynamicAllocation.initialExecutors), and the number of tasks in your
> first stage is less than or equal to this min/init number of executors, it
> won't actually request any executors and will just hang indefinitely.
>
> I can't seem to find a JIRA for this, so shall I file one, or has anybody
> else seen anything like this?
>
> ~ Jonathan
>
> On Wed, Sep 23, 2015 at 7:08 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> Another update that doesn't make much sense:
>>
>> The SparkPi example does work on yarn-cluster mode with dynamicAllocation.
>>
>> That is, the following command works (as well as with yarn-client mode):
>>
>> spark-submit --deploy-mode cluster --class
>> org.apache.spark.examples.SparkPi spark-examples.jar 100
>>
>> But the following one does not work (nor does it work for yarn-client
>> mode):
>>
>> spark-submit --deploy-mode cluster --class
>> org.apache.spark.examples.JavaWordCount spark-examples.jar
>> /tmp/word-count-input.txt
>>
>> So this JavaWordCount example hangs on requesting executors, while
>> SparkPi and spark-shell do work.
>>
>> ~ Jonathan
>>
>> On Wed, Sep 23, 2015 at 6:22 PM, Jonathan Kelly <jonathaka...@gmail.com>
>> wrote:
>>
>>> Thanks for the quick response!
>>>
>>> spark-shell is indeed using yarn-client. I forgot to mention that I also
>>> have "spark.master yarn-client" in my spark-defaults.conf file too.
>>>
>>> The working spark-shell and my non-working example application both
>>> display spark.scheduler.mode=FIFO on the Spark UI. Is that what you are
>>> asking about? I haven't actually messed around with different scheduler
>>> modes yet.
>>>
>>> One more thing I should mention is that the YARN ResourceManager tells
>>> me the following on my 5-node cluster, with one node being the master and
>>> not running a NodeManager:
>>> Memory Used: 1.50 GB (this is the running ApplicationMaster that's
>>> waiting and waiting for the executors to start up)
>>> Memory Total: 45 GB (11.25 from each of the 4 slave nodes)
>>> VCores Used: 1
>>> VCores Total: 32
>>> Active Nodes: 4
>>>
>>> ~ Jonathan
>>>
>>> On Wed, Sep 23, 2015 at 6:10 PM, Andrew Duffy <andrewedu...@gmail.com>
>>> wrote:
>>>
>>>> What pool is the spark shell being put into? (You can see this through
>>>> the YARN UI under scheduler)
>>>>
>>>>

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
AHA! I figured it out, but it required some tedious remote debugging of the
Spark ApplicationMaster. (But now I understand the Spark codebase a little
better than before, so I guess I'm not too put out. =P)

Here's what's happening...

I am setting spark.dynamicAllocation.minExecutors=1 but am not setting
spark.dynamicAllocation.initialExecutors, so it's remaining at the default
of spark.dynamicAllocation.minExecutors. However, ExecutorAllocationManager
doesn't actually request any executors while the application is still
initializing (see comment here
<https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L292>),
but it still sets numExecutorsTarget to
spark.dynamicAllocation.initialExecutors (i.e., 1).

The JavaWordCount example I've been trying to run is only operating on a
very small file, so its first stage only has a single task and thus should
request a single executor once the polling loop comes along.

Then on this line
<https://github.com/apache/spark/blob/v1.5.0/core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala#L308>,
it returns numExecutorsTarget (1) - oldNumExecutorsTarget (still 1, even
though there aren't any executors running yet) = 0, for the number of
executors it should request. Then the app hangs forever because it never
requests any executors.
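
As a purely illustrative sketch of that arithmetic (this simulates the
behavior described above; it is not Spark's actual ExecutorAllocationManager
code, and the variable names are assumptions):

    public class AllocationDeltaDemo {
        public static void main(String[] args) {
            int minExecutors = 1;                 // spark.dynamicAllocation.minExecutors
            int initialExecutors = minExecutors;  // initialExecutors defaults to minExecutors
            int tasksInFirstStage = 1;            // the tiny word-count input yields one task

            int oldNumExecutorsTarget = initialExecutors;  // already 1, though no executor is running yet
            int numExecutorsTarget = Math.max(tasksInFirstStage, minExecutors);

            int executorsToRequest = numExecutorsTarget - oldNumExecutorsTarget;
            System.out.println("executors requested: " + executorsToRequest);  // prints 0 -> the app hangs
        }
    }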

I verified this further by setting spark.dynamicAllocation.minExecutors=100
and trying to run my SparkPi example I mentioned earlier (which runs 100
tasks in its first stage because that's the number I'm passing to the
driver). Then it would hang in the same way as my JavaWordCount example. If
I run it again, passing 101 (so that it has 101 tasks), it works, and if I
pass 99, it hangs again.

So it seems that I have found a bug in that if you set
spark.dynamicAllocation.minExecutors (or, presumably,
spark.dynamicAllocation.initialExecutors), and the number of tasks in your
first stage is less than or equal to this min/init number of executors, it
won't actually request any executors and will just hang indefinitely.

I can't seem to find a JIRA for this, so shall I file one, or has anybody
else seen anything like this?

~ Jonathan

On Wed, Sep 23, 2015 at 7:08 PM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> Another update that doesn't make much sense:
>
> The SparkPi example does work on yarn-cluster mode with dynamicAllocation.
>
> That is, the following command works (as well as with yarn-client mode):
>
> spark-submit --deploy-mode cluster --class
> org.apache.spark.examples.SparkPi spark-examples.jar 100
>
> But the following one does not work (nor does it work for yarn-client
> mode):
>
> spark-submit --deploy-mode cluster --class
> org.apache.spark.examples.JavaWordCount spark-examples.jar
> /tmp/word-count-input.txt
>
> So this JavaWordCount example hangs on requesting executors, while SparkPi
> and spark-shell do work.
>
> ~ Jonathan
>
> On Wed, Sep 23, 2015 at 6:22 PM, Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> Thanks for the quick response!
>>
>> spark-shell is indeed using yarn-client. I forgot to mention that I also
>> have "spark.master yarn-client" in my spark-defaults.conf file too.
>>
>> The working spark-shell and my non-working example application both
>> display spark.scheduler.mode=FIFO on the Spark UI. Is that what you are
>> asking about? I haven't actually messed around with different scheduler
>> modes yet.
>>
>> One more thing I should mention is that the YARN ResourceManager tells me
>> the following on my 5-node cluster, with one node being the master and not
>> running a NodeManager:
>> Memory Used: 1.50 GB (this is the running ApplicationMaster that's
>> waiting and waiting for the executors to start up)
>> Memory Total: 45 GB (11.25 from each of the 4 slave nodes)
>> VCores Used: 1
>> VCores Total: 32
>> Active Nodes: 4
>>
>> ~ Jonathan
>>
>> On Wed, Sep 23, 2015 at 6:10 PM, Andrew Duffy <andrewedu...@gmail.com>
>> wrote:
>>
>>> What pool is the spark shell being put into? (You can see this through
>>> the YARN UI under scheduler)
>>>
>>> Are you certain you're starting spark-shell up on YARN? By default it
>>> uses a local spark executor, so if it "just works" then it's because it's
>>> not using dynamic allocation.
>>>
>>>
>>> On Wed, Sep 23, 2015 at 18:04 Jonathan Kelly <jonathaka...@gmail.com>
>>> wrote:
>>>
>>>> I'm running into a problem with YARN dynamicAllocation on Spark 1.5.0
>>>> after using it successfully on an identically configured cluster with Spark
>>>> 1.4.1.
>>>>
>>>> I'm g

Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
Thanks for the quick response!

spark-shell is indeed using yarn-client. I forgot to mention that I also
have "spark.master yarn-client" in my spark-defaults.conf file too.

The working spark-shell and my non-working example application both display
spark.scheduler.mode=FIFO on the Spark UI. Is that what you are asking
about? I haven't actually messed around with different scheduler modes yet.

One more thing I should mention is that the YARN ResourceManager tells me
the following on my 5-node cluster, with one node being the master and not
running a NodeManager:
Memory Used: 1.50 GB (this is the running ApplicationMaster that's waiting
and waiting for the executors to start up)
Memory Total: 45 GB (11.25 from each of the 4 slave nodes)
VCores Used: 1
VCores Total: 32
Active Nodes: 4

~ Jonathan

On Wed, Sep 23, 2015 at 6:10 PM, Andrew Duffy <andrewedu...@gmail.com>
wrote:

> What pool is the spark shell being put into? (You can see this through the
> YARN UI under scheduler)
>
> Are you certain you're starting spark-shell up on YARN? By default it uses
> a local spark executor, so if it "just works" then it's because it's not
> using dynamic allocation.
>
>
> On Wed, Sep 23, 2015 at 18:04 Jonathan Kelly <jonathaka...@gmail.com>
> wrote:
>
>> I'm running into a problem with YARN dynamicAllocation on Spark 1.5.0
>> after using it successfully on an identically configured cluster with Spark
>> 1.4.1.
>>
>> I'm getting the dreaded warning "YarnClusterScheduler: Initial job has
>> not accepted any resources; check your cluster UI to ensure that workers
>> are registered and have sufficient resources", though there's nothing else
>> running on my cluster, and the nodes should have plenty of resources to run
>> my application.
>>
>> Here are the applicable properties in spark-defaults.conf:
>> spark.dynamicAllocation.enabled  true
>> spark.dynamicAllocation.minExecutors 1
>> spark.shuffle.service.enabled true
>>
>> When trying out my example application (just the JavaWordCount example
>> that comes with Spark), I had not actually set spark.executor.memory or any
>> CPU core-related properties, but setting the spark.executor.memory to a low
>> value like 64m doesn't help either.
>>
>> I've tried a 5-node cluster and 1-node cluster of m3.xlarges, so each
>> node has 15.0GB and 4 cores.
>>
>> I've also tried both yarn-cluster and yarn-client mode and get the same
>> behavior for both, except that for yarn-client mode the application never
>> even shows up in the YARN ResourceManager. However, spark-shell seems to
>> work just fine (when I run commands, it starts up executors dynamically
>> just fine), which makes no sense to me.
>>
>> What settings/logs should I look at to debug this, and what more
>> information can I provide? Your help would be very much appreciated!
>>
>> Thanks,
>> Jonathan
>>
>


Re: Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
Another update that doesn't make much sense:

The SparkPi example does work on yarn-cluster mode with dynamicAllocation.

That is, the following command works (as well as with yarn-client mode):

spark-submit --deploy-mode cluster --class
org.apache.spark.examples.SparkPi spark-examples.jar 100

But the following one does not work (nor does it work for yarn-client mode):

spark-submit --deploy-mode cluster --class
org.apache.spark.examples.JavaWordCount spark-examples.jar
/tmp/word-count-input.txt

So this JavaWordCount example hangs on requesting executors, while SparkPi
and spark-shell do work.

~ Jonathan

On Wed, Sep 23, 2015 at 6:22 PM, Jonathan Kelly <jonathaka...@gmail.com>
wrote:

> Thanks for the quick response!
>
> spark-shell is indeed using yarn-client. I forgot to mention that I also
> have "spark.master yarn-client" in my spark-defaults.conf file too.
>
> The working spark-shell and my non-working example application both
> display spark.scheduler.mode=FIFO on the Spark UI. Is that what you are
> asking about? I haven't actually messed around with different scheduler
> modes yet.
>
> One more thing I should mention is that the YARN ResourceManager tells me
> the following on my 5-node cluster, with one node being the master and not
> running a NodeManager:
> Memory Used: 1.50 GB (this is the running ApplicationMaster that's waiting
> and waiting for the executors to start up)
> Memory Total: 45 GB (11.25 from each of the 4 slave nodes)
> VCores Used: 1
> VCores Total: 32
> Active Nodes: 4
>
> ~ Jonathan
>
> On Wed, Sep 23, 2015 at 6:10 PM, Andrew Duffy <andrewedu...@gmail.com>
> wrote:
>
>> What pool is the spark shell being put into? (You can see this through
>> the YARN UI under scheduler)
>>
>> Are you certain you're starting spark-shell up on YARN? By default it
>> uses a local spark executor, so if it "just works" then it's because it's
>> not using dynamic allocation.
>>
>>
>> On Wed, Sep 23, 2015 at 18:04 Jonathan Kelly <jonathaka...@gmail.com>
>> wrote:
>>
>>> I'm running into a problem with YARN dynamicAllocation on Spark 1.5.0
>>> after using it successfully on an identically configured cluster with Spark
>>> 1.4.1.
>>>
>>> I'm getting the dreaded warning "YarnClusterScheduler: Initial job has
>>> not accepted any resources; check your cluster UI to ensure that workers
>>> are registered and have sufficient resources", though there's nothing else
>>> running on my cluster, and the nodes should have plenty of resources to run
>>> my application.
>>>
>>> Here are the applicable properties in spark-defaults.conf:
>>> spark.dynamicAllocation.enabled  true
>>> spark.dynamicAllocation.minExecutors 1
>>> spark.shuffle.service.enabled true
>>>
>>> When trying out my example application (just the JavaWordCount example
>>> that comes with Spark), I had not actually set spark.executor.memory or any
>>> CPU core-related properties, but setting the spark.executor.memory to a low
>>> value like 64m doesn't help either.
>>>
>>> I've tried a 5-node cluster and 1-node cluster of m3.xlarges, so each
>>> node has 15.0GB and 4 cores.
>>>
>>> I've also tried both yarn-cluster and yarn-client mode and get the same
>>> behavior for both, except that for yarn-client mode the application never
>>> even shows up in the YARN ResourceManager. However, spark-shell seems to
>>> work just fine (when I run commands, it starts up executors dynamically
>>> just fine), which makes no sense to me.
>>>
>>> What settings/logs should I look at to debug this, and what more
>>> information can I provide? Your help would be very much appreciated!
>>>
>>> Thanks,
>>> Jonathan
>>>
>>
>


Spark 1.5.0 on YARN dynamicAllocation - Initial job has not accepted any resources

2015-09-23 Thread Jonathan Kelly
I'm running into a problem with YARN dynamicAllocation on Spark 1.5.0 after
using it successfully on an identically configured cluster with Spark 1.4.1.

I'm getting the dreaded warning "YarnClusterScheduler: Initial job has not
accepted any resources; check your cluster UI to ensure that workers are
registered and have sufficient resources", though there's nothing else
running on my cluster, and the nodes should have plenty of resources to run
my application.

Here are the applicable properties in spark-defaults.conf:
spark.dynamicAllocation.enabled  true
spark.dynamicAllocation.minExecutors 1
spark.shuffle.service.enabled true

When trying out my example application (just the JavaWordCount example that
comes with Spark), I had not actually set spark.executor.memory or any CPU
core-related properties, but setting the spark.executor.memory to a low
value like 64m doesn't help either.

I've tried a 5-node cluster and 1-node cluster of m3.xlarges, so each node
has 15.0GB and 4 cores.

I've also tried both yarn-cluster and yarn-client mode and get the same
behavior for both, except that for yarn-client mode the application never
even shows up in the YARN ResourceManager. However, spark-shell seems to
work just fine (when I run commands, it starts up executors dynamically
just fine), which makes no sense to me.

What settings/logs should I look at to debug this, and what more
information can I provide? Your help would be very much appreciated!

Thanks,
Jonathan