Re: Upgrading Spark in EC2 clusters

2015-11-12 Thread Nicholas Chammas
spark-ec2 does not offer a way to upgrade an existing cluster, and from
what I gather, it wasn't intended to be used to manage long-lasting
infrastructure. The recommended approach really is to just destroy your
existing cluster and launch a new one with the desired configuration.
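
For what it's worth, a minimal sketch of that destroy-and-relaunch cycle looks
something like the following (the key pair, identity file, and cluster name are
placeholders, and it assumes the copy of spark-ec2 you're running knows about
the target Spark version):

  # Tear down the old cluster. Note this discards anything stored on the
  # cluster's local disks, so copy out any data you need first.
  ./spark-ec2 -k my-keypair -i my-keypair.pem destroy my-cluster

  # Launch a replacement cluster at the desired Spark version.
  ./spark-ec2 -k my-keypair -i my-keypair.pem \
    --spark-version=1.5.2 launch my-cluster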

If you want to upgrade the cluster in place, you'll probably have to do
that manually. Otherwise, perhaps spark-ec2 is not the right tool, and
instead you want one of those "grown-up" management tools like Ansible
which can be set up to allow in-place upgrades. That'll take a bit of work,
though.

Nick

On Wed, Nov 11, 2015 at 6:01 PM Augustus Hong 
wrote:

> Hey All,
>
> I have a Spark cluster(running version 1.5.0) on EC2 launched with the
> provided spark-ec2 scripts. If I want to upgrade Spark to 1.5.2 in the same
> cluster, what's the safest / recommended way to do that?
>
>
> I know I can spin up a new cluster running 1.5.2, but it doesn't seem
> efficient to spin up a new cluster every time we need to upgrade.
>
>
> Thanks,
> Augustus
>
>
>
>
>
> --
> [image: Branch Metrics mobile deep linking] * Augustus
> Hong*
>  Data Analytics | Branch Metrics
>  m 650-391-3369 | e augus...@branch.io
>


Re: A proposal for Spark 2.0

2015-11-12 Thread Nicholas Chammas
With regards to Machine learning, it would be great to move useful features
from MLlib to ML and deprecate the former. Current structure of two
separate machine learning packages seems to be somewhat confusing.

With regards to GraphX, it would be great to deprecate the use of RDD in
GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.

On that note of deprecating stuff, it might be good to deprecate some
things in 2.0 without removing or replacing them immediately. That way 2.0
doesn’t have to wait for everything that we want to deprecate to be
replaced all at once.

Nick
​

On Thu, Nov 12, 2015 at 12:45 PM Ulanov, Alexander 
wrote:

> Parameter Server is a new feature and thus does not match the goal of 2.0,
> which is “to fix things that are broken in the current API and remove certain
> deprecated APIs”. At the same time I would be happy to have that feature.
>
>
>
> With regards to Machine learning, it would be great to move useful
> features from MLlib to ML and deprecate the former. Current structure of
> two separate machine learning packages seems to be somewhat confusing.
>
> With regards to GraphX, it would be great to deprecate the use of RDD in
> GraphX and switch to DataFrame. This will allow GraphX to evolve with Tungsten.
>
>
>
> Best regards, Alexander
>
>
>
> *From:* Nan Zhu [mailto:zhunanmcg...@gmail.com]
> *Sent:* Thursday, November 12, 2015 7:28 AM
> *To:* wi...@qq.com
> *Cc:* dev@spark.apache.org
> *Subject:* Re: A proposal for Spark 2.0
>
>
>
> Being specific to Parameter Server, I think the current agreement is that
> PS shall exist as a third-party library instead of a component of the core
> code base, isn’t it?
>
>
>
> Best,
>
>
>
> --
>
> Nan Zhu
>
> http://codingcat.me
>
>
>
> On Thursday, November 12, 2015 at 9:49 AM, wi...@qq.com wrote:
>
> Who has ideas about machine learning? Spark is missing some features for
> machine learning, for example, the parameter server.
>
>
>
>
>
> On Nov 12, 2015, at 05:32, Matei Zaharia  wrote:
>
>
>
> I like the idea of popping out Tachyon to an optional component too to
> reduce the number of dependencies. In the future, it might even be useful
> to do this for Hadoop, but it requires too many API changes to be worth
> doing now.
>
>
>
> Regarding Scala 2.12, we should definitely support it eventually, but I
> don't think we need to block 2.0 on that because it can be added later too.
> Has anyone investigated what it would take to run on there? I imagine we
> don't need many code changes, just maybe some REPL stuff.
>
>
>
> Needless to say, but I'm all for the idea of making "major" releases as
> undisruptive as possible in the model Reynold proposed. Keeping everyone
> working with the same set of releases is super important.
>
>
>
> Matei
>
>
>
> On Nov 11, 2015, at 4:58 AM, Sean Owen  wrote:
>
>
>
> On Wed, Nov 11, 2015 at 12:10 AM, Reynold Xin  wrote:
>
> to the Spark community. A major release should not be very different from a
>
> minor release and should not be gated based on new features. The main
>
> purpose of a major release is an opportunity to fix things that are broken
>
> in the current API and remove certain deprecated APIs (examples follow).
>
>
>
> Agree with this stance. Generally, a major release might also be a
>
> time to replace some big old API or implementation with a new one, but
>
> I don't see obvious candidates.
>
>
>
> I wouldn't mind turning attention to 2.x sooner than later, unless
>
> there's a fairly good reason to continue adding features in 1.x to a
>
> 1.7 release. The scope as of 1.6 is already pretty darned big.
>
>
>
>
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10, but
>
> it has been end-of-life.
>
>
>
> By the time 2.x rolls around, 2.12 will be the main version, 2.11 will
>
> be quite stable, and 2.10 will have been EOL for a while. I'd propose
>
> dropping 2.10. Otherwise it's supported for 2 more years.
>
>
>
>
>
> 2. Remove Hadoop 1 support.
>
>
>
> I'd go further to drop support for <2.2 for sure (2.0 and 2.1 were
>
> sort of 'alpha' and 'beta' releases) and even <2.6.
>
>
>
> I'm sure we'll think of a number of other small things -- shading a
>
> bunch of stuff? reviewing and updating dependencies in light of
>
> simpler, more recent dependencies to support from Hadoop etc?
>
>
>
> Farming out Tachyon to a module? (I felt like someone proposed this?)
>
> Pop out any Docker stuff to another repo?
>
> Continue that same effort for EC2?
>
> Farming out some of the "external" integrations to another repo (?
>
> controversial)
>
>
>
> See also anything marked version "2+" in JIRA.
>
>
>
> -
>
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>
>
>
>

Re: A proposal for Spark 2.0

2015-11-10 Thread Nicholas Chammas
> For this reason, I would *not* propose doing major releases to break
substantial API's or perform large re-architecting that prevent users from
upgrading. Spark has always had a culture of evolving architecture
incrementally and making changes - and I don't think we want to change this
model.

+1 for this. The Python community went through a lot of turmoil over the
Python 2 -> Python 3 transition because the upgrade process was too painful
for too long. The Spark community will benefit greatly from our explicitly
looking to avoid a similar situation.

> 3. Assembly-free distribution of Spark: don’t require building an
enormous assembly jar in order to run Spark.

Could you elaborate a bit on this? I'm not sure what an assembly-free
distribution means.

Nick

On Tue, Nov 10, 2015 at 6:11 PM Reynold Xin  wrote:

> I’m starting a new thread since the other one got intermixed with feature
> requests. Please refrain from making feature request in this thread. Not
> that we shouldn’t be adding features, but we can always add features in
> 1.7, 2.1, 2.2, ...
>
> First - I want to propose a premise for how to think about Spark 2.0 and
> major releases in Spark, based on discussion with several members of the
> community: a major release should be low overhead and minimally disruptive
> to the Spark community. A major release should not be very different from a
> minor release and should not be gated based on new features. The main
> purpose of a major release is an opportunity to fix things that are broken
> in the current API and remove certain deprecated APIs (examples follow).
>
> For this reason, I would *not* propose doing major releases to break
> substantial API's or perform large re-architecting that prevent users from
> upgrading. Spark has always had a culture of evolving architecture
> incrementally and making changes - and I don't think we want to change this
> model. In fact, we’ve released many architectural changes on the 1.X line.
>
> If the community likes the above model, then to me it seems reasonable to
> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or immediately
> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence of
> major releases every 2 years seems doable within the above model.
>
> Under this model, here is a list of example things I would propose doing
> in Spark 2.0, separated into APIs and Operation/Deployment:
>
>
> APIs
>
> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> Spark 1.x.
>
> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> applications can use Akka (SPARK-5293). We have gotten a lot of complaints
> about user applications being unable to use Akka due to Spark’s dependency
> on Akka.
>
> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
>
> 4. Better class package structure for low level developer API’s. In
> particular, we have some DeveloperApi (mostly various listener-related
> classes) added over the years. Some packages include only one or two public
> classes but a lot of private classes. A better structure is to have public
> classes isolated to a few public packages, and these public packages should
> have minimal private classes for low level developer APIs.
>
> 5. Consolidate task metric and accumulator API. Although having some
> subtle differences, these two are very similar but have completely
> different code path.
>
> 6. Possibly making Catalyst, Dataset, and DataFrame more general by moving
> them to other package(s). They are already used beyond SQL, e.g. in ML
> pipelines, and will be used by streaming also.
>
>
> Operation/Deployment
>
> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> but it has been end-of-life.
>
> 2. Remove Hadoop 1 support.
>
> 3. Assembly-free distribution of Spark: don’t require building an enormous
> assembly jar in order to run Spark.
>
>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
Thanks for sharing this, Christian.

What build of Spark are you using? If I understand correctly, if you are
using Spark built against Hadoop 2.6+ then additional configs alone won't
help because additional libraries also need to be installed
<https://issues.apache.org/jira/browse/SPARK-7481>.
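
If you do need S3 access with a Hadoop 2.6+ build, one hedged workaround is to
pull the missing libraries in at launch time with --packages rather than
installing them on every node. The exact artifact versions below are
assumptions and would need to match the Hadoop version your Spark was built
against:

  # Illustrative only: adds the S3A filesystem classes and the AWS SDK they
  # depend on to the driver and executor classpaths.
  spark-shell --packages org.apache.hadoop:hadoop-aws:2.6.0,com.amazonaws:aws-java-sdk:1.7.4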

Nick

On Thu, Nov 5, 2015 at 11:25 AM Christian  wrote:

> We ended up reading and writing to S3 a ton in our Spark jobs.
> For this to work, we ended up having to add s3a, and s3 key/secret pairs.
> We also had to add fs.hdfs.impl to get these things to work.
>
> I thought maybe I'd share what we did and it might be worth adding these
> to the spark conf for out of the box functionality with S3.
>
> We created:
> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>
> We changed the contents from the original, adding in the following:
>
>   <property>
>     <name>fs.file.impl</name>
>     <value>org.apache.hadoop.fs.LocalFileSystem</value>
>   </property>
>
>   <property>
>     <name>fs.hdfs.impl</name>
>     <value>org.apache.hadoop.hdfs.DistributedFileSystem</value>
>   </property>
>
>   <property>
>     <name>fs.s3.impl</name>
>     <value>org.apache.hadoop.fs.s3native.NativeS3FileSystem</value>
>   </property>
>
>   <property>
>     <name>fs.s3.awsAccessKeyId</name>
>     <value>{{aws_access_key_id}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3.awsSecretAccessKey</name>
>     <value>{{aws_secret_access_key}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3n.awsAccessKeyId</name>
>     <value>{{aws_access_key_id}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3n.awsSecretAccessKey</name>
>     <value>{{aws_secret_access_key}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3a.awsAccessKeyId</name>
>     <value>{{aws_access_key_id}}</value>
>   </property>
>
>   <property>
>     <name>fs.s3a.awsSecretAccessKey</name>
>     <value>{{aws_secret_access_key}}</value>
>   </property>
>
> This change makes spark on ec2 work out of the box for us. It took us
> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
> version 2.
>
> Best Regards,
> Christian
>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
> I am using both 1.4.1 and 1.5.1.

That's the Spark version. I'm wondering what version of Hadoop your Spark
is built against.

For example, when you download Spark
<http://spark.apache.org/downloads.html> you have to select from a number
of packages (under "Choose a package type"), and each is built against a
different version of Hadoop. When Spark is built against Hadoop 2.6+, from
my understanding, you need to install additional libraries
<https://issues.apache.org/jira/browse/SPARK-7481> to access S3. When Spark
is built against Hadoop 2.4 or earlier, you don't need to do this.

I'm confirming that this is what is happening in your case.

Nick

On Thu, Nov 5, 2015 at 12:17 PM Christian <engr...@gmail.com> wrote:

> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of the
> new feature for instance-profile which greatly helps with this as well.
> Without the instance-profile, we got it working by copying a
> .aws/credentials file up to each node. We could easily automate that
> through the templates.
>
> I don't need any additional libraries. We just need to change the
> core-site.xml
>
> -Christian
>
> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for sharing this, Christian.
>>
>> What build of Spark are you using? If I understand correctly, if you are
>> using Spark built against Hadoop 2.6+ then additional configs alone won't
>> help because additional libraries also need to be installed
>> <https://issues.apache.org/jira/browse/SPARK-7481>.
>>
>> Nick
>>
>> On Thu, Nov 5, 2015 at 11:25 AM Christian <engr...@gmail.com> wrote:
>>
>>> We ended up reading and writing to S3 a ton in our Spark jobs.
>>> For this to work, we ended up having to add s3a, and s3 key/secret
>>> pairs. We also had to add fs.hdfs.impl to get these things to work.
>>>
>>> I thought maybe I'd share what we did and it might be worth adding these
>>> to the spark conf for out of the box functionality with S3.
>>>
>>> We created:
>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>
> >>> We changed the contents from the original, adding in the following:
>>>
>>>   
>>> fs.file.impl
>>> org.apache.hadoop.fs.LocalFileSystem
>>>   
>>>
>>>   
>>> fs.hdfs.impl
>>> org.apache.hadoop.hdfs.DistributedFileSystem
>>>   
>>>
>>>   
>>> fs.s3.impl
>>> org.apache.hadoop.fs.s3native.NativeS3FileSystem
>>>   
>>>
>>>   
>>> fs.s3.awsAccessKeyId
>>> {{aws_access_key_id}}
>>>   
>>>
>>>   
>>> fs.s3.awsSecretAccessKey
>>> {{aws_secret_access_key}}
>>>   
>>>
>>>   
>>> fs.s3n.awsAccessKeyId
>>> {{aws_access_key_id}}
>>>   
>>>
>>>   
>>> fs.s3n.awsSecretAccessKey
>>> {{aws_secret_access_key}}
>>>   
>>>
>>>   
>>> fs.s3a.awsAccessKeyId
>>> {{aws_access_key_id}}
>>>   
>>>
>>>   
>>> fs.s3a.awsSecretAccessKey
>>> {{aws_secret_access_key}}
>>>   
>>>
>>> This change makes spark on ec2 work out of the box for us. It took us
>>> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
>>> version 2.
>>>
>>> Best Regards,
>>> Christian
>>>
>>
>


Re: Spark EC2 script on Large clusters

2015-11-05 Thread Nicholas Chammas
Yeah, as Shivaram mentioned, this issue is well-known. It's documented in
SPARK-5189 <https://issues.apache.org/jira/browse/SPARK-5189> and a bunch
of related issues. Unfortunately, it's hard to resolve this issue in
spark-ec2 without rewriting large parts of the project. But if you take a
crack at it and succeed I'm sure a lot of people will be happy.

I've started a separate project <https://github.com/nchammas/flintrock> --
which Shivaram also mentioned -- which aims to solve the problem of long
launch times and other issues with spark-ec2. It's
still very young and lacks several critical features, but we are making
steady progress.

Nick

On Thu, Nov 5, 2015 at 12:30 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> It is a known limitation that spark-ec2 is very slow for large
> clusters and as you mention most of this is due to the use of rsync to
> transfer things from the master to all the slaves.
>
> Nick cc'd has been working on an alternative approach at
> https://github.com/nchammas/flintrock that is more scalable.
>
> Thanks
> Shivaram
>
> On Thu, Nov 5, 2015 at 8:12 AM, Christian  wrote:
> > For starters, thanks for the awesome product!
> >
> > When creating ec2-clusters of 20-40 nodes, things work great. When we
> create
> > a cluster with the provided spark-ec2 script, it takes hours. When
> creating
> > a 200 node cluster, it takes 2 1/2 hours and for a 500 node cluster it
> takes
> > over 5 hours. One other problem we are having is that some nodes don't
> come
> > up when the other ones do, the process seems to just move on, skipping
> the
> > rsync and any installs on those ones.
> >
> > My guess as to why it takes so long to set up a large cluster is because
> of
> > the use of rsync. What if instead of using rsync, you synched to s3 and
> then
> > did a pdsh to pull it down on all of the machines. This is a big deal
> for us
> > and if we can come up with a good plan, we might be able help out with
> the
> > required changes.
> >
> > Are there any suggestions on how to deal with some of the nodes not being
> > ready when the process starts?
> >
> > Thanks for your time,
> > Christian
> >
>


Re: [VOTE] Release Apache Spark 1.5.2 (RC2)

2015-11-05 Thread Nicholas Chammas
-0

The spark-ec2 version is still set to 1.5.1.
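
For anyone who wants to double-check a candidate for this, something like the
following against the unpacked source should show it; the variable names are my
recollection of what ec2/spark_ec2.py uses, so adjust as needed:

  grep -n 'SPARK_EC2_VERSION\|DEFAULT_SPARK_VERSION' ec2/spark_ec2.py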

Nick

On Wed, Nov 4, 2015 at 8:20 PM Egor Pahomov  wrote:

> +1
>
> Things, which our infrastructure use and I checked:
>
> Dynamic allocation
> Spark ODBC server
> Reading json
> Writing parquet
> SQL quires (hive context)
> Running on CDH
>
>
> 2015-11-04 9:03 GMT-08:00 Sean Owen :
>
>> As usual the signatures and licenses and so on look fine. I continue
>> to get the same test failures on Ubuntu in Java 7/8:
>>
>> - Unpersisting HttpBroadcast on executors only in distributed mode ***
>> FAILED ***
>>
>> But I continue to assume that's specific to tests and/or Ubuntu and/or
>> the build profile, since I don't see any evidence of this in other
>> builds on Jenkins. It's not a change from previous behavior, though it
>> doesn't always happen either.
>>
>> On Tue, Nov 3, 2015 at 11:22 PM, Reynold Xin  wrote:
>> > Please vote on releasing the following candidate as Apache Spark version
>> > 1.5.2. The vote is open until Sat Nov 7, 2015 at 00:00 UTC and passes
>> if a
>> > majority of at least 3 +1 PMC votes are cast.
>> >
>> > [ ] +1 Release this package as Apache Spark 1.5.2
>> > [ ] -1 Do not release this package because ...
>> >
>> >
>> > The release fixes 59 known issues in Spark 1.5.1, listed here:
>> > http://s.apache.org/spark-1.5.2
>> >
>> > The tag to be voted on is v1.5.2-rc2:
>> > https://github.com/apache/spark/releases/tag/v1.5.2-rc2
>> >
>> > The release files, including signatures, digests, etc. can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-bin/
>> >
>> > Release artifacts are signed with the following key:
>> > https://people.apache.org/keys/committer/pwendell.asc
>> >
>> > The staging repository for this release can be found at:
>> > - as version 1.5.2-rc2:
>> > https://repository.apache.org/content/repositories/orgapachespark-1153
>> > - as version 1.5.2:
>> > https://repository.apache.org/content/repositories/orgapachespark-1152
>> >
>> > The documentation corresponding to this release can be found at:
>> > http://people.apache.org/~pwendell/spark-releases/spark-1.5.2-rc2-docs/
>> >
>> >
>> > ===
>> > How can I help test this release?
>> > ===
>> > If you are a Spark user, you can help us test this release by taking an
>> > existing Spark workload and running on this release candidate, then
>> > reporting any regressions.
>> >
>> > 
>> > What justifies a -1 vote for this release?
>> > 
>> > -1 vote should occur for regressions from Spark 1.5.1. Bugs already
>> present
>> > in 1.5.1 will not block this release.
>> >
>> >
>>
>> -
>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>> For additional commands, e-mail: dev-h...@spark.apache.org
>>
>>
>
>
> --
>
> *Sincerely yours, Egor Pakhomov*
>
> *AnchorFree*
>
>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
I might be mistaken, but yes, even with the changes you mentioned you will
not be able to access S3 if Spark is built against Hadoop 2.6+ unless you
install additional libraries. The issue is explained in SPARK-7481
<https://issues.apache.org/jira/browse/SPARK-7481> and SPARK-7442
<https://issues.apache.org/jira/browse/SPARK-7442>.

On Fri, Nov 6, 2015 at 12:22 AM Christian <engr...@gmail.com> wrote:

> Even with the changes I mentioned above?
> On Thu, Nov 5, 2015 at 8:10 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you
>> cannot access S3, unfortunately.
>>
>> On Thu, Nov 5, 2015 at 3:53 PM Christian <engr...@gmail.com> wrote:
>>
>>> I created the cluster with the following:
>>>
>>> --hadoop-major-version=2
>>> --spark-version=1.4.1
>>>
>>> from: spark-1.5.1-bin-hadoop1
>>>
>>> Are you saying there might be different behavior if I download
>>> spark-1.5.1-hadoop-2.6 and create my cluster?
>>>
>>> On Thu, Nov 5, 2015 at 1:28 PM, Christian <engr...@gmail.com> wrote:
>>>
>>>> Spark 1.5.1-hadoop1
>>>>
>>>> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> > I am using both 1.4.1 and 1.5.1.
>>>>>
>>>>> That's the Spark version. I'm wondering what version of Hadoop your
>>>>> Spark is built against.
>>>>>
>>>>> For example, when you download Spark
>>>>> <http://spark.apache.org/downloads.html> you have to select from a
>>>>> number of packages (under "Choose a package type"), and each is built
>>>>> against a different version of Hadoop. When Spark is built against Hadoop
>>>>> 2.6+, from my understanding, you need to install additional libraries
>>>>> <https://issues.apache.org/jira/browse/SPARK-7481> to access S3. When
>>>>> Spark is built against Hadoop 2.4 or earlier, you don't need to do this.
>>>>>
>>>>> I'm confirming that this is what is happening in your case.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Thu, Nov 5, 2015 at 12:17 PM Christian <engr...@gmail.com> wrote:
>>>>>
>>>>>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of
>>>>>> the new feature for instance-profile which greatly helps with this as 
>>>>>> well.
>>>>>> Without the instance-profile, we got it working by copying a
>>>>>> .aws/credentials file up to each node. We could easily automate that
>>>>>> through the templates.
>>>>>>
>>>>>> I don't need any additional libraries. We just need to change the
>>>>>> core-site.xml
>>>>>>
>>>>>> -Christian
>>>>>>
>>>>>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>
>>>>>>> Thanks for sharing this, Christian.
>>>>>>>
>>>>>>> What build of Spark are you using? If I understand correctly, if you
>>>>>>> are using Spark built against Hadoop 2.6+ then additional configs alone
>>>>>>> won't help because additional libraries also need to be installed
>>>>>>> <https://issues.apache.org/jira/browse/SPARK-7481>.
>>>>>>>
>>>>>>> Nick
>>>>>>>
>>>>>>> On Thu, Nov 5, 2015 at 11:25 AM Christian <engr...@gmail.com> wrote:
>>>>>>>
>>>>>>>> We ended up reading and writing to S3 a ton in our Spark jobs.
>>>>>>>> For this to work, we ended up having to add s3a, and s3 key/secret
>>>>>>>> pairs. We also had to add fs.hdfs.impl to get these things to work.
>>>>>>>>
>>>>>>>> I thought maybe I'd share what we did and it might be worth adding
>>>>>>>> these to the spark conf for out of the box functionality with S3.
>>>>>>>>
>>>>>>>> We created:
>>>>>>>>
>>>>>>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>>>>>>
> >>>>>>>> We changed the contents from the original, adding in the following:
>>>>>>>>
>>>>>>>>   
>>>>>>>> fs.file.impl
>>>>>>>> org.apache.hadoop.fs.LocalFileSystem
>>>>>>>>   
>>>>>>>>
>>>>>>>>   
>>>>>>>> fs.hdfs.impl
>>>>>>>> org.apache.hadoop.hdfs.DistributedFileSystem
>>>>>>>>   
>>>>>>>>
>>>>>>>>   
>>>>>>>> fs.s3.impl
>>>>>>>> org.apache.hadoop.fs.s3native.NativeS3FileSystem
>>>>>>>>   
>>>>>>>>
>>>>>>>>   
>>>>>>>> fs.s3.awsAccessKeyId
>>>>>>>> {{aws_access_key_id}}
>>>>>>>>   
>>>>>>>>
>>>>>>>>   
>>>>>>>> fs.s3.awsSecretAccessKey
>>>>>>>> {{aws_secret_access_key}}
>>>>>>>>   
>>>>>>>>
>>>>>>>>   
>>>>>>>> fs.s3n.awsAccessKeyId
>>>>>>>> {{aws_access_key_id}}
>>>>>>>>   
>>>>>>>>
>>>>>>>>   
>>>>>>>> fs.s3n.awsSecretAccessKey
>>>>>>>> {{aws_secret_access_key}}
>>>>>>>>   
>>>>>>>>
>>>>>>>>   
>>>>>>>> fs.s3a.awsAccessKeyId
>>>>>>>> {{aws_access_key_id}}
>>>>>>>>   
>>>>>>>>
>>>>>>>>   
>>>>>>>> fs.s3a.awsSecretAccessKey
>>>>>>>> {{aws_secret_access_key}}
>>>>>>>>   
>>>>>>>>
>>>>>>>> This change makes spark on ec2 work out of the box for us. It took
>>>>>>>> us several days to figure this out. It works for 1.4.1 and 1.5.1 on 
>>>>>>>> Hadoop
>>>>>>>> version 2.
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Christian
>>>>>>>>
>>>>>>>
>>>>>>
>>>>
>>>


Re: Recommended change to core-site.xml template

2015-11-05 Thread Nicholas Chammas
Yep, I think if you try spark-1.5.1-hadoop-2.6 you will find that you
cannot access S3, unfortunately.

On Thu, Nov 5, 2015 at 3:53 PM Christian <engr...@gmail.com> wrote:

> I created the cluster with the following:
>
> --hadoop-major-version=2
> --spark-version=1.4.1
>
> from: spark-1.5.1-bin-hadoop1
>
> Are you saying there might be different behavior if I download
> spark-1.5.1-hadoop-2.6 and create my cluster?
>
> On Thu, Nov 5, 2015 at 1:28 PM, Christian <engr...@gmail.com> wrote:
>
>> Spark 1.5.1-hadoop1
>>
>> On Thu, Nov 5, 2015 at 10:28 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> > I am using both 1.4.1 and 1.5.1.
>>>
>>> That's the Spark version. I'm wondering what version of Hadoop your
>>> Spark is built against.
>>>
>>> For example, when you download Spark
>>> <http://spark.apache.org/downloads.html> you have to select from a
>>> number of packages (under "Choose a package type"), and each is built
>>> against a different version of Hadoop. When Spark is built against Hadoop
>>> 2.6+, from my understanding, you need to install additional libraries
>>> <https://issues.apache.org/jira/browse/SPARK-7481> to access S3. When
>>> Spark is built against Hadoop 2.4 or earlier, you don't need to do this.
>>>
>>> I'm confirming that this is what is happening in your case.
>>>
>>> Nick
>>>
>>> On Thu, Nov 5, 2015 at 12:17 PM Christian <engr...@gmail.com> wrote:
>>>
>>>> I am using both 1.4.1 and 1.5.1. In the end, we used 1.5.1 because of
>>>> the new feature for instance-profile which greatly helps with this as well.
>>>> Without the instance-profile, we got it working by copying a
>>>> .aws/credentials file up to each node. We could easily automate that
>>>> through the templates.
>>>>
>>>> I don't need any additional libraries. We just need to change the
>>>> core-site.xml
>>>>
>>>> -Christian
>>>>
>>>> On Thu, Nov 5, 2015 at 9:35 AM, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> Thanks for sharing this, Christian.
>>>>>
>>>>> What build of Spark are you using? If I understand correctly, if you
>>>>> are using Spark built against Hadoop 2.6+ then additional configs alone
>>>>> won't help because additional libraries also need to be installed
>>>>> <https://issues.apache.org/jira/browse/SPARK-7481>.
>>>>>
>>>>> Nick
>>>>>
>>>>> On Thu, Nov 5, 2015 at 11:25 AM Christian <engr...@gmail.com> wrote:
>>>>>
>>>>>> We ended up reading and writing to S3 a ton in our Spark jobs.
>>>>>> For this to work, we ended up having to add s3a, and s3 key/secret
>>>>>> pairs. We also had to add fs.hdfs.impl to get these things to work.
>>>>>>
>>>>>> I thought maybe I'd share what we did and it might be worth adding
>>>>>> these to the spark conf for out of the box functionality with S3.
>>>>>>
>>>>>> We created:
>>>>>>
>>>>>> ec2/deploy.generic/root/spark-ec2/templates/root/spark/conf/core-site.xml
>>>>>>
> >>>>>> We changed the contents from the original, adding in the following:
>>>>>>
>>>>>>   
>>>>>> fs.file.impl
>>>>>> org.apache.hadoop.fs.LocalFileSystem
>>>>>>   
>>>>>>
>>>>>>   
>>>>>> fs.hdfs.impl
>>>>>> org.apache.hadoop.hdfs.DistributedFileSystem
>>>>>>   
>>>>>>
>>>>>>   
>>>>>> fs.s3.impl
>>>>>> org.apache.hadoop.fs.s3native.NativeS3FileSystem
>>>>>>   
>>>>>>
>>>>>>   
>>>>>> fs.s3.awsAccessKeyId
>>>>>> {{aws_access_key_id}}
>>>>>>   
>>>>>>
>>>>>>   
>>>>>> fs.s3.awsSecretAccessKey
>>>>>> {{aws_secret_access_key}}
>>>>>>   
>>>>>>
>>>>>>   
>>>>>> fs.s3n.awsAccessKeyId
>>>>>> {{aws_access_key_id}}
>>>>>>   
>>>>>>
>>>>>>   
>>>>>> fs.s3n.awsSecretAccessKey
>>>>>> {{aws_secret_access_key}}
>>>>>>   
>>>>>>
>>>>>>   
>>>>>> fs.s3a.awsAccessKeyId
>>>>>> {{aws_access_key_id}}
>>>>>>   
>>>>>>
>>>>>>   
>>>>>> fs.s3a.awsSecretAccessKey
>>>>>> {{aws_secret_access_key}}
>>>>>>   
>>>>>>
>>>>>> This change makes spark on ec2 work out of the box for us. It took us
>>>>>> several days to figure this out. It works for 1.4.1 and 1.5.1 on Hadoop
>>>>>> version 2.
>>>>>>
>>>>>> Best Regards,
>>>>>> Christian
>>>>>>
>>>>>
>>>>
>>
>


Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-01 Thread Nicholas Chammas
Oh, sweet! For example:

http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz?asjson=1

Thanks for sharing that tip. Looks like you can also use as_json
<https://svn.apache.org/repos/asf/infrastructure/site/trunk/content/dyn/mirrors/mirrors.cgi>
(vs. asjson).
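
A rough sketch of how that could be scripted; 'preferred' is the field Shivaram
mentioned, while 'path_info' is my guess at where the requested file path lives
in the JSON reply, so treat this as illustrative:

  # Ask the mirror resolver for JSON, pick the preferred mirror, download from it.
  url='http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz?asjson=1'
  download_url=$(curl -s "$url" | python -c 'import json,sys; d=json.load(sys.stdin); print(d["preferred"] + d["path_info"])')
  wget "$download_url"

Whichever mirror gets picked, the checksums should still be verified against
the ones published on the main ASF site, as Steve notes below.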

Nick
​

On Sun, Nov 1, 2015 at 5:32 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> On Sun, Nov 1, 2015 at 2:16 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > OK, I’ll focus on the Apache mirrors going forward.
> >
> > The problem with the Apache mirrors, if I am not mistaken, is that you
> > cannot use a single URL that automatically redirects you to a working
> mirror
> > to download Hadoop. You have to pick a specific mirror and pray it
> doesn’t
> > disappear tomorrow.
> >
> > They don’t go away, especially http://mirror.ox.ac.uk , and in the us
> the
> > apache.osuosl.org, osu being a where a lot of the ASF servers are kept.
> >
> > So does Apache offer no way to query a URL and automatically get the
> closest
> > working mirror? If I’m installing HDFS onto servers in various EC2
> regions,
> > the best mirror will vary depending on my location.
> >
> Not sure if this is officially documented somewhere but if you pass
> '?asjson=1' you will get back a JSON which has a 'preferred' field set
> to the closest mirror.
>
> Shivaram
> > Nick
> >
> >
> > On Sun, Nov 1, 2015 at 12:25 PM Shivaram Venkataraman
> > <shiva...@eecs.berkeley.edu> wrote:
> >>
> >> I think that getting them from the ASF mirrors is a better strategy in
> >> general as it'll remove the overhead of keeping the S3 bucket up to
> >> date. It works in the spark-ec2 case because we only support a limited
> >> number of Hadoop versions from the tool. FWIW I don't have write
> >> access to the bucket and also haven't heard of any plans to support
> >> newer versions in spark-ec2.
> >>
> >> Thanks
> >> Shivaram
> >>
> >> On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran <ste...@hortonworks.com>
> >> wrote:
> >> >
> >> > On 1 Nov 2015, at 03:17, Nicholas Chammas <nicholas.cham...@gmail.com
> >
> >> > wrote:
> >> >
> >> > https://s3.amazonaws.com/spark-related-packages/
> >> >
> >> > spark-ec2 uses this bucket to download and install HDFS on clusters.
> Is
> >> > it
> >> > owned by the Spark project or by the AMPLab?
> >> >
> >> > Anyway, it looks like the latest Hadoop install available on there is
> >> > Hadoop
> >> > 2.4.0.
> >> >
> >> > Are there plans to add newer versions of Hadoop for use by spark-ec2
> and
> >> > similar tools, or should we just be getting that stuff via an Apache
> >> > mirror?
> >> > The latest version is 2.7.1, by the way.
> >> >
> >> >
> >> > you should be grabbing the artifacts off the ASF and then verifying
> >> > their
> >> > SHA1 checksums as published on the ASF HTTPS web site
> >> >
> >> >
> >> > The problem with the Apache mirrors, if I am not mistaken, is that you
> >> > cannot use a single URL that automatically redirects you to a working
> >> > mirror
> >> > to download Hadoop. You have to pick a specific mirror and pray it
> >> > doesn't
> >> > disappear tomorrow.
> >> >
> >> >
> >> > They don't go away, especially http://mirror.ox.ac.uk , and in the us
> >> > the
> >> > apache.osuosl.org, osu being a where a lot of the ASF servers are
> kept.
> >> >
> >> > full list with availability stats
> >> >
> >> > http://www.apache.org/mirrors/
> >> >
> >> >
>


Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-01 Thread Nicholas Chammas
Hmm, yeah, some Googling confirms this, though there isn't any clear
documentation about this.

Strangely, if I click on the link from your email the download works, but
curl and wget somehow don't get redirected correctly...
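
Most likely I just need to tell curl to follow redirects and quote the URL so
the shell doesn't swallow the '&'; something like this (URL copied from
Shivaram's message below) should behave:

  # -L makes curl follow the redirect to the chosen mirror.
  curl -L -o hadoop-2.7.1.tar.gz \
    'http://www.apache.org/dyn/closer.lua?filename=hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz&action=download'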

Nick

On Sun, Nov 1, 2015 at 6:40 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I think the lua one at
>
> https://svn.apache.org/repos/asf/infrastructure/site/trunk/content/dyn/closer.lua
> has replaced the cgi one from before. Also it looks like the lua one
> also supports `action=download` with a filename argument. So you could
> just do something like
>
> wget
> http://www.apache.org/dyn/closer.lua?filename=hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz&action=download
>
> Thanks
> Shivaram
>
> On Sun, Nov 1, 2015 at 3:18 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > Oh, sweet! For example:
> >
> >
> http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz?asjson=1
> >
> > Thanks for sharing that tip. Looks like you can also use as_json (vs.
> > asjson).
> >
> > Nick
> >
> >
> > On Sun, Nov 1, 2015 at 5:32 PM Shivaram Venkataraman
> > <shiva...@eecs.berkeley.edu> wrote:
> >>
> >> On Sun, Nov 1, 2015 at 2:16 PM, Nicholas Chammas
> >> <nicholas.cham...@gmail.com> wrote:
> >> > OK, I’ll focus on the Apache mirrors going forward.
> >> >
> >> > The problem with the Apache mirrors, if I am not mistaken, is that you
> >> > cannot use a single URL that automatically redirects you to a working
> >> > mirror
> >> > to download Hadoop. You have to pick a specific mirror and pray it
> >> > doesn’t
> >> > disappear tomorrow.
> >> >
> >> > They don’t go away, especially http://mirror.ox.ac.uk , and in the us
> >> > the
> >> > apache.osuosl.org, osu being a where a lot of the ASF servers are
> kept.
> >> >
> >> > So does Apache offer no way to query a URL and automatically get the
> >> > closest
> >> > working mirror? If I’m installing HDFS onto servers in various EC2
> >> > regions,
> >> > the best mirror will vary depending on my location.
> >> >
> >> Not sure if this is officially documented somewhere but if you pass
> >> '?asjson=1' you will get back a JSON which has a 'preferred' field set
> >> to the closest mirror.
> >>
> >> Shivaram
> >> > Nick
> >> >
> >> >
> >> > On Sun, Nov 1, 2015 at 12:25 PM Shivaram Venkataraman
> >> > <shiva...@eecs.berkeley.edu> wrote:
> >> >>
> >> >> I think that getting them from the ASF mirrors is a better strategy
> in
> >> >> general as it'll remove the overhead of keeping the S3 bucket up to
> >> >> date. It works in the spark-ec2 case because we only support a
> limited
> >> >> number of Hadoop versions from the tool. FWIW I don't have write
> >> >> access to the bucket and also haven't heard of any plans to support
> >> >> newer versions in spark-ec2.
> >> >>
> >> >> Thanks
> >> >> Shivaram
> >> >>
> >> >> On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran <
> ste...@hortonworks.com>
> >> >> wrote:
> >> >> >
> >> >> > On 1 Nov 2015, at 03:17, Nicholas Chammas
> >> >> > <nicholas.cham...@gmail.com>
> >> >> > wrote:
> >> >> >
> >> >> > https://s3.amazonaws.com/spark-related-packages/
> >> >> >
> >> >> > spark-ec2 uses this bucket to download and install HDFS on
> clusters.
> >> >> > Is
> >> >> > it
> >> >> > owned by the Spark project or by the AMPLab?
> >> >> >
> >> >> > Anyway, it looks like the latest Hadoop install available on there
> is
> >> >> > Hadoop
> >> >> > 2.4.0.
> >> >> >
> >> >> > Are there plans to add newer versions of Hadoop for use by
> spark-ec2
> >> >> > and
> >> >> > similar tools, or should we just be getting that stuff via an
> Apache
> >> >> > mirror?
> >> >> > The latest version is 2.7.1, by the way.
> >> >> >
> >> >> >
> >> >> > you should be grabbing the artifacts off the ASF and then verifying
> >> >> > their
> >> >> > SHA1 checksums as published on the ASF HTTPS web site
> >> >> >
> >> >> >
> >> >> > The problem with the Apache mirrors, if I am not mistaken, is that
> >> >> > you
> >> >> > cannot use a single URL that automatically redirects you to a
> working
> >> >> > mirror
> >> >> > to download Hadoop. You have to pick a specific mirror and pray it
> >> >> > doesn't
> >> >> > disappear tomorrow.
> >> >> >
> >> >> >
> >> >> > They don't go away, especially http://mirror.ox.ac.uk , and in
> the us
> >> >> > the
> >> >> > apache.osuosl.org, osu being a where a lot of the ASF servers are
> >> >> > kept.
> >> >> >
> >> >> > full list with availability stats
> >> >> >
> >> >> > http://www.apache.org/mirrors/
> >> >> >
> >> >> >
>


Re: Downloading Hadoop from s3://spark-related-packages/

2015-11-01 Thread Nicholas Chammas
OK, I’ll focus on the Apache mirrors going forward.

The problem with the Apache mirrors, if I am not mistaken, is that you
cannot use a single URL that automatically redirects you to a working
mirror to download Hadoop. You have to pick a specific mirror and pray it
doesn’t disappear tomorrow.

They don’t go away, especially http://mirror.ox.ac.uk , and in the us the
apache.osuosl.org, osu being a where a lot of the ASF servers are kept.

So does Apache offer no way to query a URL and automatically get the
closest working mirror? If I’m installing HDFS onto servers in various EC2
regions, the best mirror will vary depending on my location.

Nick
​

On Sun, Nov 1, 2015 at 12:25 PM Shivaram Venkataraman <
shiva...@eecs.berkeley.edu> wrote:

> I think that getting them from the ASF mirrors is a better strategy in
> general as it'll remove the overhead of keeping the S3 bucket up to
> date. It works in the spark-ec2 case because we only support a limited
> number of Hadoop versions from the tool. FWIW I don't have write
> access to the bucket and also haven't heard of any plans to support
> newer versions in spark-ec2.
>
> Thanks
> Shivaram
>
> On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran <ste...@hortonworks.com>
> wrote:
> >
> > On 1 Nov 2015, at 03:17, Nicholas Chammas <nicholas.cham...@gmail.com>
> > wrote:
> >
> > https://s3.amazonaws.com/spark-related-packages/
> >
> > spark-ec2 uses this bucket to download and install HDFS on clusters. Is
> it
> > owned by the Spark project or by the AMPLab?
> >
> > Anyway, it looks like the latest Hadoop install available on there is
> Hadoop
> > 2.4.0.
> >
> > Are there plans to add newer versions of Hadoop for use by spark-ec2 and
> > similar tools, or should we just be getting that stuff via an Apache
> mirror?
> > The latest version is 2.7.1, by the way.
> >
> >
> > you should be grabbing the artifacts off the ASF and then verifying their
> > SHA1 checksums as published on the ASF HTTPS web site
> >
> >
> > The problem with the Apache mirrors, if I am not mistaken, is that you
> > cannot use a single URL that automatically redirects you to a working
> mirror
> > to download Hadoop. You have to pick a specific mirror and pray it
> doesn't
> > disappear tomorrow.
> >
> >
> > They don't go away, especially http://mirror.ox.ac.uk , and in the us
> the
> > apache.osuosl.org, osu being a where a lot of the ASF servers are kept.
> >
> > full list with availability stats
> >
> > http://www.apache.org/mirrors/
> >
> >
>


Downloading Hadoop from s3://spark-related-packages/

2015-10-31 Thread Nicholas Chammas
https://s3.amazonaws.com/spark-related-packages/

spark-ec2 uses this bucket to download and install HDFS on clusters. Is it
owned by the Spark project or by the AMPLab?

Anyway, it looks like the latest Hadoop install available on there is
Hadoop 2.4.0.

Are there plans to add newer versions of Hadoop for use by spark-ec2 and
similar tools, or should we just be getting that stuff via an Apache mirror
? The latest version is 2.7.1, by
the way.

The problem with the Apache mirrors, if I am not mistaken, is that you
cannot use a single URL that automatically redirects you to a working
mirror to download Hadoop. You have to pick a specific mirror and pray it
doesn't disappear tomorrow.

Nick


Re: Sorry, but Nabble and ML suck

2015-10-31 Thread Nicholas Chammas
Nabble is an unofficial archive of this mailing list. I don't know who runs
it, but it's not Apache. There are often delays between when things get
posted to the list and updated on Nabble, and sometimes things never make
it over for whatever reason.

This mailing list is, I agree, very 1980s. Unfortunately, it's required by
the Apache Software Foundation (ASF).

There was a discussion earlier this year about
migrating to Discourse that explained why we're stuck with what we have for
now. Ironically, that discussion is hard to follow on the Apache archives
(which is precisely one of the motivations for proposing to migrate to
Discourse), but there is a more readable archive on another unofficial site.

Nick

On Sat, Oct 31, 2015 at 12:20 PM Martin Senne 
wrote:

> Having written a post on last Tuesday, I'm still not able to see my post
> under nabble. And yeah, subscription to u...@apache.spark.org was
> successful (rechecked a minute ago)
>
> Even more, I have no way (and no confirmation) that my post was accepted,
> rejected, whatever.
>
> This is very L4M3 and so 80ies.
>
> Any help appreciated. Thx!
>


[jira] [Commented] (SPARK-3342) m3 instances don't get local SSDs

2015-10-26 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14974660#comment-14974660
 ] 

Nicholas Chammas commented on SPARK-3342:
-

FWIW, that statement on M3 instances is [no longer 
there|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/InstanceStorage.html], 
so we should be able to drop [this 
logic|https://github.com/apache/spark/blob/07ced43424447699e47106de9ca2fa714756bdeb/ec2/spark_ec2.py#L588-L595]
 in spark-ec2. cc [~shivaram]

> m3 instances don't get local SSDs
> -
>
> Key: SPARK-3342
> URL: https://issues.apache.org/jira/browse/SPARK-3342
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.0.2
>Reporter: Matei Zaharia
>Assignee: Daniel Darabos
> Fix For: 1.1.0
>
>
> As discussed on https://github.com/apache/spark/pull/2081, these instances 
> ignore the block device mapping on the AMI and require ephemeral drives to be 
> added programmatically when launching them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10002) SSH problem during Setup of Spark(1.3.0) cluster on EC2

2015-10-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14969814#comment-14969814
 ] 

Nicholas Chammas commented on SPARK-10002:
--

[~deepalib] - Is {{--private-ips}} the solution, as the previous commenter 
suggested?
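
If so, the launch command from the report below would just gain that flag, e.g.
(arguments copied from the report; whether {{--private-ips}} actually resolves
the hostname issue is exactly what I'm asking):

  ./spark-ec2 --key-pair=deepali-ec2-keypair \
    --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem \
    --region=us-west-2 --zone=us-west-2b \
    --vpc-id=vpc-03d67b66 --subnet-id=subnet-72fd5905 \
    --private-ips launch deepali-spark-nodocker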

> SSH problem during Setup of Spark(1.3.0) cluster on EC2
> ---
>
> Key: SPARK-10002
> URL: https://issues.apache.org/jira/browse/SPARK-10002
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Affects Versions: 1.3.0
> Environment: EC2, SPARK 1.3.0 cluster setup in vpc/subnet.
>Reporter: Deepali Bhandari
>
> Steps to start a Spark cluster with EC2 scripts
> 1. I created an ec2 instance in the vpc, and subnet. Amazon Linux 
> 2. I dowloaded spark-1.3.0
> 3. chmod 400 key file
> 4. Export aws access and secret keys
> 5. Now ran the command
>  ./spark-ec2 --key-pair=deepali-ec2-keypair 
> --identity-file=/home/ec2-user/Spark/deepali-ec2-keypair.pem 
> --region=us-west-2 --zone=us-west-2b --vpc-id=vpc-03d67b66 
> --subnet-id=subnet-72fd5905 --resume launch deepali-spark-nodocker
>  6. The master and slave instances are created but cannot ssh says host not 
> resolved.
>  7. I can ping the master and slave, I can ssh from the command line, but not 
> from the ec2 scripts. 
>  8. I have spent more than 2 days now. But no luck yet.
>  9. The ec2 scripts dont work .. code has a bug in referencing the cluster 
> nodes via the wrong hostnames 
>  
> SCREEN CONSOLE log
>  ./spark-ec2 --key-pair=deepali-ec2-keypair --identity-file=/home 
>   
>  
> /ec2-user/Spark/deepali-ec2-keypair.pem --region=us-west-2 --zone=us-west-2b 
> --vpc-id=vpc-03d67b6  
>   
> 6 --subnet-id=subnet-72fd5905 launch deepali-spark-nodocker
> Downloading Boto from PyPi
> Finished downloading Boto
> Setting up security groups...
> Creating security group deepali-spark-nodocker-master
> Creating security group deepali-spark-nodocker-slaves
> Searching for existing cluster deepali-spark-nodocker...
> Spark AMI: ami-9a6e0daa
> Launching instances...
> Launched 1 slaves in us-west-2b, regid = r-0d2088fb
> Launched master in us-west-2b, regid = r-312088c7
> Waiting for AWS to propagate instance metadata...
> Waiting for cluster to enter 'ssh-ready' state...
> Warning: SSH connection error. (This could be temporary.)
> Host: None
> SSH return code: 255
> SSH output: ssh: Could not resolve hostname None: Name or service not known



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Can we add an unsubscribe link in the footer of every email?

2015-10-21 Thread Nicholas Chammas
Every week or so someone emails the list asking to unsubscribe.

Of course, that's not the right way to do it. You're supposed to email a
different address than this one to unsubscribe, yet this is not in-your-face
obvious, so many people miss it.
And someone steps up almost every time to point people in the right
direction.

The vast majority of mailing lists I'm familiar with include a small footer
at the bottom of each email with a link to unsubscribe. I think this is
what most people expect, and it's where they check first.

Can we add a footer like that?

I think it would cut down on the weekly emails from people wanting to
unsubscribe, and it would match existing mailing list conventions elsewhere.

Nick


Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-16 Thread Nicholas Chammas
JB,

I am using spark-env.sh to define the master address instead of using
spark-defaults.conf.

I understand that that should work, and indeed it does, but only if
SPARK_MASTER_IP is set to a DNS name and not an IP address.

Perhaps I'm misunderstanding these configuration methods...
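
For concreteness, the relevant lines in my conf/spark-env.sh look roughly like
this (the hostname is a redacted placeholder):

  # Works: a resolvable DNS name.
  export SPARK_MASTER_IP=ec2-54-210-XX-XX.compute-1.amazonaws.com
  # Fails with the NettyTransport bind error quoted below: a bare IP address.
  # export SPARK_MASTER_IP=54.210.XX.XX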

Nick


On Fri, Oct 16, 2015 at 12:05 PM Jean-Baptiste Onofré <j...@nanthrax.net>
wrote:

> Hi Nick,
>
> there's the Spark master defined in conf/spark-defaults.conf and the -h
> option that you can provide to sbin/start-master.sh script.
>
> Did you try:
>
> sbin/start-master.sh -h xxx.xxx.xxx.xxx
>
> and then use the IP when you start the slaves:
>
> sbin/start-slave.sh spark://xxx.xxx.xxx.xxx.7077
>
> ?
>
> Regards
> JB
>
> On 10/16/2015 06:01 PM, Nicholas Chammas wrote:
> > I'd look into tracing a possible bug here, but I'm not sure where to
> > look. Searching the codebase for `SPARK_MASTER_IP`, amazingly, does not
> > show it being used in any place directly by Spark
> > <https://github.com/apache/spark/search?utf8=%E2%9C%93=SPARK_MASTER_IP
> >.
> >
> > Clearly, Spark is using this environment variable (otherwise I wouldn't
> > see the behavior described in my first email), but I can't see where.
> >
> > Can someone give me a pointer?
> >
> > Nick
> >
> > On Thu, Oct 15, 2015 at 12:37 AM Ted Yu <yuzhih...@gmail.com
> > <mailto:yuzhih...@gmail.com>> wrote:
> >
> > Some old bits:
> >
> >
> http://stackoverflow.com/questions/28162991/cant-run-spark-1-2-in-standalone-mode-on-mac
> >
> http://stackoverflow.com/questions/29412157/passing-hostname-to-netty
> >
> > FYI
> >
> > On Wed, Oct 14, 2015 at 7:10 PM, Nicholas Chammas
> > <nicholas.cham...@gmail.com <mailto:nicholas.cham...@gmail.com>>
> wrote:
> >
> > I’m setting the Spark master address via the |SPARK_MASTER_IP|
> > environment variable in |spark-env.sh|, like spark-ec2 does
> > <
> https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/templates/root/spark/conf/spark-env.sh#L13
> >.
> >
> > The funny thing is that Spark seems to accept this only if the
> > value of |SPARK_MASTER_IP| is a DNS name and not an IP address.
> >
> > When I provide an IP address, I get errors in the log when
> > starting the master:
> >
> > |15/10/15 01:47:31 ERROR NettyTransport: failed to bind to
> > /54.210.XX.XX:7077, shutting down Netty transport |
> >
> > (XX is my redaction of the full IP address.)
> >
> > Am I misunderstanding something about how to use this
> > environment variable?
> >
> > The spark-env.sh template indicates that either an IP address or
> > a hostname should work
> > <
> https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/conf/spark-env.sh.template#L49
> >,
> > but my testing shows that only hostnames work.
> >
> > Nick
> >
> > ​
> >
> >
>
> --
> Jean-Baptiste Onofré
> jbono...@apache.org
> http://blog.nanthrax.net
> Talend - http://www.talend.com
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-16 Thread Nicholas Chammas
I'd look into tracing a possible bug here, but I'm not sure where to look.
Searching the codebase for `SPARK_MASTER_IP`, amazingly, does not show it
being used in any place directly by Spark
<https://github.com/apache/spark/search?utf8=%E2%9C%93=SPARK_MASTER_IP>.

Clearly, Spark is using this environment variable (otherwise I wouldn't see
the behavior described in my first email), but I can't see where.

Can someone give me a pointer?

Nick

On Thu, Oct 15, 2015 at 12:37 AM Ted Yu <yuzhih...@gmail.com> wrote:

> Some old bits:
>
>
> http://stackoverflow.com/questions/28162991/cant-run-spark-1-2-in-standalone-mode-on-mac
> http://stackoverflow.com/questions/29412157/passing-hostname-to-netty
>
> FYI
>
> On Wed, Oct 14, 2015 at 7:10 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I’m setting the Spark master address via the SPARK_MASTER_IP environment
>> variable in spark-env.sh, like spark-ec2 does
>> <https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/templates/root/spark/conf/spark-env.sh#L13>
>> .
>>
>> The funny thing is that Spark seems to accept this only if the value of
>> SPARK_MASTER_IP is a DNS name and not an IP address.
>>
>> When I provide an IP address, I get errors in the log when starting the
>> master:
>>
>> 15/10/15 01:47:31 ERROR NettyTransport: failed to bind to 
>> /54.210.XX.XX:7077, shutting down Netty transport
>>
>> (XX is my redaction of the full IP address.)
>>
>> Am I misunderstanding something about how to use this environment
>> variable?
>>
>> The spark-env.sh template indicates that either an IP address or a
>> hostname should work
>> <https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/conf/spark-env.sh.template#L49>,
>> but my testing shows that only hostnames work.
>>
>> Nick
>> ​
>>
>
>


Re: SPARK_MASTER_IP actually expects a DNS name, not IP address

2015-10-16 Thread Nicholas Chammas
Ah, my bad, I missed it
<https://github.com/apache/spark/blob/08698ee1d6f29b2c999416f18a074d5193cdacd5/sbin/start-master.sh#L58-L60>
since the GitHub search results preview only showed
<https://github.com/apache/spark/search?utf8=%E2%9C%93=SPARK_MASTER_IP>
the first hit from start-master.sh and not this part:

"$sbin"/spark-daemon.sh start org.apache.spark.deploy.master.Master 1 \
  --ip $SPARK_MASTER_IP --port $SPARK_MASTER_PORT --webui-port
$SPARK_MASTER_WEBUI_PORT \
  $ORIGINAL_ARGS

Same goes for some of the other sbin scripts.

Anyway, let’s take a closer look…

Nick
​

On Fri, Oct 16, 2015 at 12:05 PM Sean Owen <so...@cloudera.com> wrote:

> It's used in scripts like sbin/start-master.sh
>
> On Fri, Oct 16, 2015 at 5:01 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I'd look into tracing a possible bug here, but I'm not sure where to
>> look. Searching the codebase for `SPARK_MASTER_IP`, amazingly, does not
>> show it being used in any place directly by Spark
>> <https://github.com/apache/spark/search?utf8=%E2%9C%93=SPARK_MASTER_IP>
>> .
>>
>> Clearly, Spark is using this environment variable (otherwise I wouldn't
>> see the behavior described in my first email), but I can't see where.
>>
>> Can someone give me a pointer?
>>
>> Nick
>>
>> On Thu, Oct 15, 2015 at 12:37 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Some old bits:
>>>
>>>
>>> http://stackoverflow.com/questions/28162991/cant-run-spark-1-2-in-standalone-mode-on-mac
>>> http://stackoverflow.com/questions/29412157/passing-hostname-to-netty
>>>
>>> FYI
>>>
>>> On Wed, Oct 14, 2015 at 7:10 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> I’m setting the Spark master address via the SPARK_MASTER_IP
>>>> environment variable in spark-env.sh, like spark-ec2 does
>>>> <https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/templates/root/spark/conf/spark-env.sh#L13>
>>>> .
>>>>
>>>> The funny thing is that Spark seems to accept this only if the value of
>>>> SPARK_MASTER_IP is a DNS name and not an IP address.
>>>>
>>>> When I provide an IP address, I get errors in the log when starting the
>>>> master:
>>>>
>>>> 15/10/15 01:47:31 ERROR NettyTransport: failed to bind to 
>>>> /54.210.XX.XX:7077, shutting down Netty transport
>>>>
>>>> (XX is my redaction of the full IP address.)
>>>>
>>>> Am I misunderstanding something about how to use this environment
>>>> variable?
>>>>
>>>> The spark-env.sh template indicates that either an IP address or a
>>>> hostname should work
>>>> <https://github.com/apache/spark/blob/4ace4f8a9c91beb21a0077e12b75637a4560a542/conf/spark-env.sh.template#L49>,
>>>> but my testing shows that only hostnames work.
>>>>
>>>> Nick
>>>> ​
>>>>
>>>
>>>
>


Re: stability of Spark 1.4.1 with Python 3 versions

2015-10-14 Thread Nicholas Chammas
The Spark 1.4 release notes say that Python 3 is supported. The 1.4 docs are
incorrect, and the 1.5 programming
guide has been updated to indicate Python 3 support.

On Wed, Oct 14, 2015 at 7:06 AM shoira.mukhsin...@bnpparibasfortis.com <
shoira.mukhsin...@bnpparibasfortis.com> wrote:

> Dear Spark Community,
>
>
>
> The official documentation of Spark 1.4.1 mentions that Spark runs on Python
> 2.6+ http://spark.apache.org/docs/1.4.1/
>
> It is not clear whether by “Python 2.6+” you also mean Python 3.4 or not.
>
>
>
> There is a resolved issue on this point which makes me believe that it
> does run on Python 3.4: https://issues.apache.org/jira/i#browse/SPARK-9705
>
> Maybe the documentation is simply not up to date ? The programming guide
> mentions that it does not work for Python 3:
> https://spark.apache.org/docs/1.4.1/programming-guide.html
>
>
>
> Do you confirm that Spark 1.4.1 does run on Python3.4?
>
>
>
> Thanks in advance for your reaction!
>
>
>
> Regards,
>
> Shoira
>
>
>
>
>
>
>
> ==
> BNP Paribas Fortis disclaimer:
> http://www.bnpparibasfortis.com/e-mail-disclaimer.html
>
> BNP Paribas Fortis privacy policy:
> http://www.bnpparibasfortis.com/privacy-policy.html
>
> ==
>


Re: [ANNOUNCE] Announcing Spark 1.5.1

2015-10-11 Thread Nicholas Chammas
You can find the source tagged for release on GitHub
, as was clearly
linked to in the thread to vote on the release (titled "[VOTE] Release
Apache Spark 1.5.1 (RC1)").

Is there something about that thread that was unclear?

Nick


On Sun, Oct 11, 2015 at 11:23 AM Daniel Gruno  wrote:

> On 10/11/2015 05:12 PM, Sean Owen wrote:
> > The Spark releases include a source distribution and several binary
> > distributions. This is pretty normal for Apache projects. What are you
> > referring to here?
>
> Surely the _source_ distribution does not contain binaries? How else can
> you vote on a release if you don't know what it contains?
>
> You can produce convenience downloads that contain binary files, yes,
> but surely you need a source-only package which is the one you vote on,
> that does not contain any binaries. Do you have such a thing? And where
> may I find it?
>
> With regards,
> Daniel.
>
> >
> > On Sun, Oct 11, 2015 at 3:26 PM, Daniel Gruno 
> wrote:
> >> Out of curiosity: How can you vote on a release that contains 34 binary
> files? Surely a source code release should only contain source code and not
> binaries, as you cannot verify the content of these.
> >>
> >> Looking forward to a response.
> >>
> >> With regards,
> >> Daniel.
> >>
> >> On 10/2/2015, 4:42:31 AM, Reynold Xin  wrote:
> >>> Hi All,
> >>>
> >>> Spark 1.5.1 is a maintenance release containing stability fixes. This
> >>> release is based on the branch-1.5 maintenance branch of Spark. We
> >>> *strongly recommend* all 1.5.0 users to upgrade to this release.
> >>>
> >>> The full list of bug fixes is here: http://s.apache.org/spark-1.5.1
> >>>
> >>> http://spark.apache.org/releases/spark-release-1-5-1.html
> >>>
> >>>
> >>> (note: it can take a few hours for everything to be propagated, so you
> >>> might get 404 on some download links, but everything should be in maven
> >>> central already)
> >>>
> >>
> >> -
> >> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: dev-h...@spark.apache.org
> >>
>
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
Sounds good to me.

For my purposes, I'm less concerned about old Spark artifacts and more
concerned about the consistency of the set of artifacts that get generated
with new releases. (e.g. Each new release will always include one artifact
each for Hadoop 1, Hadoop 1 + Scala 2.11, etc...)

It sounds like we can expect that set to stay the same with new releases
for now, but it's not a hard guarantee. I think that's fine for now.

Nick
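
(For anyone scripting against that bucket, a small Python 3 sanity check along these
lines can flag a missing package before a launch fails. This is just a sketch; the
artifact names are the ones discussed in this thread.)

    import urllib.error
    import urllib.request

    BASE = "https://s3.amazonaws.com/spark-related-packages/"

    def artifact_exists(name):
        """HEAD the artifact URL and report whether S3 serves it."""
        req = urllib.request.Request(BASE + name, method="HEAD")
        try:
            urllib.request.urlopen(req)
            return True
        except urllib.error.HTTPError:
            return False

    for name in ("spark-1.5.1-bin-hadoop1.tgz",
                 "spark-1.5.1-bin-hadoop1-scala2.11.tgz"):
        print(name, "present" if artifact_exists(name) else "missing")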

On Wed, Oct 7, 2015 at 1:57 PM Patrick Wendell <pwend...@gmail.com> wrote:

> I don't think we have a firm contract around that. So far we've never
> removed old artifacts, but the ASF has asked us at time to decrease the
> size of binaries we post. In the future at some point we may drop older
> ones since we keep adding new ones.
>
> If downstream projects are depending on our artifacts, I'd say just hold
> tight for now until something changes. If it changes, then those projects
> might need to build Spark on their own and host older hadoop versions, etc.
>
> On Wed, Oct 7, 2015 at 9:59 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks guys.
>>
>> Regarding this earlier question:
>>
>> More importantly, is there some rough specification for what packages we
>> should be able to expect in this S3 bucket with every release?
>>
>> Is the implied answer that we should continue to expect the same set of
>> artifacts for every release for the foreseeable future?
>>
>> Nick
>> ​
>>
>> On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell <pwend...@gmail.com>
>> wrote:
>>
>>> The missing artifacts are uploaded now. Things should propagate in the
>>> next 24 hours. If there are still issues past then ping this thread. Thanks!
>>>
>>> - Patrick
>>>
>>> On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Thanks for looking into this Josh.
>>>>
>>>> On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen <joshro...@databricks.com>
>>>> wrote:
>>>>
>>>>> I'm working on a fix for this right now. I'm planning to re-run a
>>>>> modified copy of the release packaging scripts which will emit only the
>>>>> missing artifacts (so we won't upload new artifacts with different SHAs 
>>>>> for
>>>>> the builds which *did* succeed).
>>>>>
>>>>> I expect to have this finished in the next day or so; I'm currently
>>>>> blocked by some infra downtime but expect that to be resolved soon.
>>>>>
>>>>> - Josh
>>>>>
>>>>> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> Blaž said:
>>>>>>
>>>>>> Also missing is
>>>>>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>>>>>> which breaks spark-ec2 script.
>>>>>>
>>>>>> This is the package I am referring to in my original email.
>>>>>>
>>>>>> Nick said:
>>>>>>
>>>>>> It appears that almost every version of Spark up to and including
>>>>>> 1.5.0 has included a —bin-hadoop1.tgz release (e.g.
>>>>>> spark-1.5.0-bin-hadoop1.tgz). However, 1.5.1 has no such package.
>>>>>>
>>>>>> Nick
>>>>>> ​
>>>>>>
>>>>>> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl <snud...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Also missing is
>>>>>>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>>>>>>> which breaks spark-ec2 script.
>>>>>>>
>>>>>>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>>
>>>>>>>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>>>>>>>>
>>>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>>>>>>
>>>>>>>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>>>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I’m looking here:
>>>>>>>>>
>>>>>>>>> https://s3.amazonaws.com/spark-related-packages/
>>>>>>>>>
>>>>>>>>> I believe this is where one set of official packages is published.
>>>>>>>>> Please correct me if this is not the case.
>>>>>>>>>
>>>>>>>>> It appears that almost every version of Spark up to and including
>>>>>>>>> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
>>>>>>>>> spark-1.5.0-bin-hadoop1.tgz).
>>>>>>>>>
>>>>>>>>> However, 1.5.1 has no such package. There is a
>>>>>>>>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a
>>>>>>>>> separate thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>>>>>>>>
>>>>>>>>> Was this intentional?
>>>>>>>>>
>>>>>>>>> More importantly, is there some rough specification for what
>>>>>>>>> packages we should be able to expect in this S3 bucket with every 
>>>>>>>>> release?
>>>>>>>>>
>>>>>>>>> This is important for those of us who depend on this publishing
>>>>>>>>> venue (e.g. spark-ec2 and related tools).
>>>>>>>>>
>>>>>>>>> Nick
>>>>>>>>> ​
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>
>>>
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-07 Thread Nicholas Chammas
Thanks guys.

Regarding this earlier question:

More importantly, is there some rough specification for what packages we
should be able to expect in this S3 bucket with every release?

Is the implied answer that we should continue to expect the same set of
artifacts for every release for the foreseeable future?

Nick
​

On Tue, Oct 6, 2015 at 1:13 AM Patrick Wendell <pwend...@gmail.com> wrote:

> The missing artifacts are uploaded now. Things should propagate in the
> next 24 hours. If there are still issues past then ping this thread. Thanks!
>
> - Patrick
>
> On Mon, Oct 5, 2015 at 2:41 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for looking into this Josh.
>>
>> On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen <joshro...@databricks.com>
>> wrote:
>>
>>> I'm working on a fix for this right now. I'm planning to re-run a
>>> modified copy of the release packaging scripts which will emit only the
>>> missing artifacts (so we won't upload new artifacts with different SHAs for
>>> the builds which *did* succeed).
>>>
>>> I expect to have this finished in the next day or so; I'm currently
>>> blocked by some infra downtime but expect that to be resolved soon.
>>>
>>> - Josh
>>>
>>> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Blaž said:
>>>>
>>>> Also missing is
>>>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>>>> which breaks spark-ec2 script.
>>>>
>>>> This is the package I am referring to in my original email.
>>>>
>>>> Nick said:
>>>>
>>>> It appears that almost every version of Spark up to and including 1.5.0
>>>> has included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
>>>> However, 1.5.1 has no such package.
>>>>
>>>> Nick
>>>> ​
>>>>
>>>> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl <snud...@gmail.com> wrote:
>>>>
>>>>> Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
>>>>> 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.
>>>>>
>>>>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>>>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>>>>
>>>>>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>>
>>>>>>> I’m looking here:
>>>>>>>
>>>>>>> https://s3.amazonaws.com/spark-related-packages/
>>>>>>>
>>>>>>> I believe this is where one set of official packages is published.
>>>>>>> Please correct me if this is not the case.
>>>>>>>
>>>>>>> It appears that almost every version of Spark up to and including
>>>>>>> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
>>>>>>> spark-1.5.0-bin-hadoop1.tgz).
>>>>>>>
>>>>>>> However, 1.5.1 has no such package. There is a
>>>>>>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a
>>>>>>> separate thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>>>>>>
>>>>>>> Was this intentional?
>>>>>>>
>>>>>>> More importantly, is there some rough specification for what
>>>>>>> packages we should be able to expect in this S3 bucket with every 
>>>>>>> release?
>>>>>>>
>>>>>>> This is important for those of us who depend on this publishing
>>>>>>> venue (e.g. spark-ec2 and related tools).
>>>>>>>
>>>>>>> Nick
>>>>>>> ​
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
Thanks for looking into this Josh.

On Mon, Oct 5, 2015 at 5:39 PM Josh Rosen <joshro...@databricks.com> wrote:

> I'm working on a fix for this right now. I'm planning to re-run a modified
> copy of the release packaging scripts which will emit only the missing
> artifacts (so we won't upload new artifacts with different SHAs for the
> builds which *did* succeed).
>
> I expect to have this finished in the next day or so; I'm currently
> blocked by some infra downtime but expect that to be resolved soon.
>
> - Josh
>
> On Mon, Oct 5, 2015 at 8:46 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Blaž said:
>>
>> Also missing is
>> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
>> which breaks spark-ec2 script.
>>
>> This is the package I am referring to in my original email.
>>
>> Nick said:
>>
>> It appears that almost every version of Spark up to and including 1.5.0
>> has included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
>> However, 1.5.1 has no such package.
>>
>> Nick
>> ​
>>
>> On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl <snud...@gmail.com> wrote:
>>
>>> Also missing is http://s3.amazonaws.com/spark-related-packages/spark-
>>> 1.5.1-bin-hadoop1.tgz which breaks spark-ec2 script.
>>>
>>> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>>>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>>>
>>>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>> I’m looking here:
>>>>>
>>>>> https://s3.amazonaws.com/spark-related-packages/
>>>>>
>>>>> I believe this is where one set of official packages is published.
>>>>> Please correct me if this is not the case.
>>>>>
>>>>> It appears that almost every version of Spark up to and including
>>>>> 1.5.0 has included a --bin-hadoop1.tgz release (e.g.
>>>>> spark-1.5.0-bin-hadoop1.tgz).
>>>>>
>>>>> However, 1.5.1 has no such package. There is a
>>>>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a separate
>>>>> thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>>>>
>>>>> Was this intentional?
>>>>>
>>>>> More importantly, is there some rough specification for what packages
>>>>> we should be able to expect in this S3 bucket with every release?
>>>>>
>>>>> This is important for those of us who depend on this publishing venue
>>>>> (e.g. spark-ec2 and related tools).
>>>>>
>>>>> Nick
>>>>> ​
>>>>>
>>>>
>>>>
>>>
>


Re: Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-05 Thread Nicholas Chammas
Blaž said:

Also missing is
http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
which breaks spark-ec2 script.

This is the package I am referring to in my original email.

Nick said:

It appears that almost every version of Spark up to and including 1.5.0 has
included a —bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).
However, 1.5.1 has no such package.

Nick
​

On Mon, Oct 5, 2015 at 3:27 AM Blaž Šnuderl <snud...@gmail.com> wrote:

> Also missing is 
> http://s3.amazonaws.com/spark-related-packages/spark-1.5.1-bin-hadoop1.tgz
> which breaks spark-ec2 script.
>
> On Mon, Oct 5, 2015 at 5:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> hadoop1 package for Scala 2.10 wasn't in RC1 either:
>> http://people.apache.org/~pwendell/spark-releases/spark-1.5.1-rc1-bin/
>>
>> On Sun, Oct 4, 2015 at 5:17 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I’m looking here:
>>>
>>> https://s3.amazonaws.com/spark-related-packages/
>>>
>>> I believe this is where one set of official packages is published.
>>> Please correct me if this is not the case.
>>>
>>> It appears that almost every version of Spark up to and including 1.5.0
>>> has included a --bin-hadoop1.tgz release (e.g.
>>> spark-1.5.0-bin-hadoop1.tgz).
>>>
>>> However, 1.5.1 has no such package. There is a
>>> spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a separate
>>> thing. (1.5.0 also has a hadoop1-scala2.11 package.)
>>>
>>> Was this intentional?
>>>
>>> More importantly, is there some rough specification for what packages we
>>> should be able to expect in this S3 bucket with every release?
>>>
>>> This is important for those of us who depend on this publishing venue
>>> (e.g. spark-ec2 and related tools).
>>>
>>> Nick
>>> ​
>>>
>>
>>
>


Spark 1.5.1 - Scala 2.10 - Hadoop 1 package is missing from S3

2015-10-04 Thread Nicholas Chammas
I’m looking here:

https://s3.amazonaws.com/spark-related-packages/

I believe this is where one set of official packages is published. Please
correct me if this is not the case.

It appears that almost every version of Spark up to and including 1.5.0 has
included a --bin-hadoop1.tgz release (e.g. spark-1.5.0-bin-hadoop1.tgz).

However, 1.5.1 has no such package. There is a
spark-1.5.1-bin-hadoop1-scala2.11.tgz package, but this is a separate
thing. (1.5.0 also has a hadoop1-scala2.11 package.)

Was this intentional?

More importantly, is there some rough specification for what packages we
should be able to expect in this S3 bucket with every release?

This is important for those of us who depend on this publishing venue (e.g.
spark-ec2 and related tools).

Nick
​


[issue25284] Spec for BaseEventLoop.run_in_executor(executor, callback, *args) is outdated in documentation

2015-09-30 Thread Nicholas Chammas

Changes by Nicholas Chammas <nicholas.cham...@gmail.com>:


--
nosy: +Nicholas Chammas

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25284>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



Re: How to get the HDFS path for each RDD

2015-09-27 Thread Nicholas Chammas
Shouldn't this discussion be held on the user list and not the dev list?
The dev list (this list) is for discussing development on Spark itself.

Please move the discussion accordingly.

Nick
On Sun, Sep 27, 2015 at 10:57 PM, Fengdong Yu wrote:

> Hi Anchit,
> can you create more than one record in each dataset and test again?
>
>
>
> On Sep 26, 2015, at 18:00, Fengdong Yu  wrote:
>
> Anchit,
>
> please ignore my inputs. you are right. Thanks.
>
>
>
> On Sep 26, 2015, at 17:27, Fengdong Yu  wrote:
>
> Hi Anchit,
>
> this is not what I expected, because you specified the HDFS directory in your
> code.
> I've solved it like this:
>
>   import org.apache.hadoop.io.{LongWritable, Text}
>   import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
>   import org.apache.spark.rdd.HadoopRDD
>
>   val text = sc.hadoopFile(Args.input,
>     classOf[TextInputFormat], classOf[LongWritable], classOf[Text], 2)
>   val hadoopRdd = text.asInstanceOf[HadoopRDD[LongWritable, Text]]
>
>   hadoopRdd.mapPartitionsWithInputSplit((inputSplit, iterator) => {
>     val file = inputSplit.asInstanceOf[FileSplit]
>     iterator.map(tp => (tp._1, new Text(file.toString + "," + tp._2.toString)))
>   })
>
>
>
>
> On Sep 25, 2015, at 13:12, Anchit Choudhry 
> wrote:
>
> Hi Fengdong,
>
> So I created two files in HDFS under a test folder.
>
> test/dt=20100101.json
> { "key1" : "value1" }
>
> test/dt=20100102.json
> { "key2" : "value2" }
>
> Then inside PySpark shell
>
> rdd = sc.wholeTextFiles('./test/*')
> rdd.collect()
> [(u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', u'{ "key1" : "value1" }'),
>  (u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json', u'{ "key2" : "value2" }')]
> import json
> def editMe(y, x):
>   j = json.loads(y)
>   j['source'] = x
>   return j
>
> rdd.map(lambda (x,y): editMe(y,x)).collect()
> [{'source': u'hdfs://localhost:9000/user/hduser/test/dt=20100101.json', u'key1': u'value1'},
>  {u'key2': u'value2', 'source': u'hdfs://localhost:9000/user/hduser/test/dt=20100102.json'}]
>
> Similarly you could modify the function to return 'source' and 'date' with
> some string manipulation per your requirements.
>
> Let me know if this helps.
>
> Thanks,
> Anchit
>
>
> On 24 September 2015 at 23:55, Fengdong Yu 
> wrote:
>
>>
>> yes. For example, I have two data sets:
>>
>> data set A: /data/test1/dt=20100101
>> data set B: /data/test2/dt=20100202
>>
>>
>> all data has the same JSON format, such as:
>> {"key1" : "value1", "key2" : "value2"}
>>
>>
>> my expected output:
>> {"key1" : "value1", "key2" : "value2", "source" : "test1", "date" : "20100101"}
>> {"key1" : "value1", "key2" : "value2", "source" : "test2", "date" : "20100202"}
>>
>>
>> On Sep 25, 2015, at 11:52, Anchit Choudhry 
>> wrote:
>>
>> Sure. May I ask for a sample input (could be just a few lines) and the
>> output you are expecting to bring clarity to my thoughts?
>>
>> On Thu, Sep 24, 2015, 23:44 Fengdong Yu  wrote:
>>
>>> Hi Anchit,
>>>
>>> Thanks for the quick answer.
>>>
>>> my exact question is: I want to add the HDFS location to each line in my
>>> JSON data.
>>>
>>>
>>>
>>> On Sep 25, 2015, at 11:25, Anchit Choudhry 
>>> wrote:
>>>
>>> Hi Fengdong,
>>>
>>> Thanks for your question.
>>>
>>> Spark already has a function called wholeTextFiles within sparkContext
>>> which can help you with that:
>>>
>>> Python
>>>
>>> hdfs://a-hdfs-path/part-0
>>> hdfs://a-hdfs-path/part-1
>>> ...
>>> hdfs://a-hdfs-path/part-n
>>>
>>> rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>>
>>> (a-hdfs-path/part-0, its content)
>>> (a-hdfs-path/part-1, its content)
>>> ...
>>> (a-hdfs-path/part-n, its content)
>>>
>>> More info: http://spark.apache.org/docs/latest/api/python/pyspark
>>> .html?highlight=wholetext#pyspark.SparkContext.wholeTextFiles
>>>
>>> 
>>>
>>> Scala
>>>
>>> val rdd = sparkContext.wholeTextFiles("hdfs://a-hdfs-path")
>>>
>>> More info: https://spark.apache.org/docs/latest/api/scala
>>> /index.html#org.apache.spark.SparkContext@wholeTextFiles(String,Int):RDD
>>> [(String,String)]
>>>
>>> Let us know if this helps or you need more help.
>>>
>>> Thanks,
>>> Anchit Choudhry
>>>
>>> On 24 September 2015 at 23:12, Fengdong Yu 
>>> wrote:
>>>
 Hi,

 I have  multiple files with JSON format, such as:

 /data/test1_data/sub100/test.data
 /data/test2_data/sub200/test.data


 I can sc.textFile("/data/*/*"),

 but I want to add {"source" : "HDFS_LOCATION"} to each line, and then
 save it to one target HDFS location.

 How can I do that? Thanks.






 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org


>>>
>>>
>>
>
>
>
>
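
(For reference, a minimal PySpark sketch of the path-parsing variant Anchit describes
above, run from a pyspark shell where `sc` already exists. It assumes one JSON record
per line and a /data/<source>/dt=<date>/<file> layout as in Fengdong's example; both
are assumptions, so adjust the index arithmetic to your real layout.)

    import json

    def tag_records(path, content):
        # e.g. path = "hdfs://.../data/test1/dt=20100101/test.data"
        parts = path.rstrip("/").split("/")
        source = parts[-3]               # "test1"
        date = parts[-2].split("=")[1]   # "20100101" from "dt=20100101"
        for line in content.splitlines():
            if not line.strip():
                continue
            record = json.loads(line)
            record["source"] = source
            record["date"] = date
            yield json.dumps(record)

    tagged = (sc.wholeTextFiles("/data/*/*/*")
                .flatMap(lambda pc: tag_records(pc[0], pc[1])))
    tagged.saveAsTextFile("/data/tagged_output")  # single target location, as requested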


[jira] [Commented] (SPARK-2622) Add Jenkins build numbers to SparkQA messages

2015-09-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14803169#comment-14803169
 ] 

Nicholas Chammas commented on SPARK-2622:
-

[~mxm] - I noticed you have been posting this kind of message on several Spark 
JIRAs (with a link to a non-related Flink PR). They appear to be mistakes made 
by some automated bot. Please correct this issue.

> Add Jenkins build numbers to SparkQA messages
> -
>
> Key: SPARK-2622
> URL: https://issues.apache.org/jira/browse/SPARK-2622
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.1
>Reporter: Xiangrui Meng
>Priority: Minor
>
> It takes Jenkins 2 hours to finish testing. It is possible to have the 
> following:
> {code}
> Build 1 started.
> PR updated.
> Build 2 started.
> Build 1 finished successfully.
> A committer merged the PR because the last build seemed to be okay.
> Build 2 failed.
> {code}
> It would be nice to put the build number in the SparkQA message so it is easy 
> to match the result with the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2622) Add Jenkins build numbers to SparkQA messages

2015-09-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2622?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14804559#comment-14804559
 ] 

Nicholas Chammas commented on SPARK-2622:
-

No worries. Thanks for quickly finding and resolving the issue. I appreciate it!

> Add Jenkins build numbers to SparkQA messages
> -
>
> Key: SPARK-2622
> URL: https://issues.apache.org/jira/browse/SPARK-2622
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.0.1
>Reporter: Xiangrui Meng
>Priority: Minor
>
> It takes Jenkins 2 hours to finish testing. It is possible to have the 
> following:
> {code}
> Build 1 started.
> PR updated.
> Build 2 started.
> Build 1 finished successfully.
> A committer merged the PR because the last build seemed to be okay.
> Build 2 failed.
> {code}
> It would be nice to put the build number in the SparkQA message so it is easy 
> to match the result with the build.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4216) Eliminate duplicate Jenkins GitHub posts from AMPLab

2015-09-16 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791016#comment-14791016
 ] 

Nicholas Chammas commented on SPARK-4216:
-

Thanks Josh!

> Eliminate duplicate Jenkins GitHub posts from AMPLab
> 
>
> Key: SPARK-4216
> URL: https://issues.apache.org/jira/browse/SPARK-4216
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Project Infra
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> * [Real Jenkins | 
> https://github.com/apache/spark/pull/2988#issuecomment-60873361]
> * [Imposter Jenkins | 
> https://github.com/apache/spark/pull/2988#issuecomment-60873366]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3369) Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator

2015-09-08 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3369?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735323#comment-14735323
 ] 

Nicholas Chammas commented on SPARK-3369:
-

Sean said:

{quote}
I don't think there's a "why" – just hasn't been done by someone who wants to 
do it. I think it's fine to document this. It would be more constructive if you 
opened a PR to this effect.
{quote}

I was about to comment to this effect.

There is a known problem here that we cannot fix until Spark 2.0 due to API 
compatibility guarantees. The only thing that can be done now is to perhaps add 
some documentation explaining this issue.

Ryan said:

{quote}
You know what type of change is guaranteed not to break existing code? Javadoc 
changes. Why has the FlatMapFunction interface (and other affected types and 
methods) not been documented as defective?
{quote}

The answer is simply that no-one has stepped up to do it yet. In open source 
projects, people generally work on what interests them. The person best in a 
position to fix an issue like this is one to whom this issue matters, and who 
is willing to take the initiative.

> Java mapPartitions Iterator->Iterable is inconsistent with Scala's 
> Iterator->Iterator
> -
>
> Key: SPARK-3369
> URL: https://issues.apache.org/jira/browse/SPARK-3369
> Project: Spark
>  Issue Type: Improvement
>  Components: Java API
>Affects Versions: 1.0.2, 1.2.1
>Reporter: Sean Owen
>Assignee: Sean Owen
>  Labels: breaking_change
> Attachments: FlatMapIterator.patch
>
>
> {{mapPartitions}} in the Scala RDD API takes a function that transforms an 
> {{Iterator}} to an {{Iterator}}: 
> http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.RDD
> In the Java RDD API, the equivalent is a FlatMapFunction, which operates on 
> an {{Iterator}} but is required to return an {{Iterable}}, which is a 
> stronger condition and appears inconsistent. It's a problematic inconsistency, 
> though, because this seems to require copying all of the input into memory in 
> order to create an object that can be iterated many times, since the input 
> does not afford this itself.
> Similarly for other {{mapPartitions*}} methods and other 
> {{*FlatMapFunction}}s in Java.
> (Is there a reason for this difference that I'm overlooking?)
> If I'm right that this was inadvertent inconsistency, then the big issue here 
> is that of course this is part of a public API. Workarounds I can think of:
> Promise that Spark will only call {{iterator()}} once, so implementors can 
> use a hacky {{IteratorIterable}} that returns the same {{Iterator}}.
> Or, make a series of methods accepting a {{FlatMapFunction2}}, etc. with the 
> desired signature, and deprecate existing ones.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-28 Thread Nicholas Chammas
Hi Everybody!

Thanks for participating in the spark-ec2 survey. The full results are
publicly viewable here:

https://docs.google.com/forms/d/1VC3YEcylbguzJ-YeggqxntL66MbqksQHPwbodPz_RTg/viewanalytics

The gist of the results is as follows:

Most people found spark-ec2 useful as an easy way to get a working Spark
cluster to run a quick experiment or do some benchmarking without having to
do a lot of manual configuration or setup work.

Many people lamented the slow launch times of spark-ec2, problems getting
it to launch clusters within a VPC, and broken Ganglia installs. Some also
mentioned that Hadoop 2 didn't work as expected.

Wish list items for spark-ec2 included faster launches, selectable Hadoop 2
versions, and more configuration options.

If you'd like to add your own feedback to what's already there, I've
decided to leave the survey open for a few more days:

http://goo.gl/forms/erct2s6KRR

As noted before, your results are anonymous and public.

Thanks again for participating! I hope this has been useful to the
community.

Nick

On Tue, Aug 25, 2015 at 1:31 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 Final chance to fill out the survey!

 http://goo.gl/forms/erct2s6KRR

 I'm gonna close it to new responses tonight and send out a summary of the
 results.

 Nick

 On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 I'm planning to close the survey to further responses early next week.

 If you haven't chimed in yet, the link to the survey is here:

 http://goo.gl/forms/erct2s6KRR

 We already have some great responses, which you can view. I'll share a
 summary after the survey is closed.

 Cheers!

 Nick


 On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Howdy folks!

 I’m interested in hearing about what people think of spark-ec2
 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
 formal JIRA process. Your answers will all be anonymous and public.

 If the embedded form below doesn’t work for you, you can use this link
 to get the same survey:

 http://goo.gl/forms/erct2s6KRR

 Cheers!
 Nick
 ​





Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-25 Thread Nicholas Chammas
Final chance to fill out the survey!

http://goo.gl/forms/erct2s6KRR

I'm gonna close it to new responses tonight and send out a summary of the
results.

Nick

On Thu, Aug 20, 2015 at 2:08 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 I'm planning to close the survey to further responses early next week.

 If you haven't chimed in yet, the link to the survey is here:

 http://goo.gl/forms/erct2s6KRR

 We already have some great responses, which you can view. I'll share a
 summary after the survey is closed.

 Cheers!

 Nick


 On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 Howdy folks!

 I’m interested in hearing about what people think of spark-ec2
 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
 formal JIRA process. Your answers will all be anonymous and public.

 If the embedded form below doesn’t work for you, you can use this link to
 get the same survey:

 http://goo.gl/forms/erct2s6KRR

 Cheers!
 Nick
 ​





[jira] [Commented] (SPARK-10191) spark-ec2 cannot stop running cluster

2015-08-24 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14710093#comment-14710093
 ] 

Nicholas Chammas commented on SPARK-10191:
--

Can you fill in the description here with a minimal description and 
reproduction of the issue?

 spark-ec2 cannot stop running cluster
 -

 Key: SPARK-10191
 URL: https://issues.apache.org/jira/browse/SPARK-10191
 Project: Spark
  Issue Type: Bug
  Components: EC2
 Environment: AWS EC2
Reporter: Ruofan Kong





--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2015-08-20 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705446#comment-14705446
 ] 

Nicholas Chammas commented on SPARK-3533:
-

{quote}
Nicholas Chammas Have you been able to take a look at the code?
{quote}

I'm unfortunately not in a good position to review the code and give the 
appropriate feedback. Someone like [~davies] or [~srowen] may be able to do 
that, but I can't speak for their availability.

{quote}
I'm not sure if you're suggesting it would be better to make a pull request 
now, or whether the gist is sufficient. I will open a pull request if you 
prefer. Is there anything else I should be doing to get committer buy-in?
{quote}

As a fellow contributor, I'm just advising that committer buy-in is essential 
to getting a feature like this landed. To get that, you may need to risk 
offering up a full solution knowing that it may be rejected or require many 
changes before acceptance.

An alternative would be to get some pre-approval for the idea and guidance from 
a committer (perhaps via the dev list) before crafting a full solution. 
However, if the feature is not already a priority for some committer, this is 
unlikely to happen.

I'm not sure what the right way to go is, but those are your options, 
realistically.

 Add saveAsTextFileByKey() method to RDDs
 

 Key: SPARK-3533
 URL: https://issues.apache.org/jira/browse/SPARK-3533
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

 Users often have a single RDD of key-value pairs that they want to save to 
 multiple locations based on the keys.
 For example, say I have an RDD like this:
 {code}
  a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
  'Frankie']).keyBy(lambda x: x[0])
  a.collect()
 [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
  a.keys().distinct().collect()
 ['B', 'F', 'N']
 {code}
 Now I want to write the RDD out to different paths depending on the keys, so 
 that I have one output directory per distinct key. Each output directory 
 could potentially have multiple {{part-}} files, one per RDD partition.
 So the output would look something like:
 {code}
 /path/prefix/B [/part-1, /part-2, etc]
 /path/prefix/F [/part-1, /part-2, etc]
 /path/prefix/N [/part-1, /part-2, etc]
 {code}
 Though it may be possible to do this with some combination of 
 {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
 {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
 It's not clear if it's even possible at all in PySpark.
 Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
 that makes it easy to save RDDs out to multiple locations at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: [survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-20 Thread Nicholas Chammas
I'm planning to close the survey to further responses early next week.

If you haven't chimed in yet, the link to the survey is here:

http://goo.gl/forms/erct2s6KRR

We already have some great responses, which you can view. I'll share a
summary after the survey is closed.

Cheers!

Nick


On Mon, Aug 17, 2015 at 11:09 AM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 Howdy folks!

 I’m interested in hearing about what people think of spark-ec2
 http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
 formal JIRA process. Your answers will all be anonymous and public.

 If the embedded form below doesn’t work for you, you can use this link to
 get the same survey:

 http://goo.gl/forms/erct2s6KRR

 Cheers!
 Nick
 ​





[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2015-08-20 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14705182#comment-14705182
 ] 

Nicholas Chammas commented on SPARK-3533:
-

No need to open a separate ticket if your proposal is closely related to
and satisfies the original intent of this one.

More important is getting committer buy-in for your idea and showing some
code. (Doing the latter first may help immensely with the former, in fact,
but there's a risk the effort will still be rejected.)

There are already other solutions out there (several on Stack Overflow
which are linked to from here) that make do with Spark's current API. This
proposal should focus on figuring what parts of those solutions can and
should go into Spark core.



 Add saveAsTextFileByKey() method to RDDs
 

 Key: SPARK-3533
 URL: https://issues.apache.org/jira/browse/SPARK-3533
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

 Users often have a single RDD of key-value pairs that they want to save to 
 multiple locations based on the keys.
 For example, say I have an RDD like this:
 {code}
  a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
  'Frankie']).keyBy(lambda x: x[0])
  a.collect()
 [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
  a.keys().distinct().collect()
 ['B', 'F', 'N']
 {code}
 Now I want to write the RDD out to different paths depending on the keys, so 
 that I have one output directory per distinct key. Each output directory 
 could potentially have multiple {{part-}} files, one per RDD partition.
 So the output would look something like:
 {code}
 /path/prefix/B [/part-1, /part-2, etc]
 /path/prefix/F [/part-1, /part-2, etc]
 /path/prefix/N [/part-1, /part-2, etc]
 {code}
 Though it may be possible to do this with some combination of 
 {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
 {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
 It's not clear if it's even possible at all in PySpark.
 Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
 that makes it easy to save RDDs out to multiple locations at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2015-08-17 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14699613#comment-14699613
 ] 

Nicholas Chammas commented on SPARK-3533:
-

[~silasdavis] - If you already have a working implementation that covers at 
least the Python, Java, and Scala APIs, then I suggest opening a PR to get 
detailed feedback. 

Is there anyone watching this JIRA who would be willing to shepherd a PR to 
solve it? Apart from having a working PR, we will need people (especially 
committers) to review and critique the approach for it to be accepted.

By the way Silas, there was a [previous 
attempt|https://github.com/apache/spark/pull/4895] at solving this issue that 
was closed by the author because he could not get it to work with Python. You 
might want to take a quick look at that.

 Add saveAsTextFileByKey() method to RDDs
 

 Key: SPARK-3533
 URL: https://issues.apache.org/jira/browse/SPARK-3533
 Project: Spark
  Issue Type: Improvement
  Components: PySpark, Spark Core
Affects Versions: 1.1.0
Reporter: Nicholas Chammas

 Users often have a single RDD of key-value pairs that they want to save to 
 multiple locations based on the keys.
 For example, say I have an RDD like this:
 {code}
  a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
  'Frankie']).keyBy(lambda x: x[0])
  a.collect()
 [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
  a.keys().distinct().collect()
 ['B', 'F', 'N']
 {code}
 Now I want to write the RDD out to different paths depending on the keys, so 
 that I have one output directory per distinct key. Each output directory 
 could potentially have multiple {{part-}} files, one per RDD partition.
 So the output would look something like:
 {code}
 /path/prefix/B [/part-1, /part-2, etc]
 /path/prefix/F [/part-1, /part-2, etc]
 /path/prefix/N [/part-1, /part-2, etc]
 {code}
 Though it may be possible to do this with some combination of 
 {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
 {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
 It's not clear if it's even possible at all in PySpark.
 Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
 that makes it easy to save RDDs out to multiple locations at once.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[survey] [spark-ec2] What do you like/dislike about spark-ec2?

2015-08-17 Thread Nicholas Chammas
Howdy folks!

I’m interested in hearing about what people think of spark-ec2
http://spark.apache.org/docs/latest/ec2-scripts.html outside of the
formal JIRA process. Your answers will all be anonymous and public.

If the embedded form below doesn’t work for you, you can use this link to
get the same survey:

http://goo.gl/forms/erct2s6KRR

Cheers!
Nick
​




Re: Writing to multiple outputs in Spark

2015-08-14 Thread Nicholas Chammas
See: https://issues.apache.org/jira/browse/SPARK-3533

Feel free to comment there and make a case if you think the issue should be
reopened.

Nick

On Fri, Aug 14, 2015 at 11:11 AM Abhishek R. Singh 
abhis...@tetrationanalytics.com wrote:

 A workaround would be to have multiple passes on the RDD and each pass
 write its own output?

 Or in a foreachPartition do it in a single pass (open up multiple files
 per partition to write out)?

 -Abhishek-

 On Aug 14, 2015, at 7:56 AM, Silas Davis si...@silasdavis.net wrote:

 Would it be right to assume that the silence on this topic implies others
 don't really have this issue/desire?

 On Sat, 18 Jul 2015 at 17:24 Silas Davis si...@silasdavis.net wrote:

 *tl;dr hadoop and cascading* *provide ways of writing tuples to multiple
 output files based on key, but the plain RDD interface doesn't seem to and
 it should.*

 I have been looking into ways to write to multiple outputs in Spark. It
 seems like a feature that is somewhat missing from Spark.

  The idea is to partition output and write the elements of an RDD to
  different locations based on the key. For example, in a pair RDD
  your key may be (language, date, userId) and you would like to write
  separate files to $someBasePath/$language/$date. Then there would be a
  version of saveAsHadoopDataset that would be able to write to multiple
  locations based on key using the underlying OutputFormat. Perhaps it would take a
 pair RDD with keys ($partitionKey, $realKey), so for example ((language,
 date), userId).

 The prior art I have found on this is the following.

 Using SparkSQL:
 The 'partitionBy' method of DataFrameWriter:
 https://spark.apache.org/docs/1.4.0/api/scala/index.html#org.apache.spark.sql.DataFrameWriter

 This only works for parquet at the moment.

 Using Spark/Hadoop:
  This pull request (with the hadoop1 API):
 https://github.com/apache/spark/pull/4895/files.

 This uses MultipleTextOutputFormat (which in turn uses
 MultipleOutputFormat) which is part of the old hadoop1 API. It only works
 for text but could be generalised for any underlying OutputFormat by using
 MultipleOutputFormat (but only for hadoop1 - which doesn't support
 ParquetAvroOutputFormat for example)

 This gist (With the hadoop2 API):
 https://gist.github.com/mlehman/df9546f6be2e362bbad2

 This uses MultipleOutputs (available for both the old and new hadoop
 APIs) and extends saveAsNewHadoopDataset to support multiple outputs.
 Should work for any underlying OutputFormat. Probably better implemented by
 extending saveAs[NewAPI]HadoopDataset.

 In Cascading:
  Cascading provides PartitionTap:
 http://docs.cascading.org/cascading/2.5/javadoc/cascading/tap/local/PartitionTap.html
 to do this

 So my questions are: is there a reason why Spark doesn't provide this?
 Does Spark provide similar functionality through some other mechanism? How
 would it be best implemented?

  Since I started composing this message I've had a go at writing a
  wrapper OutputFormat that writes multiple outputs using hadoop
 MultipleOutputs but doesn't require modification of the PairRDDFunctions.
 The principle is similar however. Again it feels slightly hacky to use
 dummy fields for the ReduceContextImpl, but some of this may be a part of
 the impedance mismatch between Spark and plain Hadoop... Here is my
 attempt: https://gist.github.com/silasdavis/d1d1f1f7ab78249af462

 I'd like to see this functionality in Spark somehow but invite suggestion
 of how best to achieve it.

 Thanks,
 Silas
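
(A minimal PySpark sketch of the multiple-pass workaround Abhishek mentions earlier in
this thread: one filter + saveAsTextFile per distinct key, using the sample data from
SPARK-3533. It is only reasonable when the number of distinct keys is small, and the
output path is illustrative.)

    pairs = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
    pairs.cache()  # avoid recomputing the source RDD on every pass
    for key in pairs.keys().distinct().collect():
        (pairs.filter(lambda kv, k=key: kv[0] == k)  # default arg pins the key for this pass
              .values()
              .saveAsTextFile("/path/prefix/{}".format(key)))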





Re: Unsubscribe

2015-08-03 Thread Nicholas Chammas
The way to do that is to follow the Unsubscribe link here for dev@spark:

http://spark.apache.org/community.html

We can't drop you. You have to do it yourself.

Nick

On Mon, Aug 3, 2015 at 1:54 PM Trevor Grant trevor.d.gr...@gmail.com
wrote:

 Please drop me from this list

 Trevor Grant
 Data Scientist
 https://github.com/rawkintrevo
 http://stackexchange.com/users/3002022/rawkintrevo

 *Fortunate is he, who is able to know the causes of things.  -Virgil*




Re: Should spark-ec2 get its own repo?

2015-08-02 Thread Nicholas Chammas
On Sat, Aug 1, 2015 at 1:09 PM Matt Goodman meawo...@gmail.com wrote:

 I am considering porting some of this to a more general spark-cloud
 launcher, including google/aliyun/rackspace.  It shouldn't be hard at all
 given the current approach for setup/install.


FWIW, there are already some tools for launching Spark clusters on GCE and
Azure:

http://spark-packages.org/?q=tags%3A%22Deployment%22

Nick


Re: spark spark-ec2 credentials using aws_security_token

2015-07-27 Thread Nicholas Chammas
You refer to `aws_security_token`, but I'm not sure where you're specifying
it. Can you elaborate? Is it an environment variable?
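
(In the meantime, a quick standalone check with boto 2.x -- the version bundled under
ec2/lib in the traceback below -- can help confirm whether the temporary credentials
themselves are usable, independent of spark-ec2. The key values are placeholders, and
passing the token as `security_token` is my assumption for that boto version.)

    import boto

    conn = boto.connect_ec2(
        aws_access_key_id="AKIA...",   # placeholder
        aws_secret_access_key="...",   # placeholder
        security_token="...",          # the temporary session token from STS
    )
    print(conn.get_all_regions())      # lists EC2 regions if the credentials are accepted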

On Mon, Jul 27, 2015 at 4:21 AM Jan Zikeš jan.zi...@centrum.cz wrote:

 Hi,

 I would like to ask whether it is currently possible to use the spark-ec2 script
 together with credentials that consist not only of aws_access_key_id and
 aws_secret_access_key, but also include aws_security_token.

 When I try to run the script I am getting following error message:

 ERROR:boto:Caught exception reading instance data
 Traceback (most recent call last):
   File /Users/zikes/opensource/spark/ec2/lib/boto-2.34.0/boto/utils.py,
 line 210, in retry_url
 r = opener.open(req, timeout=timeout)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 404, in open
 response = self._open(req, data)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 422, in _open
 '_open', req)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 382, in _call_chain
 result = func(*args)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 1214, in http_open
 return self.do_open(httplib.HTTPConnection, req)
   File

 /System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/urllib2.py,
 line 1184, in do_open
 raise URLError(err)
 URLError: urlopen error [Errno 64] Host is down
 ERROR:boto:Unable to read instance data, giving up
 No handler was ready to authenticate. 1 handlers were checked.
 ['QuerySignatureV2AuthHandler'] Check your credentials

 Does anyone have an idea of what could possibly be wrong? Is aws_security_token
 the problem?
 I know that it seems more like a boto problem, but I would still like to ask
 if anybody has some experience with this?

 My launch command is:
 ./spark-ec2 -k my_key -i my_key.pem --additional-tags
 mytag:tag1,mytag2:tag2 --instance-profile-name profile1 -s 1 launch
 test

 Thank you in advance for any help.
 Best regards,

 Jan

 Note:
 I have also asked at

 http://stackoverflow.com/questions/31583513/spark-spark-ec2-credentials-using-aws-security-token?noredirect=1#comment51151822_31583513
 without any success.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-spark-ec2-credentials-using-aws-security-token-tp24007.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: Should spark-ec2 get its own repo?

2015-07-13 Thread Nicholas Chammas
 At a high level I see the spark-ec2 scripts as an effort to provide a
reference implementation for launching EC2 clusters with Apache Spark

On a side note, this is precisely how I used spark-ec2 for a personal
project that does something similar: reference implementation.

Nick
On Mon, Jul 13, 2015 at 1:27 PM, Shivaram Venkataraman shiva...@eecs.berkeley.edu wrote:

 I think moving the repo-location and re-organizing the python code to
 handle dependencies, testing etc. sounds good to me. However, I think there
 are a couple of things which I am not sure about

 1. I strongly believe that we should preserve existing command-line in
 ec2/spark-ec2 (i.e. the shell script not the python file). This could be a
  thin wrapper script that just checks out or downloads something
  (similar to, say, build/mvn). Mainly, I see no reason to break the workflow
 that users are used to right now.

 2. I am also not sure about that moving the issue tracker is necessarily a
 good idea. I don't think we get a large number of issues due to the EC2
 stuff  and if we do have a workflow for launching EC2 clusters, the Spark
 JIRA would still be the natural place to report issues related to this.

 At a high level I see the spark-ec2 scripts as an effort to provide a
 reference implementation for launching EC2 clusters with Apache Spark --
 Given this view I am not sure it makes sense to completely decouple this
 from the Apache project.

 Thanks
 Shivaram

 On Sun, Jul 12, 2015 at 1:34 AM, Sean Owen so...@cloudera.com wrote:

 I agree with these points. The ec2 support is substantially a separate
 project, and would likely be better managed as one. People can much
 more rapidly iterate on it and release it.

 I suggest:

 1. Pick a new repo location. amplab/spark-ec2 ? spark-ec2/spark-ec2 ?
 2. Add interested parties as owners/contributors
 3. Reassemble a working clone of the current code from spark/ec2 and
 mesos/spark-ec2 and check it in
 4. Announce the new location on user@, dev@
 5. Triage open JIRAs to the new repo's issue tracker and close them
 elsewhere
 6. Remove the old copies of the code and leave a pointer to the new
 location in their place

 I'd also like to hear a few more nods before pulling the trigger though.

 On Sat, Jul 11, 2015 at 7:07 PM, Matt Goodman meawo...@gmail.com wrote:
  I wanted to revive the conversation about the spark-ec2 tools, as it
 seems
  to have been lost in the 1.4.1 release voting spree.
 
  I think that splitting it into its own repository is a really good
 move, and
  I would also be happy to help with this transition, as well as help
 maintain
  the resulting repository.  Here is my justification for why we ought to
 do
  this split.
 
  User Facing:
 
  - The spark-ec2 launcher doesn't use anything in the parent spark repository.
  - The spark-ec2 version is disjoint from the parent repo. I consider it
    confusing that the spark-ec2 script doesn't launch the version of spark it
    is checked out with.
  - Someone interested in setting up spark-ec2 with anything but the default
    configuration will have to clone at least 2 repositories at present, and
    probably fork and push changes to 1.
  - spark-ec2 has mismatched dependencies wrt. spark itself. This includes a
    confusing shim in the spark-ec2 script to install boto, which frankly
    should just be a dependency of the script.
 
  Developer Facing:
 
  - Support across 2 repos will be worse than across 1. It's unclear where to
    file issues/PRs, and it requires extra communication for even fairly
    trivial stuff.
  - spark-ec2 also depends on a number of binary blobs being in the right
    place; currently the responsibility for these is decentralized, and likely
    prone to various flavors of dumb.
  - The current flow of booting a spark-ec2 cluster is _complicated_. I spent
    the better part of a couple days figuring out how to integrate our custom
    tools into this stack. This is very hard to fix when commits/PRs need to
    span groups/repositories/buckets-o-binary, and I am sure there are several
    other problems that are languishing under similar roadblocks.
  - It makes testing possible. The spark-ec2 script is a great case for CI
    given the number of permutations of launch criteria there are. I suspect
    AWS would be happy to foot the bill on spark-ec2 testing (probably ~20
    bucks a month based on some envelope sketches), as it is a piece of
    software that directly impacts other people giving them money. I have some
    contacts there, and I am pretty sure this would be an easy conversation,
    particularly if the repo is directly concerned with EC2. Think also of
    being able to assemble the binary blobs into an S3 bucket dedicated to
    spark-ec2.
 
  Any other thoughts/voices appreciated here.  spark-ec2 is a super-power
 tool
  and deserves a fair bit of attention!
  --Matthew Goodman
 
  =
  Check Out My Website: http://craneium.net
  Find me on LinkedIn: http://tinyurl.com/d6wlch

 

[jira] [Commented] (SPARK-8960) Style cleanup of spark_ec2.py

2015-07-10 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8960?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14622542#comment-14622542
 ] 

Nicholas Chammas commented on SPARK-8960:
-

Style cleanup is OK, but it should preferably be done at the same time as 
something more significant, like a refactoring of some part of the code, or 
the addition of new features or bug fixes.

A PR just to clean up minor style issues is probably not needed (feel free to 
make the case otherwise), since the script is already PEP 8-compliant and thus 
meets a minimal level of consistency and good style.

A PR towards reorganizing the script (per the [dev list 
discussion|http://mail-archives.apache.org/mod_mbox/incubator-spark-dev/201507.mbox/%3CCAOhmDzcnYgswssNP11VbGzSLisOKjGfnuMQMQc7yHiDL5SusmA%40mail.gmail.com%3E])
 or adding tests for spark-ec2 (as Daniel mentioned) is probably much more 
valuable.

 Style cleanup of spark_ec2.py
 -

 Key: SPARK-8960
 URL: https://issues.apache.org/jira/browse/SPARK-8960
 Project: Spark
  Issue Type: Task
  Components: EC2
Affects Versions: 1.4.0
Reporter: Daniel Darabos
Priority: Trivial

 The spark_ec2.py script could use some cleanup I think. There are simple 
 style issues like mixing single and double quotes, but also some rather 
 un-Pythonic constructs (e.g. 
 https://github.com/apache/spark/pull/6336#commitcomment-12088624 that sparked 
 this JIRA). Whenever I read it, I always find something that is too minor for 
 a pull request/JIRA, but I'd fix it if it was my code. Perhaps we can address 
 such issues in this JIRA.
 The intention is not to introduce any behavioral changes. It's hard to verify 
 this without testing, so perhaps we should also add some kind of test.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: spark ec2 as non-root / any plan to improve that in the future ?

2015-07-09 Thread Nicholas Chammas
No plans to change that at the moment, but agreed it is against accepted
convention. It would be a lot of work to change the tool, change the AMIs,
and test everything. My suggestion is not to hold your breath for such a
change.

spark-ec2, as far as I understand, is not intended for spinning up
permanent or production infrastructure (though people may use it for those
purposes), so there isn't a big impetus to fix this kind of issue. It works
really well for what it was intended for: spinning up clusters for testing,
prototyping, and experimenting.

Nick

On Thu, Jul 9, 2015 at 3:25 AM matd matd...@gmail.com wrote:

 Hi,

 Spark ec2 scripts are useful, but they install everything as root.
 AFAIK, it's not a good practice ;-)

 Why is it so?
 Should these scripts be reserved for test/demo purposes, and not be used for
 a production system?
 Is it planned in some roadmap to improve that, or to replace the ec2 scripts
 with something else?

 Would it be difficult to change them to use a sudoer instead?



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/spark-ec2-as-non-root-any-plan-to-improve-that-in-the-future-tp23734.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Should spark-ec2 get its own repo?

2015-07-03 Thread Nicholas Chammas
spark-ec2 is kind of a mini project within a project.

It’s composed of a set of EC2 AMIs
https://github.com/mesos/spark-ec2/tree/branch-1.4/ami-list under
someone’s account (maybe Patrick’s?) plus the following 2 code bases:

   - Main command line tool: https://github.com/apache/spark/tree/master/ec2
   - Scripts used to install stuff on launched instances:
   https://github.com/mesos/spark-ec2

You’ll notice that part of the code lives under the Mesos GitHub
organization. This is an artifact of history, when Spark itself kinda grew
out of Mesos before becoming its own project.

There are a few issues with this state of affairs, none of which are major
but which nonetheless merit some discussion:

   - The spark-ec2 code is split across 2 repositories when it is not
   technically necessary.
   - Some of that code is owned by an organization that should technically
   not be owning Spark stuff.
   - Spark and spark-ec2 live in the same repo but spark-ec2 issues are
   often completely disjoint from issues with Spark itself. This has led in
   some cases to new Spark RCs being cut because of minor issues with
   spark-ec2 (like version strings not being updated).

I wanted to put up for discussion a few suggestions and see what people
agreed with.

   1. The current state of affairs is fine and it is not worth moving stuff
   around.
   2. spark-ec2 should get its own repo, and should be moved out of the
   main Spark repo. That means both of the code bases linked above would live
   in one place (maybe a spark-ec2/spark-ec2 repo).
   3. spark-ec2 should stay in the Spark repo, but the stuff under the
   Mesos organization should be moved elsewhere (again, perhaps under a
   spark-ec2/spark-ec2 repo).

What do you think?

Nick
​


[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-29 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14605822#comment-14605822
 ] 

Nicholas Chammas commented on SPARK-8670:
-

Not sure. Does Scala offer the same flexibility in syntax as Python?

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-29 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606178#comment-14606178
 ] 

Nicholas Chammas commented on SPARK-8670:
-

FYI: `df.stats.age` works neither on 1.3 nor on 1.4. In both cases it yields 
this:

{code}
AttributeError: 'Column' object has no attribute 'age'
{code}

`df.selectExpr("stats.age")` does work, though.
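
For reference, a minimal sketch of the access patterns discussed here, assuming a DataFrame {{df}} shaped like the one in the description below:

{code}
df.select('stats.age').show()      # works: select() accepts the dotted path
df.selectExpr('stats.age').show()  # works: selectExpr() accepts it too
df.stats.age                       # fails: AttributeError on the 'stats' Column
df['stats.age']                    # fails on 1.4: IndexError, no such column
{code}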

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-29 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606328#comment-14606328
 ] 

Nicholas Chammas edited comment on SPARK-8670 at 6/29/15 9:01 PM:
--

After a discussion with [~davies], it appears that the new way to access or 
reference a nested field in 1.4 using the \_\_getitem\_\_ syntax is as follows:

{code}
# corrected example
df['stats']['age']  # 1.4 works, 1.3 doesn't

# original example
df['stats.age']  # 1.3 works, 1.4 doesn't
{code}

So it looks like something changed between 1.3 and 1.4, and the new way is the 
way of the future.

Thankfully, the corrected example is clearer than the original, and I 
understand from [~yhuai] that 1.4 now supports column names with dots in them, 
so `df\['stats.age'\]` in 1.4 would reference a non-existent column.

Marking this as not an issue, even though technically something that worked in 
1.3 no longer works in 1.4.


was (Author: nchammas):
After a discussion with [~davies], it appears that the way to access or 
reference a nested field in both 1.3 and 1.4 is as follows:

{code}
# corrected example
df['stats']['age']  # works on both 1.3 and 1.4

# original example
df['stats.age']  # 1.3 works, 1.4 doesn't
{code}

So I'm not sure this is a bug so much as it is just a misunderstanding of how 
to access nested fields, combined with a change in how expressions are parsed.

Thankfully, the corrected example is clearer than the original, and I 
understand from [~yhuai] that 1.4 now supports column names with dots in them, 
so `df\['stats.age'\]` in 1.4 would reference a non-existent column.

Marking this as not an issue.

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-29 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-8670.
-
Resolution: Invalid

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-29 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14606328#comment-14606328
 ] 

Nicholas Chammas commented on SPARK-8670:
-

After a discussion with [~davies], it appears that the way to access or 
reference a nested field in both 1.3 and 1.4 is as follows:

{code}
# corrected example
df['stats']['age']  # works on both 1.3 and 1.4

# original example
df['stats.age']  # 1.3 works, 1.4 doesn't
{code}

So I'm not sure this is a bug so much as it is just a misunderstanding of how 
to access nested fields, combined with a change in how expressions are parsed.

Thankfully, the corrected example is clearer than the original, and I 
understand from [~yhuai] that 1.4 now supports column names with dots in them, 
so `df\['stats.age'\]` in 1.4 would reference a non-existent column.

Marking this as not an issue.

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Nicholas Chammas
Yeah, you shouldn't have to rename the columns before joining them.

Do you see the same behavior on 1.3 vs 1.4?
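
As an aside, and this is a guess rather than a confirmed diagnosis: Python's
`and` cannot combine Column expressions, so a compound condition written with
`and` can silently collapse to just its last clause. A minimal sketch of the
same join using `&` instead (names taken from your example):

cond = (df1.name == df2.name) & (df1.country == df2.country)
df1.join(df2, cond, 'left_outer').collect()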

Nick
On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl a...@whisperstream.com wrote:

 still feels like a bug to have to create unique names before a join.

 On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com wrote:

 You can declare the schema with unique names before creation of df.
 On 27 Jun 2015 13:01, Axel Dahl a...@whisperstream.com wrote:


 I have the following code:

 from pyspark import SQLContext

 d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
 d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'}, {'name':'alice',
 'country': 'ire', 'colour':'green'}]

 r1 = sc.parallelize(d1)
 r2 = sc.parallelize(d2)

 sqlContext = SQLContext(sc)
 df1 = sqlContext.createDataFrame(d1)
 df2 = sqlContext.createDataFrame(d2)
 df1.join(df2, df1.name == df2.name and df1.country == df2.country,
 'left_outer').collect()


 When I run it I get the following (notice that in the first row, all join
 keys are taken from the right side and so are blanked out):

 [Row(age=2, country=None, name=None, colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 I would expect to get (though ideally without duplicate columns):
 [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 The workaround for now is this rather clunky piece of code:
 df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
 'name2').withColumnRenamed('country', 'country2')
 df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
 'left_outer').collect()

 So to me it looks like a bug, but am I doing something wrong?

 Thanks,

 -Axel








Re: dataframe left joins are not working as expected in pyspark

2015-06-27 Thread Nicholas Chammas
I would test it against 1.3 to be sure, because it could -- though unlikely
-- be a regression. For example, I recently stumbled upon this issue
https://issues.apache.org/jira/browse/SPARK-8670 which was specific to
1.4.

On Sat, Jun 27, 2015 at 12:28 PM Axel Dahl a...@whisperstream.com wrote:

 I've only tested on 1.4, but imagine 1.3 is the same or a lot of people's
 code would be failing right now.

 On Saturday, June 27, 2015, Nicholas Chammas nicholas.cham...@gmail.com
 wrote:

 Yeah, you shouldn't have to rename the columns before joining them.

 Do you see the same behavior on 1.3 vs 1.4?

 Nick
 On Sat, Jun 27, 2015 at 2:51 AM, Axel Dahl a...@whisperstream.com wrote:

 still feels like a bug to have to create unique names before a join.

 On Fri, Jun 26, 2015 at 9:51 PM, ayan guha guha.a...@gmail.com wrote:

 You can declare the schema with unique names before creation of df.
 On 27 Jun 2015 13:01, Axel Dahl a...@whisperstream.com wrote:


 I have the following code:

 from pyspark import SQLContext

 d1 = [{'name':'bob', 'country': 'usa', 'age': 1}, {'name':'alice',
 'country': 'jpn', 'age': 2}, {'name':'carol', 'country': 'ire', 'age': 3}]
 d2 = [{'name':'bob', 'country': 'usa', 'colour':'red'},
 {'name':'alice', 'country': 'ire', 'colour':'green'}]

 r1 = sc.parallelize(d1)
 r2 = sc.parallelize(d2)

 sqlContext = SQLContext(sc)
 df1 = sqlContext.createDataFrame(d1)
 df2 = sqlContext.createDataFrame(d2)
 df1.join(df2, df1.name == df2.name and df1.country == df2.country,
 'left_outer').collect()


 When I run it I get the following (notice that in the first row, all join
 keys are taken from the right side and so are blanked out):

 [Row(age=2, country=None, name=None, colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 I would expect to get (though ideally without duplicate columns):
 [Row(age=2, country=u'ire', name=u'Alice', colour=None, country=None,
 name=None),
 Row(age=1, country=u'usa', name=u'bob', colour=u'red', country=u'usa',
 name=u'bob'),
 Row(age=3, country=u'ire', name=u'alice', colour=u'green',
 country=u'ire', name=u'alice')]

 The workaround for now is this rather clunky piece of code:
 df2 = sqlContext.createDataFrame(d2).withColumnRenamed('name',
 'name2').withColumnRenamed('country', 'country2')
 df1.join(df2, df1.name == df2.name2 and df1.country == df2.country2,
 'left_outer').collect()

 So to me it looks like a bug, but am I doing something wrong?

 Thanks,

 -Axel








[jira] [Created] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-26 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-8670:
---

 Summary: Nested columns can't be referenced (but they can be 
selected)
 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas


This is strange and looks like a regression from 1.3.

{code}
import json

daterz = [
  {
'name': 'Nick',
'stats': {
  'age': 28
}
  },
  {
'name': 'George',
'stats': {
  'age': 31
}
  }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28 
31 
Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19 
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682             return Column(jc)

IndexError: no such column: stats.age
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-26 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-8670:

Description: 
This is strange and looks like a regression from 1.3.

{code}
import json

daterz = [
  {
'name': 'Nick',
'stats': {
  'age': 28
}
  },
  {
'name': 'George',
'stats': {
  'age': 31
}
  }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28 
31 
Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19 
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682             return Column(jc)

IndexError: no such column: stats.age
{code}

This means, among other things, that you can't join DataFrames on nested 
columns.

  was:
This is strange and looks like a regression from 1.3.

{code}
import json

daterz = [
  {
'name': 'Nick',
'stats': {
  'age': 28
}
  },
  {
'name': 'George',
'stats': {
  'age': 31
}
  }
]

df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))

df.select('stats.age').show()
df['stats.age']  # 1.4 fails on this line
{code}

On 1.3 this works and yields:

{code}
age
28 
31 
Out[1]: Column<stats.age AS age#2958L>
{code}

On 1.4, however, this gives an error on the last line:

{code}
+---+
|age|
+---+
| 28|
| 31|
+---+

---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-1-04bd990e94c6> in <module>()
     19 
     20 df.select('stats.age').show()
---> 21 df['stats.age']

/path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
    678         if isinstance(item, basestring):
    679             if item not in self.columns:
--> 680                 raise IndexError("no such column: %s" % item)
    681             jc = self._jdf.apply(item)
    682             return Column(jc)

IndexError: no such column: stats.age
{code}


 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-26 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603572#comment-14603572
 ] 

Nicholas Chammas commented on SPARK-8670:
-

cc [~rxin], [~davies]

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8670) Nested columns can't be referenced (but they can be selected)

2015-06-26 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14603658#comment-14603658
 ] 

Nicholas Chammas commented on SPARK-8670:
-

I thought, per the discussion on [SPARK-7035], that \{\{df\['stats.age'\]\}\} 
is the preferred syntax over \{\{df.stats.age\}\}. So if I wanted to join 
\{\{df\}\} to another DataFrame on \{\{stats.age\}\} I would specify the join 
condition, for example, as such: \{\{df\['stats.age'\] == other_df\['age'\]\}\}

Also, this worked fine in 1.3 and broke starting in 1.4. Perhaps this was 
intentional, but I didn't see mention of it in the [1.4 release 
notes|https://spark.apache.org/releases/spark-release-1-4-0.html].

 Nested columns can't be referenced (but they can be selected)
 -

 Key: SPARK-8670
 URL: https://issues.apache.org/jira/browse/SPARK-8670
 Project: Spark
  Issue Type: Bug
  Components: PySpark, SQL
Affects Versions: 1.4.0
Reporter: Nicholas Chammas

 This is strange and looks like a regression from 1.3.
 {code}
 import json
 daterz = [
   {
 'name': 'Nick',
 'stats': {
   'age': 28
 }
   },
   {
 'name': 'George',
 'stats': {
   'age': 31
 }
   }
 ]
 df = sqlContext.jsonRDD(sc.parallelize(daterz).map(lambda x: json.dumps(x)))
 df.select('stats.age').show()
 df['stats.age']  # 1.4 fails on this line
 {code}
 On 1.3 this works and yields:
 {code}
 age
 28 
 31 
 Out[1]: Column<stats.age AS age#2958L>
 {code}
 On 1.4, however, this gives an error on the last line:
 {code}
 +---+
 |age|
 +---+
 | 28|
 | 31|
 +---+
 ---------------------------------------------------------------------------
 IndexError                                Traceback (most recent call last)
 <ipython-input-1-04bd990e94c6> in <module>()
      19 
      20 df.select('stats.age').show()
 ---> 21 df['stats.age']
 /path/to/spark/python/pyspark/sql/dataframe.pyc in __getitem__(self, item)
     678         if isinstance(item, basestring):
     679             if item not in self.columns:
 --> 680                 raise IndexError("no such column: %s" % item)
     681             jc = self._jdf.apply(item)
     682             return Column(jc)
 IndexError: no such column: stats.age
 {code}
 This means, among other things, that you can't join DataFrames on nested 
 columns.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-06-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-6220.
-
Resolution: Won't Fix

Resolving this issue as won't fix since it is of low importance and can be 
replaced by a few specific options for important EC2 features that are 
currently missing.

 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 \
 ...
 --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
 --ec2-instance-option ebs_optimized=True
 {code}
 I'm not sure about the exact syntax of the extended options, but something 
 like this will do the trick as long as it can be made to pass the options 
 correctly to boto in most cases.
 I followed the example of {{ssh}}, which supports multiple extended options 
 similarly.
 {code}
 ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
 {code}
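 For illustration only, a rough sketch of how such a pass-through might be
 wired up (the helper below is hypothetical, not existing spark-ec2 code; the
 parsed key=value pairs would be forwarded as keyword arguments to boto's
 {{run_instances}}):
 {code}
 import ast

 def parse_extra_options(option_values):
     # ['ebs_optimized=True', 'instance_initiated_shutdown_behavior=terminate']
     # -> {'ebs_optimized': True, 'instance_initiated_shutdown_behavior': 'terminate'}
     extra = {}
     for item in option_values or []:
         key, _, value = item.partition('=')
         try:
             value = ast.literal_eval(value)  # turn 'True' / '10' into Python values
         except (ValueError, SyntaxError):
             pass  # leave plain strings such as 'terminate' as-is
         extra[key.strip()] = value
     return extra

 # extra_opts = parse_extra_options(opts.ec2_instance_option)
 # conn.run_instances(ami_id, instance_type=opts.instance_type, **extra_opts)
 {code}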



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8576) Add spark-ec2 options to assign launched instances into IAM roles and to set instance-initiated shutdown behavior

2015-06-23 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-8576:

Summary: Add spark-ec2 options to assign launched instances into IAM roles 
and to set instance-initiated shutdown behavior  (was: Add spark-ec2 options to 
assigned launched instances into IAM roles and to set instance-initiated 
shutdown behavior)

 Add spark-ec2 options to assign launched instances into IAM roles and to set 
 instance-initiated shutdown behavior
 -

 Key: SPARK-8576
 URL: https://issues.apache.org/jira/browse/SPARK-8576
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are 2 EC2 options that would be useful to add.
 * One is the ability to assign IAM roles to launched instances.
 * The other is the ability to configure instances to self-terminate when they 
 initiate a shutdown.
 Both of these options are useful when spark-ec2 is being used as part of an 
 automated pipeline and the engineers want to minimize the need to pass around 
 AWS keys for access (replaced by IAM role) and to be able to launch a cluster 
 that can terminate itself cleanly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-8576) Add spark-ec2 options to assigned launched instances into IAM roles and to set instance-initiated shutdown behavior

2015-06-23 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-8576:
---

 Summary: Add spark-ec2 options to assigned launched instances into 
IAM roles and to set instance-initiated shutdown behavior
 Key: SPARK-8576
 URL: https://issues.apache.org/jira/browse/SPARK-8576
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor


There are 2 EC2 options that would be useful to add.

* One is the ability to assign IAM roles to launched instances.
* The other is the ability to configure instances to self-terminate when they 
initiate a shutdown.

Both of these options are useful when spark-ec2 is being used as part of an 
automated pipeline and the engineers want to minimize the need to pass around 
AWS keys for access (replaced by IAM role) and to be able to launch a cluster 
that can terminate itself cleanly.
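
For illustration, a rough sketch of how these two options could map onto boto 
2 calls (this is not the actual spark_ec2.py code; the region, AMI, key pair, 
and role names below are placeholders):

{code}
import boto.ec2

conn = boto.ec2.connect_to_region('us-east-1')
reservation = conn.run_instances(
    'ami-xxxxxxxx',
    key_name='my-key-pair',
    instance_type='m3.large',
    # Proposed IAM option: attach an instance profile so the cluster can reach
    # AWS services (e.g. S3) without passing around credentials.
    instance_profile_name='my-spark-cluster-role',
    # Proposed shutdown option: a shutdown issued from inside the instance
    # terminates it, so an automated pipeline can clean up after itself.
    instance_initiated_shutdown_behavior='terminate')
{code}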



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Stats on targets for 1.5.0

2015-06-19 Thread Nicholas Chammas
 I think it would be fantastic if this work was burned down before
adding big new chunks of work. The stat is worth keeping an eye on.

+1, keeping in mind that burning down work also means just targeting it for
a different release or closing it. :)

Nick


On Fri, Jun 19, 2015 at 3:18 PM Sean Owen so...@cloudera.com wrote:

 Quick point of reference for 1.5.0: 226 issues are Fixed for 1.5.0,
 and 388 are Targeted for 1.5.0. So maybe 36% of things to be done for
 1.5.0 are complete, and we're in theory 3 of 8 weeks into the merge
 window, or 37.5%.

 That's nicely on track! assuming, of course, that nothing else is
 targeted for 1.5.0. History suggests that a lot more will be, since a
 minor release has more usually had 1000+ JIRAs. However lots of
 forward-looking JIRAs have been filed, so it may be that most planned
 work is on the books already this time around.

 I think it would be fantastic if this work was burned down before
 adding big new chunks of work. The stat is worth keeping an eye on.

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




[jira] [Commented] (SPARK-8417) spark-class has illegal statement

2015-06-18 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14591988#comment-14591988
 ] 

Nicholas Chammas commented on SPARK-8417:
-

I'm not sure what I'm looking at. Can the original poster link to the lines 
directly on GitHub so we can see them in context?

 spark-class has illegal statement
 -

 Key: SPARK-8417
 URL: https://issues.apache.org/jira/browse/SPARK-8417
 Project: Spark
  Issue Type: Bug
  Components: Deploy
Affects Versions: 1.4.0
Reporter: jweinste

 spark-class
 There is an illegal statement.
 done < <("$RUNNER" -cp "$LAUNCH_CLASSPATH" org.apache.spark.launcher.Main "$@")
 Complaint is
 ./bin/spark-class: line 100: syntax error near unexpected token `<'



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8429) Add ability to set additional tags

2015-06-18 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8429?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14592010#comment-14592010
 ] 

Nicholas Chammas commented on SPARK-8429:
-

What is your use case for this feature?

cc [~joshrosen] [~shivaram]

 Add ability to set additional tags
 --

 Key: SPARK-8429
 URL: https://issues.apache.org/jira/browse/SPARK-8429
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.4.0
Reporter: Stefano Parmesan
Priority: Minor

 Currently it is not possible to add custom tags to the cluster instances; 
 tags are quite useful for many things, and it should be pretty 
 straightforward to add an extra parameter to support this.
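 A rough sketch of what the extra parameter could look like (the
 {{--additional-tags}} flag name and the {{k1:v1,k2:v2}} format are
 assumptions, not existing spark-ec2 behavior; {{add_tags}} is a boto 2 call,
 the plural form of the {{add_tag}} call spark-ec2 already uses when naming
 instances):
 {code}
 def parse_tags(tag_spec):
     # 'Team:data,Env:staging' -> {'Team': 'data', 'Env': 'staging'}
     if not tag_spec:
         return {}
     return dict(pair.split(':', 1) for pair in tag_spec.split(','))

 def tag_instances(instances, tag_spec):
     tags = parse_tags(tag_spec)
     if not tags:
         return
     for instance in instances:
         # boto 2: Instance.add_tags takes a dict of tag key/value pairs.
         instance.add_tags(tags)
 {code}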



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Sidebar: issues targeted for 1.4.0

2015-06-18 Thread Nicholas Chammas
 Given fixed time, adding more TODOs generally means other stuff has to be
taken
out for the release. If not, then it happens de facto anyway, which is
worse than managing it on purpose.

+1 to this.

I wouldn't mind helping go through open issues on JIRA targeted for the
next release around RC time to make sure that a) nothing major is getting
missed for the release and b) the JIRA backlog gets trimmed of the cruft
which is constantly building up. It's good housekeeping.

Nick

On Thu, Jun 18, 2015 at 3:23 AM Sean Owen so...@cloudera.com wrote:

 I also like using Target Version meaningfully. It might be a little
 much to require no Target Version = X before starting an RC. I do
 think it's reasonable to not start the RC with Blockers open.

 And here we started the RC with almost 100 TODOs for 1.4.0, most of
 which did not get done. Not the end of the world, but, clearly some
 other decisions were made in the past based on the notion that most of
 those would get done. The 'targeting' is too optimistic. Given fixed
 time, adding more TODOs generally means other stuff has to be taken
 out for the release. If not, then it happens de facto anyway, which is
 worse than managing it on purpose.

 Anyway, thanks all for the attention to some cleanup. I'll wait a
 short while and then fix up the rest of them as intelligently as I
 can. Maybe I can push on this a little the next time we have a release
 cycle to see how we're doing with use of Target Version.





 On Wed, Jun 17, 2015 at 10:03 PM, Heller, Chris chel...@akamai.com
 wrote:
  I appreciate targets having the strong meaning you suggest, as it's useful
  to get a sense of what will realistically be included in a release.
 
 
  Would it make sense (speaking as a relative outsider here) that we would
  not enter into the RC phase of a release until all JIRA targeting that
  release were complete?
 
  If a JIRA targeting a release is blocking entry to the RC phase, and it's
  determined that the JIRA should not hold up the release, then it should
  get re-targeted to the next release.
 
  -Chris
 
  On 6/17/15, 3:55 PM, Patrick Wendell pwend...@gmail.com wrote:
 
 Hey Sean,
 
 Thanks for bringing this up - I went through and fixed about 10 of
 them. Unfortunately there isn't a hard and fast way to resolve them. I
 found all of the following:
 
 - Features that missed the release and needed to be retargeted to 1.5.
 - Bugs that missed the release and needed to be retargeted to 1.4.1.
 - Issues that were not properly targeted (e.g. someone randomly set
 the target version) and should probably be untargeted.
 
 I'd like to encourage others to do this, especially the more active
 developers on different components (Streaming, ML, etc).
 
 One other question is what the semantics of target version are, which
 I don't think we've defined clearly. Is it the target of the person
 contributing the feature? Or in some sense the target of the
 committership? My preference would be that targeting a JIRA has some
 strong semantics - i.e. it means the commiter targeting has mentally
 allocated time to review a patch for that feature in the timeline of
 that release. I.e. prefer to have fewer targeted JIRA's for a release,
 and also expect to get most of the targeted features merged into a
 release. In the past I think targeting has meant different things to
 different people.
 
 - Patrick
 
 On Tue, Jun 16, 2015 at 8:09 AM, Josh Rosen rosenvi...@gmail.com
 wrote:
  Whatever you do, DO NOT use the built-in JIRA 'releases' feature to
 migrate
  issues from 1.4.0 to another version: the JIRA feature will have the
  side-effect of automatically changing the target versions for issues
 that
  have been closed, which is going to be really confusing. I've made this
  mistake once myself and it was a bit of a hassle to clean up.
 
  On Tue, Jun 16, 2015 at 5:24 AM, Sean Owen so...@cloudera.com wrote:
 
  Question: what would happen if I cleared Target Version for everything
  still marked Target Version = 1.4.0? There are 76 right now, and
  clearly that's not correct.
 
  56 were opened by committers, including issues like Do X for 1.4.
  I'd like to understand whether these are resolved but just weren't
  closed, or else why so many issues are being filed as a todo and not
  resolved? Slipping things here or there is OK, but these weren't even
  slipped, just forgotten.
 
  On Sat, May 30, 2015 at 3:55 PM, Sean Owen so...@cloudera.com
 wrote:
   In an ideal world,  Target Version really is what's going to go in
 as
   far as anyone knows and when new stuff comes up, we all have to
 figure
   out what gets dropped to fit by the release date. Boring, standard
   software project management practice. I don't know how realistic
 that
   is, but, I'm wondering how people feel about this, who have filed
   these JIRAs?
  
   Concretely, should non-Critical issues for 1.4.0 be un-Targeted?
   should they all be un-Targeted after the release?
 
  

[jira] [Commented] (SPARK-6220) Allow extended EC2 options to be passed through spark-ec2

2015-06-15 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14586334#comment-14586334
 ] 

Nicholas Chammas commented on SPARK-6220:
-

 please forgive my greenness

No need. Greenness is not a crime around these parts. :)

I suggest creating a new JIRA for that specific feature. In the JIRA you can 
reference this issue here as related.

By the way, I took a look at your commit. If I understood correctly, your 
change associates launched instances with an IAM profile (allowing the launched 
cluster to, for example, access S3 without credentials), but the machine you 
are running spark-ec2 from still needs AWS keys to launch them.

That seems fine to me, but it doesn't sound exactly like what you intended from 
your comment.

 Allow extended EC2 options to be passed through spark-ec2
 -

 Key: SPARK-6220
 URL: https://issues.apache.org/jira/browse/SPARK-6220
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 There are many EC2 options exposed by the boto library that spark-ec2 uses. 
 Over time, many of these EC2 options have been bubbled up here and there to 
 become spark-ec2 options.
 Examples:
 * spot prices
 * placement groups
 * VPC, subnet, and security group assignments
 It's likely that more and more EC2 options will trickle up like this to 
 become spark-ec2 options.
 While major options are well suited to this type of promotion, we should 
 probably allow users to pass through EC2 options they want to use through 
 spark-ec2 in some generic way.
 Let's add two options:
 * {{--ec2-instance-option}} - 
 [{{boto::run}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.image.Image.run]
 * {{--ec2-spot-instance-option}} - 
 [{{boto::request_spot_instances}}|http://boto.readthedocs.org/en/latest/ref/ec2.html#boto.ec2.connection.EC2Connection.request_spot_instances]
 Each option can be specified multiple times and is simply passed directly to 
 the underlying boto call.
 For example:
 {code}
 spark-ec2 \
 ...
 --ec2-instance-option instance_initiated_shutdown_behavior=terminate \
 --ec2-instance-option ebs_optimized=True
 {code}
 I'm not sure about the exact syntax of the extended options, but something 
 like this will do the trick as long as it can be made to pass the options 
 correctly to boto in most cases.
 I followed the example of {{ssh}}, which supports multiple extended options 
 similarly.
 {code}
 ssh -o LogLevel=ERROR -o UserKnownHostsFile=/dev/null ...
 {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Remove Hadoop 1 support (Hadoop 2.2) for Spark 1.5?

2015-06-12 Thread Nicholas Chammas
I'm personally in favor, but I don't have a sense of how many people still
rely on Hadoop 1.

Nick

On Fri, Jun 12, 2015 at 9:13 AM, Steve Loughran ste...@hortonworks.com wrote:

+1 for 2.2+

 Not only are the APIs in Hadoop 2 better, there are more people testing
 Hadoop 2.x and Spark, and bugs in Hadoop itself are being fixed.

 (usual disclaimers, I work off branch-2.7 snapshots I build nightly, etc)

  On 12 Jun 2015, at 11:09, Sean Owen so...@cloudera.com wrote:
 
  How does the idea of removing support for Hadoop 1.x for Spark 1.5
  strike everyone? Really, I mean, Hadoop < 2.2, as 2.2 seems to me more
  consistent with the modern 2.x line than 2.1 or 2.0.
 
  The arguments against are simply, well, someone out there might be
  using these versions.
 
  The arguments for are just simplification -- fewer gotchas in trying
  to keep supporting older Hadoop, of which we've seen several lately.
  We get to chop out a little bit of shim code and update to use some
  non-deprecated APIs. Along with removing support for Java 6, it might
  be a reasonable time to also draw a line under older Hadoop too.
 
  I'm just gauging feeling now: for, against, indifferent?
  I favor it, but would not push hard on it if there are objections.
 
  -
  To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
  For additional commands, e-mail: dev-h...@spark.apache.org
 


 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org




[jira] [Created] (SPARK-8316) Upgrade Maven to 3.3.3

2015-06-11 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-8316:
---

 Summary: Upgrade Maven to 3.3.3
 Key: SPARK-8316
 URL: https://issues.apache.org/jira/browse/SPARK-8316
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Nicholas Chammas
Priority: Minor


Maven versions prior to 3.3 apparently have some bugs.

See: https://github.com/apache/spark/pull/6492#issuecomment-111001101



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Did the 3.4.4 docs get published early?

2015-06-11 Thread Nicholas Chammas
Sorry, somehow the formatting in my previous email didn't come through
correctly.

This part was supposed to be in a quote block:

 Also, just replacing the version number in the URL works for the python 3
series
 (use 3.X even for python 3.0), even farther back than the drop down menu
allows.

Nick

On Wed, Jun 10, 2015 at 2:25 PM Nicholas Chammas nicholas.cham...@gmail.com
wrote:

 Also, just replacing the version number in the URL works for the python 3
 series (use 3.X even for python 3.0), even farther back than the drop down
 menu allows.

 This does not help in this case:

 https://docs.python.org/3.4/library/asyncio-task.html#asyncio.ensure_future

 Also, you cannot select the docs for a maintenance release, like 3.4.3.

 Anyway, it’s not a big deal as long as significant changes are tagged
 appropriately with notes like “New in version NNN”, which they are.

 Ideally, the docs would only show the latest changes for released versions
 of Python, but since some changes (like the one I linked to) are introduced
 in maintenance versions, it’s probably hard to separate them out into
 separate branches.

 Nick
 ​

 On Wed, Jun 10, 2015 at 10:11 AM Nicholas Chammas 
 nicholas.cham...@gmail.com wrote:

 For example, here is a New in version 3.4.4 method:

 https://docs.python.org/3/library/asyncio-task.html#asyncio.ensure_future

 However, the latest release appears to be 3.4.3:

 https://www.python.org/downloads/

 Is this normal, or did the 3.4.4 docs somehow get published early by
 mistake?

 Nick


-- 
https://mail.python.org/mailman/listinfo/python-list


Did the 3.4.4 docs get published early?

2015-06-10 Thread Nicholas Chammas
For example, here is a New in version 3.4.4 method:

https://docs.python.org/3/library/asyncio-task.html#asyncio.ensure_future

However, the latest release appears to be 3.4.3:

https://www.python.org/downloads/

Is this normal, or did the 3.4.4 docs somehow get published early by
mistake?

Nick
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Did the 3.4.4 docs get published early?

2015-06-10 Thread Nicholas Chammas
Also, just replacing the version number in the URL works for the python 3
series (use 3.X even for python 3.0), even farther back than the drop down
menu allows.

This does not help in this case:

https://docs.python.org/3.4/library/asyncio-task.html#asyncio.ensure_future

Also, you cannot select the docs for a maintenance release, like 3.4.3.

Anyway, it’s not a big deal as long as significant changes are tagged
appropriately with notes like “New in version NNN”, which they are.

Ideally, the docs would only show the latest changes for released versions
of Python, but since some changes (like the one I linked to) are introduced
in maintenance versions, it’s probably hard to separate them out into
separate branches.

Nick
​

On Wed, Jun 10, 2015 at 10:11 AM Nicholas Chammas 
nicholas.cham...@gmail.com wrote:

 For example, here is a New in version 3.4.4 method:

 https://docs.python.org/3/library/asyncio-task.html#asyncio.ensure_future

 However, the latest release appears to be 3.4.3:

 https://www.python.org/downloads/

 Is this normal, or did the 3.4.4 docs somehow get published early by
 mistake?

 Nick


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Required settings for permanent HDFS Spark on EC2

2015-06-05 Thread Nicholas Chammas
If your problem is that stopping/starting the cluster resets configs, then
you may be running into this issue:

https://issues.apache.org/jira/browse/SPARK-4977

Nick

On Thu, Jun 4, 2015 at 2:46 PM barmaley o...@solver.com wrote:

  Hi - I'm having a similar problem with switching from ephemeral to persistent
  HDFS - it always looks for port 9000 regardless of the options I set for
  port 9010 persistent HDFS. Have you figured out a solution? Thanks



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Required-settings-for-permanent-HDFS-Spark-on-EC2-tp22860p23157.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




[jira] [Commented] (SPARK-5398) Support the eu-central-1 region for spark-ec2

2015-06-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573138#comment-14573138
 ] 

Nicholas Chammas commented on SPARK-5398:
-

I don't have the credentials to do that, unfortunately. Maybe [~pwendell] does, 
though.

 Support the eu-central-1 region for spark-ec2
 -

 Key: SPARK-5398
 URL: https://issues.apache.org/jira/browse/SPARK-5398
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 {{spark-ec2}} [doesn't currently 
 support|https://github.com/mesos/spark-ec2/tree/branch-1.3/ami-list] the 
 {{eu-central-1}} region.
 You can see the [full list of EC2 regions 
 here|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html].
  {{eu-central-1}} is the only one missing as of Jan 2015. ({{cn-north-1}}, 
 for some reason, is not listed there.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-5398) Support the eu-central-1 region for spark-ec2

2015-06-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14573140#comment-14573140
 ] 

Nicholas Chammas commented on SPARK-5398:
-

I don't have the credentials to do that, unfortunately. Maybe [~pwendell] does, 
though.

 Support the eu-central-1 region for spark-ec2
 -

 Key: SPARK-5398
 URL: https://issues.apache.org/jira/browse/SPARK-5398
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 {{spark-ec2}} [doesn't currently 
 support|https://github.com/mesos/spark-ec2/tree/branch-1.3/ami-list] the 
 {{eu-central-1}} region.
 You can see the [full list of EC2 regions 
 here|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html].
  {{eu-central-1}} is the only one missing as of Jan 2015. ({{cn-north-1}}, 
 for some reason, is not listed there.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Issue Comment Deleted] (SPARK-5398) Support the eu-central-1 region for spark-ec2

2015-06-04 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5398?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5398:

Comment: was deleted

(was: I don't have the credentials to do that, unfortunately. Maybe [~pwendell] 
does, though.)

 Support the eu-central-1 region for spark-ec2
 -

 Key: SPARK-5398
 URL: https://issues.apache.org/jira/browse/SPARK-5398
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Reporter: Nicholas Chammas
Priority: Minor

 {{spark-ec2}} [doesn't currently 
 support|https://github.com/mesos/spark-ec2/tree/branch-1.3/ami-list] the 
 {{eu-central-1}} region.
 You can see the [full list of EC2 regions 
 here|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/using-regions-availability-zones.html].
  {{eu-central-1}} is the only one missing as of Jan 2015. ({{cn-north-1}}, 
 for some reason, is not listed there.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7900) Reduce number of tagging calls in spark-ec2

2015-06-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14571460#comment-14571460
 ] 

Nicholas Chammas commented on SPARK-7900:
-

I'm marking this as a duplicate of [SPARK-4983].

 Reduce number of tagging calls in spark-ec2
 ---

 Key: SPARK-7900
 URL: https://issues.apache.org/jira/browse/SPARK-7900
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Nicholas Chammas
Priority: Minor

 spark-ec2 currently tags each instance with its own name:
 https://github.com/apache/spark/blob/4615081d7a10b023491e25478d19b8161e030974/ec2/spark_ec2.py#L684-L692
 Quite often, one of these tagging calls will fail:
 {code}
 Launching instances...
 Launched 10 slaves in us-west-2a, regid = r-89656e83
 Launched master in us-west-2a, regid = r-07646f0d
 Waiting for AWS to propagate instance metadata...
 Traceback (most recent call last):
   File ../spark/ec2/spark_ec2.py, line 1395, in module
 main()
   File ../spark/ec2/spark_ec2.py, line 1387, in main
 real_main()
   File ../spark/ec2/spark_ec2.py, line 1222, in real_main
 (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
   File ../spark/ec2/spark_ec2.py, line 667, in launch_cluster
 value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 80, in 
 add_tag
 self.add_tags({key: value}, dry_run)
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 97, in 
 add_tags
 dry_run=dry_run
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py, line 4202, 
 in create_tags
 return self.get_status('CreateTags', params, verb='POST')
   File /path/spark/ec2/lib/boto-2.34.0/boto/connection.py, line 1223, in 
 get_status
 raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The
 instance ID 'i-d3b72524' does not
 exist</Message></Error></Errors><RequestID>f0936ab5-4d10-46d1-a35d-cefaf8a68adc</RequestID></Response>
 {code}
 This is presumably a problem with AWS metadata taking time to become 
 available on all the servers that spark-ec2 hits as it makes the several 
 tagging calls.
 Instead of retrying the tagging calls, we should just reduce them to 2 
 calls--one for the master, one for the slaves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-7900) Reduce number of tagging calls in spark-ec2

2015-06-03 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7900?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas resolved SPARK-7900.
-
Resolution: Duplicate

 Reduce number of tagging calls in spark-ec2
 ---

 Key: SPARK-7900
 URL: https://issues.apache.org/jira/browse/SPARK-7900
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Nicholas Chammas
Priority: Minor

 spark-ec2 currently tags each instance with its own name:
 https://github.com/apache/spark/blob/4615081d7a10b023491e25478d19b8161e030974/ec2/spark_ec2.py#L684-L692
 Quite often, one of these tagging calls will fail:
 {code}
 Launching instances...
 Launched 10 slaves in us-west-2a, regid = r-89656e83
 Launched master in us-west-2a, regid = r-07646f0d
 Waiting for AWS to propagate instance metadata...
 Traceback (most recent call last):
   File ../spark/ec2/spark_ec2.py, line 1395, in module
 main()
   File ../spark/ec2/spark_ec2.py, line 1387, in main
 real_main()
   File ../spark/ec2/spark_ec2.py, line 1222, in real_main
 (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
   File ../spark/ec2/spark_ec2.py, line 667, in launch_cluster
 value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 80, in 
 add_tag
 self.add_tags({key: value}, dry_run)
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 97, in 
 add_tags
 dry_run=dry_run
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py, line 4202, 
 in create_tags
 return self.get_status('CreateTags', params, verb='POST')
   File /path/spark/ec2/lib/boto-2.34.0/boto/connection.py, line 1223, in 
 get_status
 raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The
 instance ID 'i-d3b72524' does not
 exist</Message></Error></Errors><RequestID>f0936ab5-4d10-46d1-a35d-cefaf8a68adc</RequestID></Response>
 {code}
 This is presumably a problem with AWS metadata taking time to become 
 available on all the servers that spark-ec2 hits as it makes the several 
 tagging calls.
 Instead of retrying the tagging calls, we should just reduce them to 2 
 calls--one for the master, one for the slaves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4983) Add sleep() before tagging EC2 instances to allow instance metadata to propagate

2015-06-03 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4983?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14571467#comment-14571467
 ] 

Nicholas Chammas commented on SPARK-4983:
-

Per the discussion on [SPARK-7900], I think we should increase the wait time 
from the current 5 seconds to, say, 15 or 30 seconds.

An alternative proposed on [SPARK-7900] is to make fewer tagging calls, since 
the extra calls seem to make it more likely that we get metadata errors from AWS 
(like, instance ID not found right after AWS itself has given us the instance 
ID).
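
For concreteness, here is a minimal sketch of the wait-and-retry variant (not
the actual spark-ec2 patch); it assumes boto 2's {{EC2ResponseError}} exposes
the error code, and the attempt/interval numbers are illustrative:

{code}
import time
from boto.exception import EC2ResponseError

def tag_with_retries(instance, tags, attempts=6, wait_seconds=5):
    # Retry while EC2 still reports the just-launched instance as unknown,
    # i.e. the InvalidInstanceID.NotFound error shown in the traceback above.
    for _ in range(attempts):
        try:
            instance.add_tags(tags)
            return
        except EC2ResponseError as e:
            if e.error_code != 'InvalidInstanceID.NotFound':
                raise
            time.sleep(wait_seconds)  # 6 attempts x 5 s ~= the proposed 30 s
    raise Exception("Gave up tagging instance %s" % instance.id)
{code}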

 Add sleep() before tagging EC2 instances to allow instance metadata to 
 propagate
 

 Key: SPARK-4983
 URL: https://issues.apache.org/jira/browse/SPARK-4983
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.2.0
Reporter: Nicholas Chammas
Assignee: Gen TANG
Priority: Minor
  Labels: starter
 Fix For: 1.2.2, 1.3.0


 We launch EC2 instances in {{spark-ec2}} and then immediately tag them in a 
 separate boto call. Sometimes, EC2 doesn't get enough time to propagate 
 information about the just-launched instances, so when we go to tag them we 
 get a server that doesn't know about them yet.
 This yields the following type of error:
 {code}
 Launching instances...
 Launched 1 slaves in us-east-1b, regid = r-cf780321
 Launched master in us-east-1b, regid = r-da7e0534
 Traceback (most recent call last):
   File ./ec2/spark_ec2.py, line 1284, in module
 main()
   File ./ec2/spark_ec2.py, line 1276, in main
 real_main()
   File ./ec2/spark_ec2.py, line 1122, in real_main
 (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
   File ./ec2/spark_ec2.py, line 646, in launch_cluster
 value='{cn}-master-{iid}'.format(cn=cluster_name, iid=master.id))
   File .../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 80, in 
 add_tag
 self.add_tags({key: value}, dry_run)
   File .../spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 97, in 
 add_tags
 dry_run=dry_run
   File .../spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py, line 4202, in 
 create_tags
 return self.get_status('CreateTags', params, verb='POST')
   File .../spark/ec2/lib/boto-2.34.0/boto/connection.py, line 1223, in 
 get_status
 raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The
 instance ID 'i-585219a6' does not
 exist</Message></Error></Errors><RequestID>b9f1ad6e-59b9-47fd-a693-527be1f779eb</RequestID></Response>
 {code}
 The solution is to tag the instances in the same call that launches them, or 
 less desirably, tag the instances after some short wait.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master

2015-05-31 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5189:

Description: 
As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
then setting up all the slaves together. This includes broadcasting files from 
the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using 
[{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
(e.g. during [ephemeral-hdfs 
init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
 or during [Spark 
setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
 takes a long time. This time increases as the number of slaves increases.
 I did some testing in {{us-east-1}}. This is, concretely, what the problem 
looks like:
 || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
 Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but 
I think the point is clear enough.
 We can extrapolate from this data that *every additional slave adds roughly 35 
seconds to the launch time* (so a cluster with 100 slaves would take 1h 6m 5s 
to launch).
# It's more complicated to add slaves to an existing cluster (a la 
[SPARK-2008]), since slaves are only configured through the master during the 
setup of the master itself.

Logically, the operations we want to implement are:

* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a 
slave
* Remove a node from a cluster

We need our scripts to roughly be organized to match the above operations. The 
goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in 
parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like 
[Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
internals and perhaps even allow us to build [one tool that launches Spark 
clusters on several different cloud 
platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].

More concretely, the modifications we need to make are:
* Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
equivalent, slave-side operations.
* Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it 
fully creates a node that can be used as either a master or slave.
* Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
configures it as a master or slave, and joins it to a cluster.
* Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
that script.
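
As a quick sanity check of the per-slave slope quoted above (plain Python;
the numbers are taken straight from the launch-time table):

{code}
# Best-of-6 launch times from the table above, in seconds.
times = {1: 8 * 60 + 44, 99: 65 * 60 + 30}
slope = (times[99] - times[1]) / float(99 - 1)   # ~34.8 s per additional slave
estimate_100 = times[1] + slope * (100 - 1)      # ~3965 s, i.e. about 1h 6m 5s
print(round(slope, 1), round(estimate_100))
{code}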

  was:
As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
then setting up all the slaves together. This includes broadcasting files from 
the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using 
[{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
(e.g. during [ephemeral-hdfs 
init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
 or during [Spark 
setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
 takes a long time. This time increases as the number of slaves increases.
 I did some testing in {{us-east-1}}. This is, concretely, what the problem 
looks like:
 || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
 Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but 
I think the point is clear enough.
 We can extrapolate from this data that *every additional slave adds roughly 35 
seconds to the launch time*.
# It's more complicated to add slaves to an existing cluster (a la 
[SPARK-2008]), since slaves are only configured through the master during the 
setup of the master itself.

Logically, the operations we want to implement are:

* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a 
slave
* Remove a node from a cluster

We need our scripts to roughly be organized to match the above operations. The 
goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in 
parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure

[jira] [Updated] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master

2015-05-31 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-5189:

Description: 
As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
then setting up all the slaves together. This includes broadcasting files from 
the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using 
[{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
(e.g. during [ephemeral-hdfs 
init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
 or during [Spark 
setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
 takes a long time. This time increases as the number of slaves increases.
 I did some testing in {{us-east-1}}. This is, concretely, what the problem 
looks like:
 || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
 Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but 
I think the point is clear enough.
 We can extrapolate from this data that *every additional slave adds roughly 35 
seconds to the launch time*.
# It's more complicated to add slaves to an existing cluster (a la 
[SPARK-2008]), since slaves are only configured through the master during the 
setup of the master itself.

Logically, the operations we want to implement are:

* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a 
slave
* Remove a node from a cluster

We need our scripts to roughly be organized to match the above operations. The 
goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in 
parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like 
[Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
internals and perhaps even allow us to build [one tool that launches Spark 
clusters on several different cloud 
platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].

More concretely, the modifications we need to make are:
* Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
equivalent, slave-side operations.
* Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure it 
fully creates a node that can be used as either a master or slave.
* Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
configures it as a master or slave, and joins it to a cluster.
* Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
that script.

  was:
As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
then setting up all the slaves together. This includes broadcasting files from 
the lonely master to potentially hundreds of slaves.

There are 2 main problems with this approach:
# Broadcasting files from the master to all slaves using 
[{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
(e.g. during [ephemeral-hdfs 
init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
 or during [Spark 
setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
 takes a long time. This time increases as the number of slaves increases.
 I did some testing in {{us-east-1}}. This is, concretely, what the problem 
looks like:
 || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
| 1 | 8m 44s |
| 10 | 13m 45s |
| 25 | 22m 50s |
| 50 | 37m 30s |
| 75 | 51m 30s |
| 99 | 1h 5m 30s |
 Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, but 
I think the point is clear enough.
# It's more complicated to add slaves to an existing cluster (a la 
[SPARK-2008]), since slaves are only configured through the master during the 
setup of the master itself.

Logically, the operations we want to implement are:

* Provision a Spark node
* Join a node to a cluster (including an empty cluster) as either a master or a 
slave
* Remove a node from a cluster

We need our scripts to roughly be organized to match the above operations. The 
goals would be:
# When launching a cluster, enable all cluster nodes to be provisioned in 
parallel, removing the master-to-slave file broadcast bottleneck.
# Facilitate cluster modifications like adding or removing nodes.
# Enable exploration of infrastructure tools like 
[Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
internals and perhaps even allow us to build [one tool that launches Spark 
clusters

[jira] [Commented] (SPARK-7900) Reduce number of tagging calls in spark-ec2

2015-05-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563208#comment-14563208
 ] 

Nicholas Chammas commented on SPARK-7900:
-

The name tags are optional, but we can tag them all at once with a single call. 
The only downside is that they get the same name: something like 
{{cluster-name-slave}}.

These names are provided just so that they show up on the EC2 web console. They 
serve no other purpose.
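
A minimal sketch of the single-call-per-role approach, using boto 2's batched
{{create_tags}}; the function and variable names are illustrative, not the
actual spark_ec2.py code:

{code}
def tag_cluster(conn, cluster_name, master_nodes, slave_nodes):
    # One CreateTags request for the master(s), one for all the slaves.
    conn.create_tags(
        [i.id for i in master_nodes],
        {'Name': '{cn}-master'.format(cn=cluster_name)})
    conn.create_tags(
        [i.id for i in slave_nodes],
        {'Name': '{cn}-slave'.format(cn=cluster_name)})
{code}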

 Reduce number of tagging calls in spark-ec2
 ---

 Key: SPARK-7900
 URL: https://issues.apache.org/jira/browse/SPARK-7900
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Nicholas Chammas
Priority: Minor

 spark-ec2 currently tags each instance with its own name:
 https://github.com/apache/spark/blob/4615081d7a10b023491e25478d19b8161e030974/ec2/spark_ec2.py#L684-L692
 Quite often, one of these tagging calls will fail:
 {code}
 Launching instances...
 Launched 10 slaves in us-west-2a, regid = r-89656e83
 Launched master in us-west-2a, regid = r-07646f0d
 Waiting for AWS to propagate instance metadata...
 Traceback (most recent call last):
   File ../spark/ec2/spark_ec2.py, line 1395, in module
 main()
   File ../spark/ec2/spark_ec2.py, line 1387, in main
 real_main()
   File ../spark/ec2/spark_ec2.py, line 1222, in real_main
 (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
   File ../spark/ec2/spark_ec2.py, line 667, in launch_cluster
 value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 80, in 
 add_tag
 self.add_tags({key: value}, dry_run)
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 97, in 
 add_tags
 dry_run=dry_run
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py, line 4202, 
 in create_tags
 return self.get_status('CreateTags', params, verb='POST')
   File /path/spark/ec2/lib/boto-2.34.0/boto/connection.py, line 1223, in 
 get_status
 raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The
 instance ID 'i-d3b72524' does not
 exist</Message></Error></Errors><RequestID>f0936ab5-4d10-46d1-a35d-cefaf8a68adc</RequestID></Response>
 {code}
 This is presumably a problem with AWS metadata taking time to become 
 available on all the servers that spark-ec2 hits as it makes the several 
 tagging calls.
 Instead of retrying the tagging calls, we should just reduce them to 2 
 calls--one for the master, one for the slaves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7900) Reduce number of tagging calls in spark-ec2

2015-05-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7900?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14563053#comment-14563053
 ] 

Nicholas Chammas commented on SPARK-7900:
-

An alternative approach would be to just [increase the wait 
time|https://github.com/apache/spark/blob/e838a25bdb5603ef05e779225704c972ce436145/ec2/spark_ec2.py#L681-L683]
 to 30 or 60 seconds before trying to tag instances.

[~shivaram] / [~joshrosen]: Any preference on an approach?

 Reduce number of tagging calls in spark-ec2
 ---

 Key: SPARK-7900
 URL: https://issues.apache.org/jira/browse/SPARK-7900
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Nicholas Chammas
Priority: Minor

 spark-ec2 currently tags each instance with its own name:
 https://github.com/apache/spark/blob/4615081d7a10b023491e25478d19b8161e030974/ec2/spark_ec2.py#L684-L692
 Quite often, one of these tagging calls will fail:
 {code}
 Launching instances...
 Launched 10 slaves in us-west-2a, regid = r-89656e83
 Launched master in us-west-2a, regid = r-07646f0d
 Waiting for AWS to propagate instance metadata...
 Traceback (most recent call last):
   File ../spark/ec2/spark_ec2.py, line 1395, in module
 main()
   File ../spark/ec2/spark_ec2.py, line 1387, in main
 real_main()
   File ../spark/ec2/spark_ec2.py, line 1222, in real_main
 (master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
   File ../spark/ec2/spark_ec2.py, line 667, in launch_cluster
 value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 80, in 
 add_tag
 self.add_tags({key: value}, dry_run)
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 97, in 
 add_tags
 dry_run=dry_run
   File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py, line 4202, 
 in create_tags
 return self.get_status('CreateTags', params, verb='POST')
   File /path/spark/ec2/lib/boto-2.34.0/boto/connection.py, line 1223, in 
 get_status
 raise self.ResponseError(response.status, response.reason, body)
 boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
 <?xml version="1.0" encoding="UTF-8"?>
 <Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The
 instance ID 'i-d3b72524' does not
 exist</Message></Error></Errors><RequestID>f0936ab5-4d10-46d1-a35d-cefaf8a68adc</RequestID></Response>
 {code}
 This is presumably a problem with AWS metadata taking time to become 
 available on all the servers that spark-ec2 hits as it makes the several 
 tagging calls.
 Instead of retrying the tagging calls, we should just reduce them to 2 
 calls--one for the master, one for the slaves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7900) Reduce number of tagging calls in spark-ec2

2015-05-27 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-7900:
---

 Summary: Reduce number of tagging calls in spark-ec2
 Key: SPARK-7900
 URL: https://issues.apache.org/jira/browse/SPARK-7900
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.4.0
Reporter: Nicholas Chammas
Priority: Minor


spark-ec2 currently tags each instance with its own name:

https://github.com/apache/spark/blob/4615081d7a10b023491e25478d19b8161e030974/ec2/spark_ec2.py#L684-L692

Quite often, one of these tagging calls will fail:

{code}
Launching instances...
Launched 10 slaves in us-west-2a, regid = r-89656e83
Launched master in us-west-2a, regid = r-07646f0d
Waiting for AWS to propagate instance metadata...
Traceback (most recent call last):
  File ../spark/ec2/spark_ec2.py, line 1395, in module
main()
  File ../spark/ec2/spark_ec2.py, line 1387, in main
real_main()
  File ../spark/ec2/spark_ec2.py, line 1222, in real_main
(master_nodes, slave_nodes) = launch_cluster(conn, opts, cluster_name)
  File ../spark/ec2/spark_ec2.py, line 667, in launch_cluster
value='{cn}-slave-{iid}'.format(cn=cluster_name, iid=slave.id))
  File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 80, in 
add_tag
self.add_tags({key: value}, dry_run)
  File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/ec2object.py, line 97, in 
add_tags
dry_run=dry_run
  File /path/spark/ec2/lib/boto-2.34.0/boto/ec2/connection.py, line 4202, 
in create_tags
return self.get_status('CreateTags', params, verb='POST')
  File /path/spark/ec2/lib/boto-2.34.0/boto/connection.py, line 1223, in 
get_status
raise self.ResponseError(response.status, response.reason, body)
boto.exception.EC2ResponseError: EC2ResponseError: 400 Bad Request
<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidInstanceID.NotFound</Code><Message>The
instance ID 'i-d3b72524' does not
exist</Message></Error></Errors><RequestID>f0936ab5-4d10-46d1-a35d-cefaf8a68adc</RequestID></Response>
{code}

This is presumably a problem with AWS metadata taking time to become available 
on all the servers that spark-ec2 hits as it makes the several tagging calls.

Instead of retrying the tagging calls, we should just reduce them to 2 
calls--one for the master, one for the slaves.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.

2015-05-22 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14556860#comment-14556860
 ] 

Nicholas Chammas commented on SPARK-7505:
-

cc [~davies] - I think the most pressing change on this list is marking the 
Python DataFrame API as experimental.

As far as I can tell, that's not the case currently.

 Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, 
 etc.
 

 Key: SPARK-7505
 URL: https://issues.apache.org/jira/browse/SPARK-7505
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, PySpark, SQL
Affects Versions: 1.3.1
Reporter: Nicholas Chammas
Priority: Minor

 The PySpark docs for DataFrame need the following fixes and improvements:
 # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over 
 {{\_\_getattr\_\_}} and change all our examples accordingly.
 # *We should say clearly that the API is experimental.* (That is currently 
 not the case for the PySpark docs.)
 # We should provide an example of how to join and select from 2 DataFrames 
 that have identically named columns, because it is not obvious:
   {code}
  >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
  >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
  >>> df12 = df1.join(df2, df1['a'] == df2['a'])
  >>> df12.select(df1['a'], df2['other']).show()
  a other
  4 I dunno
  {code}
 # 
 [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy]
  and 
 [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort]
  should be marked as aliases if that's what they are.
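
Regarding point 1 above, a small illustration of why {{\_\_getitem\_\_}} is the
safer habit (assumes a {{sqlContext}} as in the other examples; the column name
is chosen to collide with a DataFrame method on purpose):

{code}
>>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "a", "count": 7}']))
>>> df.count        # resolves to the DataFrame.count method, not the column
>>> df['count']     # unambiguously the Column named "count"
>>> df.select(df['count']).show()
{code}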



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7507) pyspark.sql.types.StructType and Row should implement __iter__()

2015-05-21 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14555482#comment-14555482
 ] 

Nicholas Chammas commented on SPARK-7507:
-

Since {{Row}} seems most analogous to a {{namedtuple}} in Python, here is an 
interesting parallel that suggests we should perhaps instead support 
{{vars(Row)}} and not {{dict(Row)}}.

http://stackoverflow.com/q/26180528/877069
https://docs.python.org/3/library/functions.html#vars
https://docs.python.org/3/library/collections.html#collections.somenamedtuple._asdict

{quote}
somenamedtuple._asdict()

Return a new OrderedDict which maps field names to their corresponding values.

Note, this method is no longer needed now that the same effect can be achieved 
by using the built-in vars() function:
{quote}
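
To make the parallel concrete, a plain-Python sketch (no Spark needed; the
vars() behaviour described is that of the Python 3.4/3.5 docs linked above):

{code}
from collections import namedtuple

Row_ = namedtuple('Row_', ['name', 'age'])   # stand-in for pyspark.sql.Row
r = Row_(name='Alice', age=1)

r._asdict()   # maps field names to values, much like Row.asDict()
# On Python 3.4/3.5, vars(r) returned the same mapping (see the quote above).
# dict(r) raises TypeError, because iterating a namedtuple yields bare values
# rather than (key, value) pairs -- the same gap being discussed for Row.
{code}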

 pyspark.sql.types.StructType and Row should implement __iter__()
 

 Key: SPARK-7507
 URL: https://issues.apache.org/jira/browse/SPARK-7507
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor

 {{StructType}} looks an awful lot like a Python dictionary.
 However, it doesn't implement {{\_\_iter\_\_()}}, so doing a quick conversion 
 like this doesn't work:
 {code}
  >>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
  >>> df.schema
  StructType(List(StructField(name,StringType,true)))
  >>> dict(df.schema)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: 'StructType' object is not iterable
 {code}
 This would be super helpful for doing any custom schema manipulations without 
 having to go through the whole {{.json() -> json.loads() -> manipulate() ->
 json.dumps() -> .fromJson()}} charade.
 Same goes for {{Row}}, which offers an 
 [{{asDict()}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict]
  method but doesn't support the more Pythonic {{dict(Row)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7507) pyspark.sql.types.StructType and Row should implement __iter__()

2015-05-21 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7507?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14554550#comment-14554550
 ] 

Nicholas Chammas commented on SPARK-7507:
-

Related: A Stack Overflow question about iterating over {{Row}} in the Scala 
API: http://stackoverflow.com/q/30353705/877069

 pyspark.sql.types.StructType and Row should implement __iter__()
 

 Key: SPARK-7507
 URL: https://issues.apache.org/jira/browse/SPARK-7507
 Project: Spark
  Issue Type: Sub-task
  Components: PySpark, SQL
Reporter: Nicholas Chammas
Priority: Minor

 {{StructType}} looks an awful lot like a Python dictionary.
 However, it doesn't implement {{\_\_iter\_\_()}}, so doing a quick conversion 
 like this doesn't work:
 {code}
  >>> df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
  >>> df.schema
  StructType(List(StructField(name,StringType,true)))
  >>> dict(df.schema)
  Traceback (most recent call last):
    File "<stdin>", line 1, in <module>
  TypeError: 'StructType' object is not iterable
 {code}
 This would be super helpful for doing any custom schema manipulations without 
 having to go through the whole {{.json() -> json.loads() -> manipulate() ->
 json.dumps() -> .fromJson()}} charade.
 Same goes for {{Row}}, which offers an 
 [{{asDict()}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.Row.asDict]
  method but doesn't support the more Pythonic {{dict(Row)}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Wish for 1.4: upper bound on # tasks in Mesos

2015-05-20 Thread Nicholas Chammas
To put this on the devs' radar, I suggest creating a JIRA for it (and
checking first if one already exists).

issues.apache.org/jira/

Nick
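
For reference, capping an application under the coarse-grained Mesos scheduler
mentioned in the quoted thread below looks roughly like this (the master URL
and core count are placeholders):

{code}
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        .setMaster("mesos://zk://zk1:2181/mesos")   # placeholder master URL
        .setAppName("capped-app")
        .set("spark.cores.max", "32"))   # total-cores cap, example value
sc = SparkContext(conf=conf)
{code}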

On Tue, May 19, 2015 at 1:34 PM Matei Zaharia matei.zaha...@gmail.com
wrote:

 Yeah, this definitely seems useful there. There might also be some ways to
 cap the application in Mesos, but I'm not sure.

 Matei

 On May 19, 2015, at 1:11 PM, Thomas Dudziak tom...@gmail.com wrote:

 I'm using fine-grained for a multi-tenant environment which is why I would
 welcome the limit of tasks per job :)

 cheers,
 Tom

 On Tue, May 19, 2015 at 10:05 AM, Matei Zaharia matei.zaha...@gmail.com
 wrote:

 Hey Tom,

 Are you using the fine-grained or coarse-grained scheduler? For the
 coarse-grained scheduler, there is a spark.cores.max config setting that
 will limit the total # of cores it grabs. This was there in earlier
 versions too.

 Matei

  On May 19, 2015, at 12:39 PM, Thomas Dudziak tom...@gmail.com wrote:
 
  I read the other day that there will be a fair number of improvements
 in 1.4 for Mesos. Could I ask for one more (if it isn't already in there):
 a configurable limit on the number of tasks for jobs run on Mesos? This
 would be a very simple yet effective way to prevent a job from dominating
 the cluster.
 
  cheers,
  Tom
 






[jira] [Commented] (SPARK-7640) Private VPC with default Spark AMI breaks yum

2015-05-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14551795#comment-14551795
 ] 

Nicholas Chammas commented on SPARK-7640:
-

[~brdwrd] - According to [this doc on Amazon 
Linux|http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/AmazonLinuxAMIBasics.html#security-updates]:

{quote}
Important
If your instance is running in a virtual private cloud (VPC), you must attach 
an Internet Gateway to the VPC in order to contact the yum repository. For more 
information, see Internet Gateways in the Amazon VPC User Guide.
{quote}

Does this not describe your situation? If it does, then isn't this simply an 
issue with Amazon Linux in general, as opposed to the specific Spark AMI we are 
using?

As explained earlier, if this request boils down to "change the Linux 
distribution that spark-ec2 uses", then we probably have to resolve this as 
"Won't Fix" for now.
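
For anyone hitting this, a rough boto 2 sketch of the Internet Gateway route
that the Amazon Linux doc above calls for (untested here; the VPC ID is a
placeholder, and a NAT setup is the alternative if the instances should not be
publicly routable):

{code}
import boto.vpc

vpc_id = 'vpc-xxxxxxxx'  # placeholder
conn = boto.vpc.connect_to_region('us-east-1')

igw = conn.create_internet_gateway()
conn.attach_internet_gateway(igw.id, vpc_id)

# Route outbound traffic through the new gateway on the VPC's route table(s)
# so the Amazon yum repos stop returning 403.
for rt in conn.get_all_route_tables(filters={'vpc-id': vpc_id}):
    conn.create_route(rt.id, '0.0.0.0/0', gateway_id=igw.id)
{code}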

 Private VPC with default Spark AMI breaks yum
 -

 Key: SPARK-7640
 URL: https://issues.apache.org/jira/browse/SPARK-7640
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0, 1.3.1
Reporter: Brad Willard
Priority: Minor

 If you create a spark cluster in a private vpc, the amazon yum repos return 
 403 permission denied because Amazon cannot discern the vms are in their 
 datacenter. This makes it incredibly annoying to install things like python 
 2.7 and different compression libs or consider updating anything.
 Potential fixes:
 Add fedora yum repos on the default ami to ones outside of amazon. 
 Change the ami to be based on a non amazon ami, like a standard red-hat one.
 Switch everything to support ec2-user like most modern aws amis to make it 
 easier for the user to pick an ami
 Petition amazon to open up their repos.
 Failed Workaround:
 I attempted to use a normal red-hat ami, however the current deploy scripts 
 assume the user and the install path are root. While the deploy script allows 
 you to override the user, they don't work if you set ec2-user basically 
 preventing you from using any current ami other than the default amazon one 
 which is unfortunate.
 So normally this would work, but because amazon 403s you get this if you want 
 to use python 2.7
 $ yum install -y python27.x86_64 python27-devel.x86_64 python27-pip.noarch
 Loaded plugins: priorities, security, update-motd, upgrade-helper
 http://packages.us-east-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.us-west-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.us-west-2.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.eu-west-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.ap-southeast-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.ap-northeast-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.sa-east-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.ap-southeast-2.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7640) Private VPC with default Spark AMI breaks yum

2015-05-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544169#comment-14544169
 ] 

Nicholas Chammas commented on SPARK-7640:
-

{quote}
Switch everything to support ec2-user like most modern aws amis to make it 
easier for the user to pick an ami
{quote}

Just a side note that switching to {{ec2-user}} alone is not enough to make it 
easy for users to pick their own AMI, since the spark-ec2 AMI includes a bunch 
of non-standard stuff (related: SPARK-3821).

Going back to the original issue, I don't think there is an easy way out here 
if it really requires changing the base AMI or user.

We were planning on updating the AMIs to include Python 2.7 by default 
(SPARK-922), but that effort has stalled. 

 Private VPC with default Spark AMI breaks yum
 -

 Key: SPARK-7640
 URL: https://issues.apache.org/jira/browse/SPARK-7640
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0, 1.3.1
Reporter: Brad Willard
Priority: Minor

 If you create a spark cluster in a private vpc, the amazon yum repos return 
 403 permission denied because Amazon cannot discern the vms are in their 
 datacenter. This makes it incredibly annoying to install things like python 
 2.7 and different compression libs or consider updating anything.
 Potential fixes:
 Add fedora yum repos on the default ami to ones outside of amazon. 
 Change the ami to be based on a non amazon ami, like a standard red-hat one.
 Switch everything to support ec2-user like most modern aws amis to make it 
 easier for the user to pick an ami
 Petition amazon to open up their repos.
 Failed Workaround:
 I attempted to use a normal red-hat ami, however the current deploy scripts 
 assume the user and the install path are root. While the deploy script allows 
 you to override the user, they don't work if you set ec2-user basically 
 preventing you from using any current ami other than the default amazon one 
 which is unfortunate.
 So normally this would work, but because amazon 403s you get this if you want 
 to use python 2.7
 $ yum install -y python27.x86_64 python27-devel.x86_64 python27-pip.noarch
 Loaded plugins: priorities, security, update-motd, upgrade-helper
 http://packages.us-east-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.us-west-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.us-west-2.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.eu-west-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.ap-southeast-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.ap-northeast-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.sa-east-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.ap-southeast-2.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7640) Private VPC with default Spark AMI breaks yum

2015-05-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7640?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14544288#comment-14544288
 ] 

Nicholas Chammas commented on SPARK-7640:
-

If there is no way around this (i.e. if Amazon doesn't consider this a problem 
with their distro), then I have to be honest that this is unlikely to ever be 
fixed in spark-ec2.

It's just too much work right now to change the base AMI to support a 
relatively minor use case. Only if a motivated contributor (i.e. someone 
affected by this) who also knows their way around spark-ec2 steps in to do the 
work will this be fixed.

 Private VPC with default Spark AMI breaks yum
 -

 Key: SPARK-7640
 URL: https://issues.apache.org/jira/browse/SPARK-7640
 Project: Spark
  Issue Type: Improvement
  Components: EC2
Affects Versions: 1.3.0, 1.3.1
Reporter: Brad Willard
Priority: Minor

 If you create a spark cluster in a private vpc, the amazon yum repos return 
 403 permission denied because Amazon cannot discern the vms are in their 
 datacenter. This makes it incredibly annoying to install things like python 
 2.7 and different compression libs or consider updating anything.
 Potential fixes:
 Add fedora yum repos on the default ami to ones outside of amazon. 
 Change the ami to be based on a non amazon ami, like a standard red-hat one.
 Switch everything to support ec2-user like most modern aws amis to make it 
 easier for the user to pick an ami
 Petition amazon to open up their repos.
 Failed Workaround:
 I attempted to use a normal red-hat ami, however the current deploy scripts 
 assume the user and the install path are root. While the deploy script allows 
 you to override the user, they don't work if you set ec2-user basically 
 preventing you from using any current ami other than the default amazon one 
 which is unfortunate.
 So normally this would work, but because amazon 403s you get this if you want 
 to use python 2.7
 $ yum install -y python27.x86_64 python27-devel.x86_64 python27-pip.noarch
 Loaded plugins: priorities, security, update-motd, upgrade-helper
 http://packages.us-east-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.us-west-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.us-west-2.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.eu-west-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.ap-southeast-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.ap-northeast-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.sa-east-1.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.
 http://packages.ap-southeast-2.amazonaws.com/2015.03/main/20150301f40d/x86_64/repodata/repomd.xml:
  [Errno 14] PYCURL ERROR 22 - The requested URL returned error: 403 
 Forbidden
 Trying other mirror.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-7606) Document all PySpark SQL/DataFrame public methods with @since tag

2015-05-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542701#comment-14542701
 ] 

Nicholas Chammas edited comment on SPARK-7606 at 5/13/15 8:57 PM:
--

Just looked into this. If we are using Sphinx for the docs, which I believe we 
are, then there are the {{.. versionadded::}} and {{.. versionchanged::}} tags.

http://sphinx-doc.org/markup/para.html#directive-versionadded

This is what the Python standard library uses. For example:

https://docs.python.org/3.5/library/subprocess.html#subprocess.run
https://github.com/python/cpython/commit/9ed5f6e6e7ec5afefdb17bab6106881d8fddba68#diff-7843f151fe534572c98062d58b998ba8R120

{quote}
New in version 3.5.
{quote}

What's the relevance of dynamic typing here?


was (Author: nchammas):
Just looked into this. If we are using Sphinx for the docs, which I believe we 
are, then there are the {{::versionadded}} and {{::versionchanged}} tags.

http://sphinx-doc.org/markup/para.html#directive-versionadded

This is what the Python standard library uses. For example:

https://docs.python.org/3.5/library/subprocess.html#subprocess.run
https://github.com/python/cpython/commit/9ed5f6e6e7ec5afefdb17bab6106881d8fddba68#diff-7843f151fe534572c98062d58b998ba8R120

{quote}
New in version 3.5.
{quote}

What's the relevance of dynamic typing here?

 Document all PySpark SQL/DataFrame public methods with @since tag
 -

 Key: SPARK-7606
 URL: https://issues.apache.org/jira/browse/SPARK-7606
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Nicholas Chammas
Priority: Blocker
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7606) Document all PySpark SQL/DataFrame public methods with @since tag

2015-05-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542701#comment-14542701
 ] 

Nicholas Chammas commented on SPARK-7606:
-

Just looked into this. If we are using Sphinx for the docs, which I believe we 
are, then there are the {{::versionadded}} and {{::versionchanged}} tags.

http://sphinx-doc.org/markup/para.html#directive-versionadded

This is what the Python standard library uses. For example:

https://docs.python.org/3.5/library/subprocess.html#subprocess.run
https://github.com/python/cpython/commit/9ed5f6e6e7ec5afefdb17bab6106881d8fddba68#diff-7843f151fe534572c98062d58b998ba8R120

{quote}
New in version 3.5.
{quote}

What's the relevance of dynamic typing here?
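
For reference, this is roughly what the directive looks like inside a PySpark
docstring (the method below is made up purely for illustration):

{code}
def frobnicate(df):
    """Illustrative only: how a Sphinx version note sits in a docstring.

    .. versionadded:: 1.4.0
       Hypothetical method, shown just to demonstrate the directive.
    """
    return df
{code}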

 Document all PySpark SQL/DataFrame public methods with @since tag
 -

 Key: SPARK-7606
 URL: https://issues.apache.org/jira/browse/SPARK-7606
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Nicholas Chammas
Priority: Blocker
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-7606) Document all PySpark SQL/DataFrame public methods with @since tag

2015-05-13 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-7606:
---

 Summary: Document all PySpark SQL/DataFrame public methods with 
@since tag
 Key: SPARK-7606
 URL: https://issues.apache.org/jira/browse/SPARK-7606
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Nicholas Chammas
Assignee: Reynold Xin
Priority: Blocker
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7606) Document all PySpark SQL/DataFrame public methods with @since tag

2015-05-13 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14542176#comment-14542176
 ] 

Nicholas Chammas commented on SPARK-7606:
-

(I just cloned SPARK-7588.)

Dunno what mechanism we're gonna use to label stuff in Python, but we should do 
it.

cc [~davies]

 Document all PySpark SQL/DataFrame public methods with @since tag
 -

 Key: SPARK-7606
 URL: https://issues.apache.org/jira/browse/SPARK-7606
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Nicholas Chammas
Assignee: Reynold Xin
Priority: Blocker
 Fix For: 1.4.0






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


