Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-06 Thread Nicholas Chammas
Thank you Josh! I confirmed that the Spark 1.6.1 / Hadoop 2.6 package on S3
is now working, and the SHA512 checks out.
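For anyone else checking, the comparison can also be scripted without gpg. A minimal Python sketch, assuming the published .sha file uses the `filename: HEX HEX ...` layout that `gpg --print-md` emits:

import hashlib

def sha512_of(path, chunk_size=1 << 20):
    """Stream the file so large archives don't need to fit in memory."""
    digest = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Strip the filename prefix, whitespace, and casing that `gpg --print-md` adds
# before comparing against the locally computed digest.
published = open("spark-1.6.1-bin-hadoop2.6.tgz.sha").read()
published_hex = "".join(published.split()).split(":")[-1].lower()

print(sha512_of("spark-1.6.1-bin-hadoop2.6.tgz") == published_hex)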

On Wed, Apr 6, 2016 at 3:19 PM Josh Rosen <joshro...@databricks.com> wrote:

> I downloaded the Spark 1.6.1 artifacts from the Apache mirror network and
> re-uploaded them to the spark-related-packages S3 bucket, so hopefully
> these packages should be fixed now.
>
> On Mon, Apr 4, 2016 at 3:37 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks, that was the command. :thumbsup:
>>
>> On Mon, Apr 4, 2016 at 6:28 PM Jakob Odersky <ja...@odersky.com> wrote:
>>
>>> I just found out how the hash is calculated:
>>>
>>> gpg --print-md sha512 .tgz
>>>
>>> you can use that to check if the resulting output matches the contents
>>> of .tgz.sha
>>>
>>> On Mon, Apr 4, 2016 at 3:19 PM, Jakob Odersky <ja...@odersky.com> wrote:
>>> > The published hash is a SHA512.
>>> >
>>> > You can verify the integrity of the packages by running `sha512sum` on
>>> > the archive and comparing the computed hash with the published one.
>>> > Unfortunately however, I don't know what tool is used to generate the
>>> > hash and I can't reproduce the format, so I ended up manually
>>> > comparing the hashes.
>>> >
>>> > On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas
>>> > <nicholas.cham...@gmail.com> wrote:
>>> >> An additional note: The Spark packages being served off of CloudFront
>>> (i.e.
>>> >> the “direct download” option on spark.apache.org) are also corrupt.
>>> >>
>>> >> Btw what’s the correct way to verify the SHA of a Spark package? I’ve
>>> tried
>>> >> a few commands on working packages downloaded from Apache mirrors,
>>> but I
>>> >> can’t seem to reproduce the published SHA for
>>> spark-1.6.1-bin-hadoop2.6.tgz.
>>> >>
>>> >>
>>> >> On Mon, Apr 4, 2016 at 11:45 AM Ted Yu <yuzhih...@gmail.com> wrote:
>>> >>>
>>> >>> Maybe temporarily take out the artifacts on S3 before the root cause
>>> is
>>> >>> found.
>>> >>>
>>> >>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas
>>> >>> <nicholas.cham...@gmail.com> wrote:
>>> >>>>
>>> >>>> Just checking in on this again as the builds on S3 are still
>>> broken. :/
>>> >>>>
>>> >>>> Could it have something to do with us moving release-build.sh?
>>> >>>>
>>> >>>>
>>> >>>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas
>>> >>>> <nicholas.cham...@gmail.com> wrote:
>>> >>>>>
>>> >>>>> Is someone going to retry fixing these packages? It's still a
>>> problem.
>>> >>>>>
>>> >>>>> Also, it would be good to understand why this is happening.
>>> >>>>>
>>> >>>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com>
>>> wrote:
>>> >>>>>>
>>> >>>>>> I just realized you're using a different download site. Sorry for
>>> the
>>> >>>>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>> >>>>>> Hadoop 2.6 is
>>> >>>>>>
>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>> >>>>>>
>>> >>>>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>> >>>>>> <nicholas.cham...@gmail.com> wrote:
>>> >>>>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
>>> >>>>>> > corrupt ZIP
>>> >>>>>> > file.
>>> >>>>>> >
>>> >>>>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it
>>> the same
>>> >>>>>> > Spark
>>> >>>>>> > 1.6.1/Hadoop 2.6 package you had a success with?
>>> >>>>>> >
>>> >>>>>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <
>>> ja...@odersky.com>
>>> >>>>>> > wrote:
>>> >>>>>> >>

Re: [Distutils] How to deprecate a python package

2016-04-06 Thread Nicholas Chammas
FYI, there is an existing issue on Warehouse's tracker for this:
https://github.com/pypa/warehouse/issues/345
___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig


[jira] [Closed] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2016-04-05 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas closed SPARK-3821.
---
Resolution: Won't Fix

I'm resolving this as "Won't Fix" due to lack of interest, both on my part and 
on the part of the Spark / spark-ec2 project maintainers.

If anyone's interested in picking this up, the code is here: 
https://github.com/nchammas/spark-ec2/tree/packer/image-build

I've mostly moved on from spark-ec2 to work on 
[Flintrock|https://github.com/nchammas/flintrock], which doesn't require custom 
AMIs.

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>    Reporter: Nicholas Chammas
>Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Nicholas Chammas
Thanks, that was the command. :thumbsup:

On Mon, Apr 4, 2016 at 6:28 PM Jakob Odersky <ja...@odersky.com> wrote:

> I just found out how the hash is calculated:
>
> gpg --print-md sha512 .tgz
>
> you can use that to check if the resulting output matches the contents
> of .tgz.sha
>
> On Mon, Apr 4, 2016 at 3:19 PM, Jakob Odersky <ja...@odersky.com> wrote:
> > The published hash is a SHA512.
> >
> > You can verify the integrity of the packages by running `sha512sum` on
> > the archive and comparing the computed hash with the published one.
> > Unfortunately however, I don't know what tool is used to generate the
> > hash and I can't reproduce the format, so I ended up manually
> > comparing the hashes.
> >
> > On Mon, Apr 4, 2016 at 2:39 PM, Nicholas Chammas
> > <nicholas.cham...@gmail.com> wrote:
> >> An additional note: The Spark packages being served off of CloudFront
> (i.e.
> >> the “direct download” option on spark.apache.org) are also corrupt.
> >>
> >> Btw what’s the correct way to verify the SHA of a Spark package? I’ve
> tried
> >> a few commands on working packages downloaded from Apache mirrors, but I
> >> can’t seem to reproduce the published SHA for
> spark-1.6.1-bin-hadoop2.6.tgz.
> >>
> >>
> >> On Mon, Apr 4, 2016 at 11:45 AM Ted Yu <yuzhih...@gmail.com> wrote:
> >>>
> >>> Maybe temporarily take out the artifacts on S3 before the root cause is
> >>> found.
> >>>
> >>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas
> >>> <nicholas.cham...@gmail.com> wrote:
> >>>>
> >>>> Just checking in on this again as the builds on S3 are still broken.
> :/
> >>>>
> >>>> Could it have something to do with us moving release-build.sh?
> >>>>
> >>>>
> >>>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas
> >>>> <nicholas.cham...@gmail.com> wrote:
> >>>>>
> >>>>> Is someone going to retry fixing these packages? It's still a
> problem.
> >>>>>
> >>>>> Also, it would be good to understand why this is happening.
> >>>>>
> >>>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com>
> wrote:
> >>>>>>
> >>>>>> I just realized you're using a different download site. Sorry for
> the
> >>>>>> confusion, the link I get for a direct download of Spark 1.6.1 /
> >>>>>> Hadoop 2.6 is
> >>>>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
> >>>>>>
> >>>>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
> >>>>>> <nicholas.cham...@gmail.com> wrote:
> >>>>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
> >>>>>> > corrupt ZIP
> >>>>>> > file.
> >>>>>> >
> >>>>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the
> same
> >>>>>> > Spark
> >>>>>> > 1.6.1/Hadoop 2.6 package you had a success with?
> >>>>>> >
> >>>>>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com>
> >>>>>> > wrote:
> >>>>>> >>
> >>>>>> >> I just experienced the issue, however retrying the download a
> second
> >>>>>> >> time worked. Could it be that there is some load balancer/cache
> in
> >>>>>> >> front of the archive and some nodes still serve the corrupt
> >>>>>> >> packages?
> >>>>>> >>
> >>>>>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
> >>>>>> >> <nicholas.cham...@gmail.com> wrote:
> >>>>>> >> > I'm seeing the same. :(
> >>>>>> >> >
> >>>>>> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com>
> >>>>>> >> > wrote:
> >>>>>> >> >>
> >>>>>> >> >> I tried again this morning :
> >>>>>> >> >>
> >>>>>> >> >> $ wget
> >>>>>> >> >>
> >>>>>> >> >>
> 

Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Nicholas Chammas
An additional note: The Spark packages being served off of CloudFront (i.e.
the “direct download” option on spark.apache.org) are also corrupt.

Btw what’s the correct way to verify the SHA of a Spark package? I’ve tried
a few commands on working packages downloaded from Apache mirrors, but I
can’t seem to reproduce the published SHA for spark-1.6.1-bin-hadoop2.6.tgz
<http://www.apache.org/dist/spark/spark-1.6.1/spark-1.6.1-bin-hadoop2.6.tgz.sha>
.
​

On Mon, Apr 4, 2016 at 11:45 AM Ted Yu <yuzhih...@gmail.com> wrote:

> Maybe temporarily take out the artifacts on S3 before the root cause is
> found.
>
> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Just checking in on this again as the builds on S3 are still broken. :/
>>
>> Could it have something to do with us moving release-build.sh
>> <https://github.com/apache/spark/commits/master/dev/create-release/release-build.sh>
>> ?
>> ​
>>
>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Is someone going to retry fixing these packages? It's still a problem.
>>>
>>> Also, it would be good to understand why this is happening.
>>>
>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com> wrote:
>>>
>>>> I just realized you're using a different download site. Sorry for the
>>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>>> Hadoop 2.6 is
>>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>>>
>>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>>> <nicholas.cham...@gmail.com> wrote:
>>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
>>>> corrupt ZIP
>>>> > file.
>>>> >
>>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the same
>>>> Spark
>>>> > 1.6.1/Hadoop 2.6 package you had a success with?
>>>> >
>>>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com>
>>>> wrote:
>>>> >>
>>>> >> I just experienced the issue, however retrying the download a second
>>>> >> time worked. Could it be that there is some load balancer/cache in
>>>> >> front of the archive and some nodes still serve the corrupt packages?
>>>> >>
>>>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>>>> >> <nicholas.cham...@gmail.com> wrote:
>>>> >> > I'm seeing the same. :(
>>>> >> >
>>>> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com>
>>>> wrote:
>>>> >> >>
>>>> >> >> I tried again this morning :
>>>> >> >>
>>>> >> >> $ wget
>>>> >> >>
>>>> >> >>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >> --2016-03-18 07:55:30--
>>>> >> >>
>>>> >> >>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >> Resolving s3.amazonaws.com... 54.231.19.163
>>>> >> >> ...
>>>> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >>
>>>> >> >> gzip: stdin: unexpected end of file
>>>> >> >> tar: Unexpected EOF in archive
>>>> >> >> tar: Unexpected EOF in archive
>>>> >> >> tar: Error is not recoverable: exiting now
>>>> >> >>
>>>> >> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
>>>> >> >> <mich...@databricks.com>
>>>> >> >> wrote:
>>>> >> >>>
>>>> >> >>> Patrick reuploaded the artifacts, so it should be fixed now.
>>>> >> >>>
>>>> >> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas"
>>>> >> >>> <nicholas.cham...@gmail.com>
>>>> >> >>> wrote:
>>>> >> >>>>
>>>> >> >>>> Looks like the other packages may also be corrupt. I’m getting
>>>> the
>>>> >> >>>> same
>>>> >> >>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>>> >> >>>>
>>>> >> >>>> Nick
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com>
>>>> wrote:
>>>> >> >>>>>
>>>> >> >>>>> On Linux, I got:
>>>> >> >>>>>
>>>> >> >>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >>>>>
>>>> >> >>>>> gzip: stdin: unexpected end of file
>>>> >> >>>>> tar: Unexpected EOF in archive
>>>> >> >>>>> tar: Unexpected EOF in archive
>>>> >> >>>>> tar: Error is not recoverable: exiting now
>>>> >> >>>>>
>>>> >> >>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
>>>> >> >>>>> <nicholas.cham...@gmail.com> wrote:
>>>> >> >>>>>>
>>>> >> >>>>>>
>>>> >> >>>>>>
>>>> >> >>>>>>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >>>>>>
>>>> >> >>>>>> Does anyone else have trouble unzipping this? How did this
>>>> happen?
>>>> >> >>>>>>
>>>> >> >>>>>> What I get is:
>>>> >> >>>>>>
>>>> >> >>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>>> >> >>>>>>
>>>> >> >>>>>> Seems like a strange type of problem to come across.
>>>> >> >>>>>>
>>>> >> >>>>>> Nick
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>
>>>> >> >
>>>>
>>>
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-04-04 Thread Nicholas Chammas
This is still an issue. The Spark 1.6.1 packages on S3 are corrupt.

Is anyone looking into this issue? Is there anything contributors can do to
help solve this problem?

Nick

On Sun, Mar 27, 2016 at 8:49 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> Pingity-ping-pong since this is still a problem.
>
>
> On Thu, Mar 24, 2016 at 4:08 PM Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Patrick is investigating.
>>
>> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Just checking in on this again as the builds on S3 are still broken. :/
>>>
>>> Could it have something to do with us moving release-build.sh
>>> <https://github.com/apache/spark/commits/master/dev/create-release/release-build.sh>
>>> ?
>>> ​
>>>
>>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Is someone going to retry fixing these packages? It's still a problem.
>>>>
>>>> Also, it would be good to understand why this is happening.
>>>>
>>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com>
>>>> wrote:
>>>>
>>>>> I just realized you're using a different download site. Sorry for the
>>>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>>>> Hadoop 2.6 is
>>>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>>>>
>>>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>>>> <nicholas.cham...@gmail.com> wrote:
>>>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
>>>>> corrupt ZIP
>>>>> > file.
>>>>> >
>>>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the
>>>>> same Spark
>>>>> > 1.6.1/Hadoop 2.6 package you had a success with?
>>>>> >
>>>>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com>
>>>>> wrote:
>>>>> >>
>>>>> >> I just experienced the issue, however retrying the download a second
>>>>> >> time worked. Could it be that there is some load balancer/cache in
>>>>> >> front of the archive and some nodes still serve the corrupt
>>>>> packages?
>>>>> >>
>>>>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>>>>> >> <nicholas.cham...@gmail.com> wrote:
>>>>> >> > I'm seeing the same. :(
>>>>> >> >
>>>>> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com>
>>>>> wrote:
>>>>> >> >>
>>>>> >> >> I tried again this morning :
>>>>> >> >>
>>>>> >> >> $ wget
>>>>> >> >>
>>>>> >> >>
>>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>> >> >> --2016-03-18 07:55:30--
>>>>> >> >>
>>>>> >> >>
>>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>> >> >> Resolving s3.amazonaws.com... 54.231.19.163
>>>>> >> >> ...
>>>>> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>>> >> >>
>>>>> >> >> gzip: stdin: unexpected end of file
>>>>> >> >> tar: Unexpected EOF in archive
>>>>> >> >> tar: Unexpected EOF in archive
>>>>> >> >> tar: Error is not recoverable: exiting now
>>>>> >> >>
>>>>> >> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
>>>>> >> >> <mich...@databricks.com>
>>>>> >> >> wrote:
>>>>> >> >>>
>>>>> >> >>> Patrick reuploaded the artifacts, so it should be fixed now.
>>>>> >> >>>
>>>>> >> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas"
>>>>> >> >>> <nicholas.cham...@gmail.com>
>>>>> >> >>> wrote:
>>>>> >> >>>>
>>>>> >> >>>> Looks like the other packages may also be corrupt. I’m getting
>>>>> the
>>>>> >> >>>> same
>>>>> >> >>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>>>> >> >>>>
>>>>> >> >>>> Nick
>>>>> >> >>>>
>>>>> >> >>>>
>>>>> >> >>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com>
>>>>> wrote:
>>>>> >> >>>>>
>>>>> >> >>>>> On Linux, I got:
>>>>> >> >>>>>
>>>>> >> >>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>>> >> >>>>>
>>>>> >> >>>>> gzip: stdin: unexpected end of file
>>>>> >> >>>>> tar: Unexpected EOF in archive
>>>>> >> >>>>> tar: Unexpected EOF in archive
>>>>> >> >>>>> tar: Error is not recoverable: exiting now
>>>>> >> >>>>>
>>>>> >> >>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
>>>>> >> >>>>> <nicholas.cham...@gmail.com> wrote:
>>>>> >> >>>>>>
>>>>> >> >>>>>>
>>>>> >> >>>>>>
>>>>> >> >>>>>>
>>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>> >> >>>>>>
>>>>> >> >>>>>> Does anyone else have trouble unzipping this? How did this
>>>>> happen?
>>>>> >> >>>>>>
>>>>> >> >>>>>> What I get is:
>>>>> >> >>>>>>
>>>>> >> >>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>>>> >> >>>>>>
>>>>> >> >>>>>> Seems like a strange type of problem to come across.
>>>>> >> >>>>>>
>>>>> >> >>>>>> Nick
>>>>> >> >>>>>
>>>>> >> >>>>>
>>>>> >> >>
>>>>> >> >
>>>>>
>>>>
>>


[jira] [Commented] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2016-03-28 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15214577#comment-15214577
 ] 

Nicholas Chammas commented on SPARK-3533:
-

I've added two workarounds for this issue to the description body.

> Add saveAsTextFileByKey() method to RDDs
> 
>
> Key: SPARK-3533
> URL: https://issues.apache.org/jira/browse/SPARK-3533
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>    Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to 
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
> >>> 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so 
> that I have one output directory per distinct key. Each output directory 
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of 
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
> that makes it easy to save RDDs out to multiple locations at once.
> ---
> Update: March 2016
> There are two workarounds to this problem:
> 1. See [this answer on Stack 
> Overflow|http://stackoverflow.com/a/26051042/877069], which implements 
> {{MultipleTextOutputFormat}}. (Scala-only)
> 2. See [this comment by Davies 
> Liu|https://github.com/apache/spark/pull/8375#issuecomment-202458325], which 
> uses DataFrames:
> {code}
> val df = rdd.map(t => Row(gen_key(t), t)).toDF("key", "text")
> df.write.partitionBy("key").text(path){code}
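The DataFrame route in workaround #2 can also be written in PySpark. A hedged sketch (names and paths are illustrative; it assumes Spark 1.6+, where {{DataFrameWriter.text()}} is available):

{code}
# Hedged PySpark sketch of workaround #2; names and paths are illustrative.
from pyspark import SparkContext
from pyspark.sql import Row, SQLContext

sc = SparkContext(appName="save-by-key-sketch")
sqlContext = SQLContext(sc)

rdd = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda x: x[0])
df = sqlContext.createDataFrame(rdd.map(lambda kv: Row(key=kv[0], text=kv[1])))

# partitionBy() drops the partition column from the data files, leaving a single
# string column for the text writer, and produces one directory per key:
# /path/prefix/key=B/, /path/prefix/key=F/, /path/prefix/key=N/
df.write.partitionBy("key").text("/path/prefix")
{code}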






[jira] [Updated] (SPARK-3533) Add saveAsTextFileByKey() method to RDDs

2016-03-28 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3533?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-3533:

Description: 
Users often have a single RDD of key-value pairs that they want to save to 
multiple locations based on the keys.

For example, say I have an RDD like this:
{code}
>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda 
>>> x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>>> a.keys().distinct().collect()
['B', 'F', 'N']
{code}

Now I want to write the RDD out to different paths depending on the keys, so 
that I have one output directory per distinct key. Each output directory could 
potentially have multiple {{part-}} files, one per RDD partition.

So the output would look something like:

{code}
/path/prefix/B [/part-1, /part-2, etc]
/path/prefix/F [/part-1, /part-2, etc]
/path/prefix/N [/part-1, /part-2, etc]
{code}

Though it may be possible to do this with some combination of 
{{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
{{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
It's not clear if it's even possible at all in PySpark.

Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that 
makes it easy to save RDDs out to multiple locations at once.

---

Update: March 2016

There are two workarounds to this problem:

1. See [this answer on Stack 
Overflow|http://stackoverflow.com/a/26051042/877069], which implements 
{{MultipleTextOutputFormat}}. (Scala-only)
2. See [this comment by Davies 
Liu|https://github.com/apache/spark/pull/8375#issuecomment-202458325], which 
uses DataFrames:
{code}
val df = rdd.map(t => Row(gen_key(t), t)).toDF("key", "text")
df.write.partitionBy("key").text(path){code}


  was:
Users often have a single RDD of key-value pairs that they want to save to 
multiple locations based on the keys.

For example, say I have an RDD like this:
{code}
>>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 'Frankie']).keyBy(lambda 
>>> x: x[0])
>>> a.collect()
[('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
>>> a.keys().distinct().collect()
['B', 'F', 'N']
{code}

Now I want to write the RDD out to different paths depending on the keys, so 
that I have one output directory per distinct key. Each output directory could 
potentially have multiple {{part-}} files, one per RDD partition.

So the output would look something like:

{code}
/path/prefix/B [/part-1, /part-2, etc]
/path/prefix/F [/part-1, /part-2, etc]
/path/prefix/N [/part-1, /part-2, etc]
{code}

Though it may be possible to do this with some combination of 
{{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
{{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
It's not clear if it's even possible at all in PySpark.

Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs that 
makes it easy to save RDDs out to multiple locations at once.


> Add saveAsTextFileByKey() method to RDDs
> 
>
> Key: SPARK-3533
> URL: https://issues.apache.org/jira/browse/SPARK-3533
> Project: Spark
>      Issue Type: Improvement
>  Components: PySpark, Spark Core
>Affects Versions: 1.1.0
>Reporter: Nicholas Chammas
>
> Users often have a single RDD of key-value pairs that they want to save to 
> multiple locations based on the keys.
> For example, say I have an RDD like this:
> {code}
> >>> a = sc.parallelize(['Nick', 'Nancy', 'Bob', 'Ben', 
> >>> 'Frankie']).keyBy(lambda x: x[0])
> >>> a.collect()
> [('N', 'Nick'), ('N', 'Nancy'), ('B', 'Bob'), ('B', 'Ben'), ('F', 'Frankie')]
> >>> a.keys().distinct().collect()
> ['B', 'F', 'N']
> {code}
> Now I want to write the RDD out to different paths depending on the keys, so 
> that I have one output directory per distinct key. Each output directory 
> could potentially have multiple {{part-}} files, one per RDD partition.
> So the output would look something like:
> {code}
> /path/prefix/B [/part-1, /part-2, etc]
> /path/prefix/F [/part-1, /part-2, etc]
> /path/prefix/N [/part-1, /part-2, etc]
> {code}
> Though it may be possible to do this with some combination of 
> {{saveAsNewAPIHadoopFile()}}, {{saveAsHadoopFile()}}, and the 
> {{MultipleTextOutputFormat}} output format class, it isn't straightforward. 
> It's not clear if it's even possible at all in PySpark.
> Please add a {{saveAsTextFileByKey()}} method or something similar to RDDs 
> that makes it easy to save RDDs out to multiple locations at once.

[Distutils] Thank you for the ability to do `pip install git+https://...`

2016-03-28 Thread Nicholas Chammas
Dunno how old/new this feature is, or what people did before it existed,
but I just wanted to thank the people who thought of and built the ability
to do installs from git+https.

It lets me offer the following to my users when they want the “bleeding
edge” version of my project:

pip install git+https://github.com/nchammas/flintrock

I also use this capability to install and test contributors’ branches when
they open PRs against my project. For example:

pip install git+https://github.com/contributor/flintrock@branch

It’s a great feature and makes my work a bit easier. Thank you for building
it.

I’m still waiting for when I can give the PyPA some money for all the good
and sorely needed work that y’all do…

Anyway, keep it up.

Nick
​


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-27 Thread Nicholas Chammas
Pingity-ping-pong since this is still a problem.

On Thu, Mar 24, 2016 at 4:08 PM Michael Armbrust <mich...@databricks.com>
wrote:

> Patrick is investigating.
>
> On Thu, Mar 24, 2016 at 7:25 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Just checking in on this again as the builds on S3 are still broken. :/
>>
>> Could it have something to do with us moving release-build.sh
>> <https://github.com/apache/spark/commits/master/dev/create-release/release-build.sh>
>> ?
>> ​
>>
>> On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Is someone going to retry fixing these packages? It's still a problem.
>>>
>>> Also, it would be good to understand why this is happening.
>>>
>>> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com> wrote:
>>>
>>>> I just realized you're using a different download site. Sorry for the
>>>> confusion, the link I get for a direct download of Spark 1.6.1 /
>>>> Hadoop 2.6 is
>>>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>>>
>>>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>>>> <nicholas.cham...@gmail.com> wrote:
>>>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a
>>>> corrupt ZIP
>>>> > file.
>>>> >
>>>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the same
>>>> Spark
>>>> > 1.6.1/Hadoop 2.6 package you had a success with?
>>>> >
>>>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com>
>>>> wrote:
>>>> >>
>>>> >> I just experienced the issue, however retrying the download a second
>>>> >> time worked. Could it be that there is some load balancer/cache in
>>>> >> front of the archive and some nodes still serve the corrupt packages?
>>>> >>
>>>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>>>> >> <nicholas.cham...@gmail.com> wrote:
>>>> >> > I'm seeing the same. :(
>>>> >> >
>>>> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com>
>>>> wrote:
>>>> >> >>
>>>> >> >> I tried again this morning :
>>>> >> >>
>>>> >> >> $ wget
>>>> >> >>
>>>> >> >>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >> --2016-03-18 07:55:30--
>>>> >> >>
>>>> >> >>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >> Resolving s3.amazonaws.com... 54.231.19.163
>>>> >> >> ...
>>>> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >>
>>>> >> >> gzip: stdin: unexpected end of file
>>>> >> >> tar: Unexpected EOF in archive
>>>> >> >> tar: Unexpected EOF in archive
>>>> >> >> tar: Error is not recoverable: exiting now
>>>> >> >>
>>>> >> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
>>>> >> >> <mich...@databricks.com>
>>>> >> >> wrote:
>>>> >> >>>
>>>> >> >>> Patrick reuploaded the artifacts, so it should be fixed now.
>>>> >> >>>
>>>> >> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas"
>>>> >> >>> <nicholas.cham...@gmail.com>
>>>> >> >>> wrote:
>>>> >> >>>>
>>>> >> >>>> Looks like the other packages may also be corrupt. I’m getting
>>>> the
>>>> >> >>>> same
>>>> >> >>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>>> >> >>>>
>>>> >> >>>> Nick
>>>> >> >>>>
>>>> >> >>>>
>>>> >> >>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com>
>>>> wrote:
>>>> >> >>>>>
>>>> >> >>>>> On Linux, I got:
>>>> >> >>>>>
>>>> >> >>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >>>>>
>>>> >> >>>>> gzip: stdin: unexpected end of file
>>>> >> >>>>> tar: Unexpected EOF in archive
>>>> >> >>>>> tar: Unexpected EOF in archive
>>>> >> >>>>> tar: Error is not recoverable: exiting now
>>>> >> >>>>>
>>>> >> >>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
>>>> >> >>>>> <nicholas.cham...@gmail.com> wrote:
>>>> >> >>>>>>
>>>> >> >>>>>>
>>>> >> >>>>>>
>>>> >> >>>>>>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >>>>>>
>>>> >> >>>>>> Does anyone else have trouble unzipping this? How did this
>>>> happen?
>>>> >> >>>>>>
>>>> >> >>>>>> What I get is:
>>>> >> >>>>>>
>>>> >> >>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>>> >> >>>>>>
>>>> >> >>>>>> Seems like a strange type of problem to come across.
>>>> >> >>>>>>
>>>> >> >>>>>> Nick
>>>> >> >>>>>
>>>> >> >>>>>
>>>> >> >>
>>>> >> >
>>>>
>>>
>


Re: Reading Back a Cached RDD

2016-03-24 Thread Nicholas Chammas
Isn’t persist() only for reusing an RDD within an active application? Maybe
checkpoint() is what you’re looking for instead?
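A minimal PySpark sketch of that approach, with the caveat that checkpoint files outlive the lineage, but reading data back from a brand-new shell session is simplest via an explicit save and re-load (paths are illustrative):

from pyspark import SparkContext

sc = SparkContext(appName="checkpoint-sketch")
sc.setCheckpointDir("hdfs:///tmp/checkpoints")  # assumed checkpoint location

squares = sc.parallelize(range(1000)).map(lambda x: x * x)
squares.checkpoint()  # marks the RDD; the checkpoint is written on the next action
squares.count()

# Checkpoint files survive beyond the lineage, but for reuse from another
# session an explicit save and re-load is the simpler, supported route:
squares.saveAsPickleFile("hdfs:///tmp/squares")
# later, in a new shell:  sc.pickleFile("hdfs:///tmp/squares")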
​

On Thu, Mar 24, 2016 at 2:02 PM Afshartous, Nick 
wrote:

>
> Hi,
>
>
> After calling RDD.persist(), is then possible to come back later and
> access the persisted RDD.
>
> Let's say for instance coming back and starting a new Spark shell
> session.  How would one access the persisted RDD in the new shell session ?
>
>
> Thanks,
>
> --
>
>Nick
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-24 Thread Nicholas Chammas
Just checking in on this again as the builds on S3 are still broken. :/

Could it have something to do with us moving release-build.sh
<https://github.com/apache/spark/commits/master/dev/create-release/release-build.sh>
?
​

On Mon, Mar 21, 2016 at 1:43 PM Nicholas Chammas <nicholas.cham...@gmail.com>
wrote:

> Is someone going to retry fixing these packages? It's still a problem.
>
> Also, it would be good to understand why this is happening.
>
> On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com> wrote:
>
>> I just realized you're using a different download site. Sorry for the
>> confusion, the link I get for a direct download of Spark 1.6.1 /
>> Hadoop 2.6 is
>> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>>
>> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a corrupt
>> ZIP
>> > file.
>> >
>> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the same
>> Spark
>> > 1.6.1/Hadoop 2.6 package you had a success with?
>> >
>> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com>
>> wrote:
>> >>
>> >> I just experienced the issue, however retrying the download a second
>> >> time worked. Could it be that there is some load balancer/cache in
>> >> front of the archive and some nodes still serve the corrupt packages?
>> >>
>> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
>> >> <nicholas.cham...@gmail.com> wrote:
>> >> > I'm seeing the same. :(
>> >> >
>> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com> wrote:
>> >> >>
>> >> >> I tried again this morning :
>> >> >>
>> >> >> $ wget
>> >> >>
>> >> >>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> >> >> --2016-03-18 07:55:30--
>> >> >>
>> >> >>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> >> >> Resolving s3.amazonaws.com... 54.231.19.163
>> >> >> ...
>> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>> >> >>
>> >> >> gzip: stdin: unexpected end of file
>> >> >> tar: Unexpected EOF in archive
>> >> >> tar: Unexpected EOF in archive
>> >> >> tar: Error is not recoverable: exiting now
>> >> >>
>> >> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
>> >> >> <mich...@databricks.com>
>> >> >> wrote:
>> >> >>>
>> >> >>> Patrick reuploaded the artifacts, so it should be fixed now.
>> >> >>>
>> >> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas"
>> >> >>> <nicholas.cham...@gmail.com>
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> Looks like the other packages may also be corrupt. I’m getting the
>> >> >>>> same
>> >> >>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> >> >>>>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>> >> >>>>
>> >> >>>> Nick
>> >> >>>>
>> >> >>>>
>> >> >>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com>
>> wrote:
>> >> >>>>>
>> >> >>>>> On Linux, I got:
>> >> >>>>>
>> >> >>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>> >> >>>>>
>> >> >>>>> gzip: stdin: unexpected end of file
>> >> >>>>> tar: Unexpected EOF in archive
>> >> >>>>> tar: Unexpected EOF in archive
>> >> >>>>> tar: Error is not recoverable: exiting now
>> >> >>>>>
>> >> >>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
>> >> >>>>> <nicholas.cham...@gmail.com> wrote:
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> >> >>>>>>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>> >> >>>>>>
>> >> >>>>>> Does anyone else have trouble unzipping this? How did this
>> happen?
>> >> >>>>>>
>> >> >>>>>> What I get is:
>> >> >>>>>>
>> >> >>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>> >> >>>>>>
>> >> >>>>>> Seems like a strange type of problem to come across.
>> >> >>>>>>
>> >> >>>>>> Nick
>> >> >>>>>
>> >> >>>>>
>> >> >>
>> >> >
>>
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-21 Thread Nicholas Chammas
Is someone going to retry fixing these packages? It's still a problem.

Also, it would be good to understand why this is happening.

On Fri, Mar 18, 2016 at 6:49 PM Jakob Odersky <ja...@odersky.com> wrote:

> I just realized you're using a different download site. Sorry for the
> confusion, the link I get for a direct download of Spark 1.6.1 /
> Hadoop 2.6 is
> http://d3kbcqa49mib13.cloudfront.net/spark-1.6.1-bin-hadoop2.6.tgz
>
> On Fri, Mar 18, 2016 at 3:20 PM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a corrupt
> ZIP
> > file.
> >
> > Jakob, are you sure the ZIP unpacks correctly for you? Is it the same
> Spark
> > 1.6.1/Hadoop 2.6 package you had a success with?
> >
> > On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com> wrote:
> >>
> >> I just experienced the issue, however retrying the download a second
> >> time worked. Could it be that there is some load balancer/cache in
> >> front of the archive and some nodes still serve the corrupt packages?
> >>
> >> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
> >> <nicholas.cham...@gmail.com> wrote:
> >> > I'm seeing the same. :(
> >> >
> >> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com> wrote:
> >> >>
> >> >> I tried again this morning :
> >> >>
> >> >> $ wget
> >> >>
> >> >>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
> >> >> --2016-03-18 07:55:30--
> >> >>
> >> >>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
> >> >> Resolving s3.amazonaws.com... 54.231.19.163
> >> >> ...
> >> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
> >> >>
> >> >> gzip: stdin: unexpected end of file
> >> >> tar: Unexpected EOF in archive
> >> >> tar: Unexpected EOF in archive
> >> >> tar: Error is not recoverable: exiting now
> >> >>
> >> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust
> >> >> <mich...@databricks.com>
> >> >> wrote:
> >> >>>
> >> >>> Patrick reuploaded the artifacts, so it should be fixed now.
> >> >>>
> >> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas"
> >> >>> <nicholas.cham...@gmail.com>
> >> >>> wrote:
> >> >>>>
> >> >>>> Looks like the other packages may also be corrupt. I’m getting the
> >> >>>> same
> >> >>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> >> >>>>
> >> >>>> Nick
> >> >>>>
> >> >>>>
> >> >>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com>
> wrote:
> >> >>>>>
> >> >>>>> On Linux, I got:
> >> >>>>>
> >> >>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
> >> >>>>>
> >> >>>>> gzip: stdin: unexpected end of file
> >> >>>>> tar: Unexpected EOF in archive
> >> >>>>> tar: Unexpected EOF in archive
> >> >>>>> tar: Error is not recoverable: exiting now
> >> >>>>>
> >> >>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
> >> >>>>> <nicholas.cham...@gmail.com> wrote:
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> >> >>>>>>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
> >> >>>>>>
> >> >>>>>> Does anyone else have trouble unzipping this? How did this
> happen?
> >> >>>>>>
> >> >>>>>> What I get is:
> >> >>>>>>
> >> >>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
> >> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
> >> >>>>>>
> >> >>>>>> Seems like a strange type of problem to come across.
> >> >>>>>>
> >> >>>>>> Nick
> >> >>>>>
> >> >>>>>
> >> >>
> >> >
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-20 Thread Nicholas Chammas
I'm seeing the same. :(

On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com> wrote:

> I tried again this morning :
>
> $ wget
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
> --2016-03-18 07:55:30--
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
> Resolving s3.amazonaws.com... 54.231.19.163
> ...
> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>
> gzip: stdin: unexpected end of file
> tar: Unexpected EOF in archive
> tar: Unexpected EOF in archive
> tar: Error is not recoverable: exiting now
>
> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust <mich...@databricks.com>
> wrote:
>
>> Patrick reuploaded the artifacts, so it should be fixed now.
>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com>
>> wrote:
>>
>>> Looks like the other packages may also be corrupt. I’m getting the same
>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>>
>>>
>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>>
>>> Nick
>>> ​
>>>
>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> On Linux, I got:
>>>>
>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>>
>>>> gzip: stdin: unexpected end of file
>>>> tar: Unexpected EOF in archive
>>>> tar: Unexpected EOF in archive
>>>> tar: Error is not recoverable: exiting now
>>>>
>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
>>>> nicholas.cham...@gmail.com> wrote:
>>>>
>>>>>
>>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>>
>>>>> Does anyone else have trouble unzipping this? How did this happen?
>>>>>
>>>>> What I get is:
>>>>>
>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>>>>
>>>>> Seems like a strange type of problem to come across.
>>>>>
>>>>> Nick
>>>>> ​
>>>>>
>>>>
>>>>
>


[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-19 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15197451#comment-15197451
 ] 

Nicholas Chammas commented on SPARK-7481:
-

(Sorry Steve; can't comment on your proposal since I don't know much about 
these kinds of build decisions.)

Just to add some more evidence to the record that this problem appears to 
affect many people, take a look at this: 
http://stackoverflow.com/search?q=%5Bapache-spark%5D+S3+Hadoop+2.6

Lots of confusion about how to access S3, with the recommended solution as 
before being to [use Spark built against Hadoop 
2.4|http://stackoverflow.com/a/30852341/877069].
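For reference, this is the kind of usage people are trying to get working. A hedged PySpark sketch of reading via {{s3a://}}, which only works once hadoop-aws and the matching AWS SDK jars are on the classpath (exactly what the proposed profile would pull in); the bucket, paths, and credentials are placeholders:

{code}
# Hedged sketch only: bucket, paths, and credentials are placeholders, and
# _jsc.hadoopConfiguration() is an internal-but-common PySpark idiom.
from pyspark import SparkContext

sc = SparkContext(appName="s3a-read-sketch")
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.access.key", "YOUR_ACCESS_KEY")
hadoop_conf.set("fs.s3a.secret.key", "YOUR_SECRET_KEY")

lines = sc.textFile("s3a://your-bucket/path/to/data/")
print(lines.count())
{code}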

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)
> this adds more stuff to the client bundle, but will mean a single spark 
> package can talk to all of the stores.






Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz

Does anyone else have trouble unzipping this? How did this happen?

What I get is:

$ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed

Seems like a strange type of problem to come across.
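For what it's worth, the truncation check that `gzip -t` performs can also be scripted. A minimal Python sketch against the same file:

import gzip

path = "spark-1.6.1-bin-hadoop2.6.tgz"
try:
    # Stream through the archive; a truncated gzip stream raises before EOF.
    with gzip.open(path, "rb") as f:
        while f.read(1 << 20):
            pass
    print("gzip stream looks intact")
except (EOFError, OSError) as exc:
    print("corrupt archive: %s" % exc)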

Nick
​


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
OK cool. I'll test the hadoop-2.6 package and check back here if it's still
broken.

Just curious: How did those packages all get corrupted (if we know)? Seems
like a strange thing to happen.
On Thu, Mar 17, 2016 at 11:57 AM, Michael Armbrust <mich...@databricks.com> wrote:

> Patrick reuploaded the artifacts, so it should be fixed now.
> On Mar 16, 2016 5:48 PM, "Nicholas Chammas" <nicholas.cham...@gmail.com>
> wrote:
>
>> Looks like the other packages may also be corrupt. I’m getting the same
>> error for the Spark 1.6.1 / Hadoop 2.4 package.
>>
>>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
>>
>> Nick
>> ​
>>
>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> On Linux, I got:
>>>
>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>>>
>>> gzip: stdin: unexpected end of file
>>> tar: Unexpected EOF in archive
>>> tar: Unexpected EOF in archive
>>> tar: Error is not recoverable: exiting now
>>>
>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>>
>>>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>>>
>>>> Does anyone else have trouble unzipping this? How did this happen?
>>>>
>>>> What I get is:
>>>>
>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>>>
>>>> Seems like a strange type of problem to come across.
>>>>
>>>> Nick
>>>> ​
>>>>
>>>
>>>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-19 Thread Nicholas Chammas
Looks like the other packages may also be corrupt. I’m getting the same
error for the Spark 1.6.1 / Hadoop 2.4 package.

https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz

Nick
​

On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com> wrote:

> On Linux, I got:
>
> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
>
> gzip: stdin: unexpected end of file
> tar: Unexpected EOF in archive
> tar: Unexpected EOF in archive
> tar: Error is not recoverable: exiting now
>
> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>>
>> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
>>
>> Does anyone else have trouble unzipping this? How did this happen?
>>
>> What I get is:
>>
>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
>>
>> Seems like a strange type of problem to come across.
>>
>> Nick
>> ​
>>
>
>


Re: Spark 1.6.1 Hadoop 2.6 package on S3 corrupt?

2016-03-18 Thread Nicholas Chammas
I just retried the Spark 1.6.1 / Hadoop 2.6 download and got a corrupt ZIP
file.

Jakob, are you sure the ZIP unpacks correctly for you? Is it the same Spark
1.6.1/Hadoop 2.6 package you had a success with?

On Fri, Mar 18, 2016 at 6:11 PM Jakob Odersky <ja...@odersky.com> wrote:

> I just experienced the issue, however retrying the download a second
> time worked. Could it be that there is some load balancer/cache in
> front of the archive and some nodes still serve the corrupt packages?
>
> On Fri, Mar 18, 2016 at 8:00 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > I'm seeing the same. :(
> >
> > On Fri, Mar 18, 2016 at 10:57 AM Ted Yu <yuzhih...@gmail.com> wrote:
> >>
> >> I tried again this morning :
> >>
> >> $ wget
> >>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
> >> --2016-03-18 07:55:30--
> >>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
> >> Resolving s3.amazonaws.com... 54.231.19.163
> >> ...
> >> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
> >>
> >> gzip: stdin: unexpected end of file
> >> tar: Unexpected EOF in archive
> >> tar: Unexpected EOF in archive
> >> tar: Error is not recoverable: exiting now
> >>
> >> On Thu, Mar 17, 2016 at 8:57 AM, Michael Armbrust <
> mich...@databricks.com>
> >> wrote:
> >>>
> >>> Patrick reuploaded the artifacts, so it should be fixed now.
> >>>
> >>> On Mar 16, 2016 5:48 PM, "Nicholas Chammas" <
> nicholas.cham...@gmail.com>
> >>> wrote:
> >>>>
> >>>> Looks like the other packages may also be corrupt. I’m getting the
> same
> >>>> error for the Spark 1.6.1 / Hadoop 2.4 package.
> >>>>
> >>>>
> >>>>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.4.tgz
> >>>>
> >>>> Nick
> >>>>
> >>>>
> >>>> On Wed, Mar 16, 2016 at 8:28 PM Ted Yu <yuzhih...@gmail.com> wrote:
> >>>>>
> >>>>> On Linux, I got:
> >>>>>
> >>>>> $ tar zxf spark-1.6.1-bin-hadoop2.6.tgz
> >>>>>
> >>>>> gzip: stdin: unexpected end of file
> >>>>> tar: Unexpected EOF in archive
> >>>>> tar: Unexpected EOF in archive
> >>>>> tar: Error is not recoverable: exiting now
> >>>>>
> >>>>> On Wed, Mar 16, 2016 at 5:15 PM, Nicholas Chammas
> >>>>> <nicholas.cham...@gmail.com> wrote:
> >>>>>>
> >>>>>>
> >>>>>>
> https://s3.amazonaws.com/spark-related-packages/spark-1.6.1-bin-hadoop2.6.tgz
> >>>>>>
> >>>>>> Does anyone else have trouble unzipping this? How did this happen?
> >>>>>>
> >>>>>> What I get is:
> >>>>>>
> >>>>>> $ gzip -t spark-1.6.1-bin-hadoop2.6.tgz
> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: unexpected end of file
> >>>>>> gzip: spark-1.6.1-bin-hadoop2.6.tgz: uncompress failed
> >>>>>>
> >>>>>> Seems like a strange type of problem to come across.
> >>>>>>
> >>>>>> Nick
> >>>>>
> >>>>>
> >>
> >
>


[jira] [Commented] (SPARK-7505) Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, etc.

2016-03-05 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7505?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15181776#comment-15181776
 ] 

Nicholas Chammas commented on SPARK-7505:
-

I believe items 1, 3, and 4 still apply. They're minor documentation issues, 
but I think they should still be addressed.
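For items 1 and 3, a hedged PySpark sketch of what the docs could show: bracket-style ({{\_\_getitem\_\_}}) column access, and selecting from two joined DataFrames that share a column name (the data is illustrative):

{code}
# Hedged sketch; data and column names are illustrative.
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="df-docs-sketch")
sqlContext = SQLContext(sc)

df1 = sqlContext.createDataFrame([(4, "I know")], ["a", "other"])
df2 = sqlContext.createDataFrame([(4, "I dunno")], ["a", "other"])

# Bracket (__getitem__) access disambiguates the identically named columns.
df12 = df1.join(df2, df1["a"] == df2["a"])
df12.select(df1["a"], df2["other"]).show()
{code}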

> Update PySpark DataFrame docs: encourage __getitem__, mark as experimental, 
> etc.
> 
>
> Key: SPARK-7505
> URL: https://issues.apache.org/jira/browse/SPARK-7505
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, PySpark, SQL
>Affects Versions: 1.3.1, 1.4.1, 1.5.0
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> The PySpark docs for DataFrame need the following fixes and improvements:
> # Per [SPARK-7035], we should encourage the use of {{\_\_getitem\_\_}} over 
> {{\_\_getattr\_\_}} and change all our examples accordingly.
> # *We should say clearly that the API is experimental.* (That is currently 
> not the case for the PySpark docs.)
> # We should provide an example of how to join and select from 2 DataFrames 
> that have identically named columns, because it is not obvious:
>   {code}
> >>> df1 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I know"}']))
> >>> df2 = sqlContext.jsonRDD(sc.parallelize(['{"a": 4, "other": "I dunno"}']))
> >>> df12 = df1.join(df2, df1['a'] == df2['a'])
> >>> df12.select(df1['a'], df2['other']).show()
> a other   
> 
> 4 I dunno  {code}
> # 
> [{{DF.orderBy}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.orderBy]
>  and 
> [{{DF.sort}}|https://spark.apache.org/docs/1.3.1/api/python/pyspark.sql.html#pyspark.sql.DataFrame.sort]
>  should be marked as aliases if that's what they are.






[jira] [Commented] (SPARK-13596) Move misc top-level build files into appropriate subdirs

2016-03-04 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13596?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15180072#comment-15180072
 ] 

Nicholas Chammas commented on SPARK-13596:
--

Looks like {{tox.ini}} is only used by {{pep8}}, so if you move it into 
{{dev/}}, where the Python lint checks run from, that should work.

> Move misc top-level build files into appropriate subdirs
> 
>
> Key: SPARK-13596
> URL: https://issues.apache.org/jira/browse/SPARK-13596
> Project: Spark
>  Issue Type: Sub-task
>  Components: Build
>Affects Versions: 2.0.0
>Reporter: Sean Owen
>
> I'd like to file away a bunch of misc files that are in the top level of the 
> project in order to further tidy the build for 2.0.0. See also SPARK-13529, 
> SPARK-13548.
> Some of these may turn out to be difficult or impossible to move.
> I'd ideally like to move these files into {{build/}}:
> - {{.rat-excludes}}
> - {{checkstyle.xml}}
> - {{checkstyle-suppressions.xml}}
> - {{pylintrc}}
> - {{scalastyle-config.xml}}
> - {{tox.ini}}
> - {{project/}} (or does SBT need this in the root?)
> And ideally, these would go under {{dev/}}
> - {{make-distribution.sh}}
> And remove these
> - {{sbt/sbt}} (backwards-compatible location of {{build/sbt}} right?)
> Edited to add: apparently this can go in {{.github}} now:
> - {{CONTRIBUTING.md}}
> Other files in the top level seem to need to be there, like {{README.md}}






Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
We’re veering off from the original question of this thread, but to
clarify, my comment earlier was this:

So in short, DataFrames are the “new RDD”—i.e. the new base structure you
should be using in your Spark programs wherever possible.

RDDs are not going away, and clearly in your case DataFrames are not that
helpful, so sure, continue to use RDDs. There’s nothing wrong with that.
No-one is saying you *must* use DataFrames, and Spark will continue to
offer its RDD API.

However, my original comment to Jules still stands: If you can, use
DataFrames. In most cases they will offer you a better development
experience and better performance across languages, and future Spark
optimizations will mostly be enabled by the structure that DataFrames
provide.

DataFrames are the “new RDD” in the sense that they are the new foundation
for much of the new work that has been done in recent versions and that is
coming in Spark 2.0 and beyond.

Many people work with semi-structured data and have a relatively easy path
to DataFrames, as I explained in my previous email. If, however, you’re
working with data that has very little structure, like in Darren’s case,
then yes, DataFrames are probably not going to help that much. Stick with
RDDs and you’ll be fine.
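To make that “easy path” concrete, here is a minimal PySpark sketch (field names and paths are illustrative) of going from semi-structured JSON straight to a DataFrame whose operations Catalyst can optimize:

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="df-path-sketch")
sqlContext = SQLContext(sc)

# The schema is inferred from the JSON records themselves.
events = sqlContext.read.json("hdfs:///data/events/*.json")
events.printSchema()

# Operations expressed on DataFrames are what Catalyst can optimize,
# regardless of whether the driver language is Python, Scala, Java, or R.
events.groupBy("country").count().orderBy("count", ascending=False).show()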
​

On Wed, Mar 2, 2016 at 6:28 PM Darren Govoni <dar...@ontrenet.com> wrote:

> Our data is made up of single text documents scraped off the web. We store
> these in an RDD. A Dataframe or similar structure makes no sense at that
> point. And the RDD is transient.
>
> So my point is. Dataframes should not replace plain old rdd since rdds
> allow for more flexibility and sql etc is not even usable on our data while
> in rdd. So all those nice dataframe apis aren't usable until it's
> structured. Which is the core problem anyway.
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
> ---- Original message 
> From: Nicholas Chammas <nicholas.cham...@gmail.com>
> Date: 03/02/2016 5:43 PM (GMT-05:00)
> To: Darren Govoni <dar...@ontrenet.com>, Jules Damji <dmat...@comcast.net>,
> Joshua Sorrell <jsor...@gmail.com>
> Cc: user@spark.apache.org
> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
> features
>
> Plenty of people get their data in Parquet, Avro, or ORC files; or from a
> database; or do their initial loading of un- or semi-structured data using
> one of the various data source libraries
> <http://spark-packages.org/?q=tags%3A%22Data%20Sources%22> which help
> with type-/schema-inference.
>
> All of these paths help you get to a DataFrame very quickly.
>
> Nick
>
> On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:
>
> Dataframes are essentially structured tables with schemas. So where does
>> the non typed data sit before it becomes structured if not in a traditional
>> RDD?
>>
>> For us almost all the processing comes before there is structure to it.
>>
>>
>>
>>
>>
>> Sent from my Verizon Wireless 4G LTE smartphone
>>
>>
>>  Original message 
>> From: Nicholas Chammas <nicholas.cham...@gmail.com>
>> Date: 03/02/2016 5:13 PM (GMT-05:00)
>> To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
>>
>> Cc: user@spark.apache.org
>> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
>> features
>>
>> > However, I believe, investing (or having some members of your group)
>> learn and invest in Scala is worthwhile for few reasons. One, you will get
>> the performance gain, especially now with Tungsten (not sure how it relates
>> to Python, but some other knowledgeable people on the list, please chime
>> in).
>>
>> The more your workload uses DataFrames, the less of a difference there
>> will be between the languages (Scala, Java, Python, or R) in terms of
>> performance.
>>
>> One of the main benefits of Catalyst (which DFs enable) is that it
>> automatically optimizes DataFrame operations, letting you focus on _what_
>> you want while Spark will take care of figuring out _how_.
>>
>> Tungsten takes things further by tightly managing memory using the type
>> information made available to it via DataFrames. This benefit comes into
>> play regardless of the language used.
>>
>> So in short, DataFrames are the "new RDD"--i.e. the new base structure
>> you should be using in your Spark programs wherever possible. And with
>> DataFrames, what language you use matters much less in terms of performance.
>>
>> Nick
>>
>> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
>>

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
Plenty of people get their data in Parquet, Avro, or ORC files; or from a
database; or do their initial loading of un- or semi-structured data using
one of the various data source libraries
<http://spark-packages.org/?q=tags%3A%22Data%20Sources%22> which help with
type-/schema-inference.

All of these paths help you get to a DataFrame very quickly.
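(A quick illustrative sketch in PySpark 1.6, with made-up paths and the usual
sqlContext:)

    # Parquet files carry their schema with them:
    events = sqlContext.read.parquet('events.parquet')

    # JSON gets its schema inferred on read:
    logs = sqlContext.read.json('logs/*.json')

    # CSV via the spark-csv data source package, with schema inference:
    csv = (sqlContext.read.format('com.databricks.spark.csv')
           .options(header='true', inferSchema='true')
           .load('data.csv'))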

Nick

On Wed, Mar 2, 2016 at 5:22 PM Darren Govoni <dar...@ontrenet.com> wrote:

Dataframes are essentially structured tables with schemas. So where does
> the non typed data sit before it becomes structured if not in a traditional
> RDD?
>
> For us almost all the processing comes before there is structure to it.
>
>
>
>
>
> Sent from my Verizon Wireless 4G LTE smartphone
>
>
> ---- Original message 
> From: Nicholas Chammas <nicholas.cham...@gmail.com>
> Date: 03/02/2016 5:13 PM (GMT-05:00)
> To: Jules Damji <dmat...@comcast.net>, Joshua Sorrell <jsor...@gmail.com>
> Cc: user@spark.apache.org
> Subject: Re: Does pyspark still lag far behind the Scala API in terms of
> features
>
> > However, I believe, investing (or having some members of your group)
> learn and invest in Scala is worthwhile for few reasons. One, you will get
> the performance gain, especially now with Tungsten (not sure how it relates
> to Python, but some other knowledgeable people on the list, please chime
> in).
>
> The more your workload uses DataFrames, the less of a difference there
> will be between the languages (Scala, Java, Python, or R) in terms of
> performance.
>
> One of the main benefits of Catalyst (which DFs enable) is that it
> automatically optimizes DataFrame operations, letting you focus on _what_
> you want while Spark will take care of figuring out _how_.
>
> Tungsten takes things further by tightly managing memory using the type
> information made available to it via DataFrames. This benefit comes into
> play regardless of the language used.
>
> So in short, DataFrames are the "new RDD"--i.e. the new base structure you
> should be using in your Spark programs wherever possible. And with
> DataFrames, what language you use matters much less in terms of performance.
>
> Nick
>
> On Tue, Mar 1, 2016 at 12:07 PM Jules Damji <dmat...@comcast.net> wrote:
>
>> Hello Joshua,
>>
>> comments are inline...
>>
>> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell <jsor...@gmail.com> wrote:
>>
>> I haven't used Spark in the last year and a half. I am about to start a
>> project with a new team, and we need to decide whether to use pyspark or
>> Scala.
>>
>>
>> Indeed, good questions, and they do come up a lot in trainings that I have
>> attended, where this inevitable question is raised.
>> I believe it depends on your comfort zone, or your appetite for adventure
>> into newer things.
>>
>> True, for the most part the Apache Spark committers have been committed
>> to keeping the APIs at parity across all the language offerings, even though
>> in some cases, in particular Python, they have lagged by a minor release.
>> The extent to which they’re committed to level parity is a good sign. It
>> might not be the case with some experimental APIs, where they lag behind,
>> but for the most part, they have been admirably consistent.
>>
>> With Python there’s a minor performance hit, since there’s an extra level
>> of indirection in the architecture and an additional Python PID that the
>> executors launch to execute your pickled Python lambdas. Other than that it
>> boils down to your comfort zone. I recommend looking at Sameer’s slides on
>> (Advanced Spark for DevOps Training) where he walks through the pySpark and
>> Python architecture.
>>
>>
>> We are NOT a java shop. So some of the build tools/procedures will
>> require some learning overhead if we go the Scala route. What I want to
>> know is: is the Scala version of Spark still far enough ahead of pyspark to
>> be well worth any initial training overhead?
>>
>>
>> If you are a very advanced Python shop, and you have in-house libraries
>> written in Python that don’t exist in Scala, or some ML libs that don’t
>> exist in the Scala version, and they would require a fair amount of porting
>> and the gap is too large, then perhaps it makes sense to stay put with
>> Python.
>>
>> However, I believe, investing (or having some members of your group)
>> learn and invest in Scala is worthwhile for few reasons. One, you will get
>> the performance gain, especially now with Tungsten (not sure how it relates
>> to Python, but some other knowledgeable people on the list, please chime in).

Re: Does pyspark still lag far behind the Scala API in terms of features

2016-03-02 Thread Nicholas Chammas
> However, I believe, investing (or having some members of your group)
learn and invest in Scala is worthwhile for few reasons. One, you will get
the performance gain, especially now with Tungsten (not sure how it relates
to Python, but some other knowledgeable people on the list, please chime
in).

The more your workload uses DataFrames, the less of a difference there will
be between the languages (Scala, Java, Python, or R) in terms of
performance.

One of the main benefits of Catalyst (which DFs enable) is that it
automatically optimizes DataFrame operations, letting you focus on _what_
you want while Spark will take care of figuring out _how_.

Tungsten takes things further by tightly managing memory using the type
information made available to it via DataFrames. This benefit comes into
play regardless of the language used.

So in short, DataFrames are the "new RDD"--i.e. the new base structure you
should be using in your Spark programs wherever possible. And with
DataFrames, what language you use matters much less in terms of performance.
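(A tiny sketch of the difference, assuming a DataFrame df and an RDD rdd that
both carry a 'color' field:)

    # Declarative: Catalyst plans the query, Tungsten manages the memory,
    # and the resulting plan is the same from Python, Scala, Java, or R.
    df.filter(df['color'] == 'red').groupBy('color').count().show()

    # Hand-rolled RDD version: the Python lambdas run row by row in Python
    # worker processes, so the language you choose matters much more here.
    (rdd.filter(lambda r: r['color'] == 'red')
        .map(lambda r: (r['color'], 1))
        .reduceByKey(lambda a, b: a + b)
        .collect())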

Nick

On Tue, Mar 1, 2016 at 12:07 PM Jules Damji  wrote:

> Hello Joshua,
>
> comments are inline...
>
> On Mar 1, 2016, at 5:03 AM, Joshua Sorrell  wrote:
>
> I haven't used Spark in the last year and a half. I am about to start a
> project with a new team, and we need to decide whether to use pyspark or
> Scala.
>
>
> Indeed, good questions, and they do come up a lot in trainings that I have
> attended, where this inevitable question is raised.
> I believe it depends on your comfort zone, or your appetite for adventure
> into newer things.
>
> True, for the most part the Apache Spark committers have been committed
> to keeping the APIs at parity across all the language offerings, even though
> in some cases, in particular Python, they have lagged by a minor release.
> The extent to which they’re committed to level parity is a good sign. It
> might not be the case with some experimental APIs, where they lag behind,
> but for the most part, they have been admirably consistent.
>
> With Python there’s a minor performance hit, since there’s an extra level
> of indirection in the architecture and an additional Python PID that the
> executors launch to execute your pickled Python lambdas. Other than that it
> boils down to your comfort zone. I recommend looking at Sameer’s slides on
> (Advanced Spark for DevOps Training) where he walks through the pySpark and
> Python architecture.
>
>
> We are NOT a java shop. So some of the build tools/procedures will require
> some learning overhead if we go the Scala route. What I want to know is: is
> the Scala version of Spark still far enough ahead of pyspark to be well
> worth any initial training overhead?
>
>
> If you are a very advanced Python shop, and you have in-house libraries
> written in Python that don’t exist in Scala, or some ML libs that don’t
> exist in the Scala version, and they would require a fair amount of porting
> and the gap is too large, then perhaps it makes sense to stay put with
> Python.
>
> However, I believe, investing (or having some members of your group) learn
> and invest in Scala is worthwhile for few reasons. One, you will get the
> performance gain, especially now with Tungsten (not sure how it relates to
> Python, but some other knowledgeable people on the list, please chime in).
> Two, since Spark is written in Scala, it gives you an enormous advantage to
> read sources (which are well documented and highly readable) should you
> have to consult or learn nuances of certain API method or action not
> covered comprehensively in the docs. And finally, there’s a long term
> benefit in learning Scala for reasons other than Spark. For example,
> writing other scalable and distributed applications.
>
>
> Particularly, we will be using Spark Streaming. I know a couple of years
> ago that practically forced the decision to use Scala.  Is this still the
> case?
>
>
> You’ll notice that certain APIs call are not available, at least for now,
> in Python.
> http://spark.apache.org/docs/latest/streaming-programming-guide.html
>
>
> Cheers
> Jules
>
> --
> The Best Ideas Are Simple
> Jules S. Damji
> e-mail:dmat...@comcast.net
> e-mail:jules.da...@gmail.com
>
>


[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176559#comment-15176559
 ] 

Nicholas Chammas commented on SPARK-7481:
-

I'm not comfortable working with Maven so I can't comment on the details of the 
approach we should take, but I will appreciate any progress towards making 
Spark built against Hadoop 2.6+ work with S3 out of the box, or as close to out 
of the box as possible.

Given Spark's close relation to S3 and EC2 (as far as Spark's user base is 
concerned), a good out of the box experience here is critical. Many people just 
expect it.

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)
> this adds more stuff to the client bundle, but will mean a single spark 
> package can talk to all of the stores.






[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-02 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176551#comment-15176551
 ] 

Nicholas Chammas commented on SPARK-7481:
-

{quote}
One issue here that hadoop 2.6's hadoop-aws pulls in the whole AWT toolkit, 
which is pretty weighty, for s3a ... which isn't something I'd use in 2.6 
anyway.
{quote}

Did you mean something other than s3a here?

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)
> this adds more stuff to the client bundle, but will mean a single spark 
> package can talk to all of the stores.






[jira] [Commented] (SPARK-7481) Add Hadoop 2.6+ profile to pull in object store FS accessors

2016-03-01 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7481?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174438#comment-15174438
 ] 

Nicholas Chammas commented on SPARK-7481:
-

Many people seem to be downgrading to use Spark built against Hadoop 2.4 
because the Spark / Hadoop 2.6 package doesn't work against S3 out of the box.

* [Example 
1|https://issues.apache.org/jira/browse/SPARK-7442?focusedCommentId=14582965=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14582965]
* [Example 
2|https://issues.apache.org/jira/browse/SPARK-7442?focusedCommentId=14903750=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14903750]
* [Example 
3|https://github.com/nchammas/flintrock/issues/88#issuecomment-190905262]

If this proposal eliminates that bit of friction for users without being too 
burdensome on the team, then I'm for it.

Ideally, we want people using Spark built against the latest version of Hadoop 
anyway, right? This proposal would nudge people in that direction.
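For context, this is roughly the manual workaround users resort to today 
(coordinates, keys, and paths below are illustrative, not exact):

{code}
# Submitted with the object store module added by hand, e.g.:
#   spark-submit --packages org.apache.hadoop:hadoop-aws:2.6.0 my_job.py
from pyspark import SparkConf, SparkContext

conf = (SparkConf()
        # Pass S3 credentials through to the underlying Hadoop s3a filesystem.
        .set("spark.hadoop.fs.s3a.access.key", "<access key>")
        .set("spark.hadoop.fs.s3a.secret.key", "<secret key>"))
sc = SparkContext(conf=conf)

rdd = sc.textFile("s3a://some-bucket/some/path/")
{code}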

> Add Hadoop 2.6+ profile to pull in object store FS accessors
> 
>
> Key: SPARK-7481
> URL: https://issues.apache.org/jira/browse/SPARK-7481
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>Affects Versions: 1.3.1
>Reporter: Steve Loughran
>
> To keep the s3n classpath right, to add s3a, swift & azure, the dependencies 
> of spark in a 2.6+ profile need to add the relevant object store packages 
> (hadoop-aws, hadoop-openstack, hadoop-azure)
> this adds more stuff to the client bundle, but will mean a single spark 
> package can talk to all of the stores.






[issue26463] asyncio-related (?) segmentation fault

2016-03-01 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Thanks for the tip. Enabling the fault handler reveals that the crash is 
happening from the Cryptography library. I'll move this issue there.

Thank you.

--
resolution:  -> not a bug
status: open -> closed
Added file: http://bugs.python.org/file42055/faulthandler-stacktrace.txt




[issue26463] asyncio-related (?) segmentation fault

2016-02-29 Thread Nicholas Chammas

Changes by Nicholas Chammas <nicholas.cham...@gmail.com>:


Added file: http://bugs.python.org/file42052/stacktrace.txt




[issue26463] asyncio-related (?) segmentation fault

2016-02-29 Thread Nicholas Chammas

New submission from Nicholas Chammas:

Python 3.5.1, OS X 10.11.3.

I have an application that uses asyncio and Cryptography (via the AsyncSSH 
library). Cryptography has some parts written in C, I believe.

I'm testing my application by sending a keyboard interrupt while 2 tasks are 
working. My application doesn't clean up after itself correctly, so I get these 
warnings about pending tasks being destroyed, but I don't think I should ever 
be getting segfaults. I am able to consistently get this segfault by 
interrupting my application at roughly the same point.

I'm frankly intimidated by the segfault (it's been many years since I dug into 
one), but the most likely culprits are either Python or Cryptography since 
they're the only components of my application that have parts written in C, as 
far as I know.

I'm willing to help boil this down to something more minimal with some help. 
Right now I just have the repro at this branch of my application (which isn't 
too helpful for people other than myself): 

https://github.com/nchammas/flintrock/pull/77

Basically, launch a cluster on EC2, and as soon as one task reports that SSH is 
online, interrupt Flintrock with Control + C. You'll get this segfault.

--
components: Macintosh, asyncio
files: segfault.txt
messages: 261036
nosy: Nicholas Chammas, gvanrossum, haypo, ned.deily, ronaldoussoren, yselivanov
priority: normal
severity: normal
status: open
title: asyncio-related (?) segmentation fault
type: crash
versions: Python 3.5
Added file: http://bugs.python.org/file42051/segfault.txt




Re: Is this likely to cause any problems?

2016-02-19 Thread Nicholas Chammas
The docs mention spark-ec2 because it is part of the Spark project. There
are many, many alternatives to spark-ec2 out there like EMR, but it's
probably not the place of the official docs to promote any one of those
third-party solutions.

On Fri, Feb 19, 2016 at 11:05 AM James Hammerton  wrote:

> Hi,
>
> Having looked at how easy it is to use EMR, I reckon you may be right,
> especially if using Java 8 is no more difficult with that than with
> spark-ec2 (where I had to install it on the master and slaves and edit the
> spark-env.sh).
>
> I'm now curious as to why the Spark documentation (
> http://spark.apache.org/docs/latest/index.html) mentions EC2 but not EMR.
>
> Regards,
>
> James
>
>
> On 19 February 2016 at 14:25, Daniel Siegmann  > wrote:
>
>> With EMR supporting Spark, I don't see much reason to use the spark-ec2
>> script unless it is important for you to be able to launch clusters using
>> the bleeding edge version of Spark. EMR does seem to do a pretty decent job
>> of keeping up to date - the latest version (4.3.0) supports the latest
>> Spark version (1.6.0).
>>
>> So I'd flip the question around and ask: is there any reason to continue
>> using the spark-ec2 script rather than EMR?
>>
>> On Thu, Feb 18, 2016 at 11:39 AM, James Hammerton  wrote:
>>
>>> I have now... So far  I think the issues I've had are not related to
>>> this, but I wanted to be sure in case it should be something that needs to
>>> be patched. I've had some jobs run successfully but this warning appears in
>>> the logs.
>>>
>>> Regards,
>>>
>>> James
>>>
>>> On 18 February 2016 at 12:23, Ted Yu  wrote:
>>>
 Have you seen this ?

 HADOOP-10988

 Cheers

 On Thu, Feb 18, 2016 at 3:39 AM, James Hammerton 
 wrote:

> HI,
>
> I am seeing warnings like this in the logs when I run Spark jobs:
>
> OpenJDK 64-Bit Server VM warning: You have loaded library 
> /root/ephemeral-hdfs/lib/native/libhadoop.so.1.0.0 which might have 
> disabled stack guard. The VM will try to fix the stack guard now.
> It's highly recommended that you fix the library with 'execstack -c 
> ', or link it with '-z noexecstack'.
>
>
> I used spark-ec2 to launch the cluster with the default AMI, Spark
> 1.5.2, hadoop major version 2.4. I altered the jdk to be openjdk 8 as I'd
> written some jobs in Java 8. The 6 workers nodes are m4.2xlarge and master
> is m4.large.
>
> Could this contribute to any problems running the jobs?
>
> Regards,
>
> James
>


>>>
>>
>


Re: [core-workflow] Help needed: best way to convert hg repos to git?

2016-02-15 Thread Nicholas Chammas
Response from GitHub staff regarding using their Importer
 to import CPython:

Unfortunately, the repository is too large to migrate using the importer.
I’d recommend converting it to git locally using something like
hg-fast-export. Due to its size, you’ll need to push the local repo to
GitHub in chunks.

So it seems like Importer is a no-go.

Nick
​

[issue8706] accept keyword arguments on most base type methods and builtins

2016-02-14 Thread Nicholas Chammas

Changes by Nicholas Chammas <nicholas.cham...@gmail.com>:


--
nosy: +Nicholas Chammas




[issue26334] bytes.translate() doesn't take keyword arguments; docs suggests it does

2016-02-12 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Yep, you're right. I'm just understanding now that we have lots of methods 
defined in C which have signatures like this.

Is there an umbrella issue, perhaps, that covers adding support for 
keyword-based arguments to functions defined in C, like `translate()`?

--
resolution:  -> duplicate
status: open -> closed




Re: [core-workflow] Help needed: best way to convert hg repos to git?

2016-02-11 Thread Nicholas Chammas
> I'm currently trying to import to see how it looks, have been stuck at
> 0% for a few minutes now.

Doing the same myself. Got to 73% and it restarted. Am back at 73% now.

Already reached out to GitHub to make them aware of the issue.

Will report here when/if I have results.

Nick

[issue26334] bytes.translate() doesn't take keyword arguments; docs suggests it does

2016-02-10 Thread Nicholas Chammas

Nicholas Chammas added the comment:

So you're saying if `bytes.translate()` accepted keyword arguments, its 
signature would look something like this?

```
bytes.translate(table, delete=None)
```

I guess I was under the mistaken assumption that argument names in the docs 
always matched keyword arguments in the signature.

But you're right, a strictly positional argument (I guess specified via 
something like `*args`?) doesn't have a name.
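For anyone who lands here later, the positional form does work on 3.5:

```
>>> b'read this short text'.translate(None, b'aeiou')
b'rd ths shrt txt'
```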

--




[issue26334] bytes.translate() doesn't take keyword arguments; docs suggests it does

2016-02-10 Thread Nicholas Chammas

New submission from Nicholas Chammas:

The docs for `bytes.translate()` [0] show the following signature:

```
bytes.translate(table[, delete])
```

However, calling this method with keyword arguments yields:

```
>>> b''.translate(table='la table', delete=b'delete')
Traceback (most recent call last):
  File "", line 1, in 
TypeError: translate() takes no keyword arguments
```

I'm guessing other methods have this same issue. (e.g. `str.translate()`)

Do the docs need to be updated, or should these methods be updated to accept 
keyword arguments, or something else?

[0] https://docs.python.org/3/library/stdtypes.html#bytes.translate

--
assignee: docs@python
components: Documentation, Library (Lib)
messages: 260034
nosy: Nicholas Chammas, docs@python
priority: normal
severity: normal
status: open
title: bytes.translate() doesn't take keyword arguments; docs suggests it does
versions: Python 3.5




[issue26188] Provide more helpful error message when `await` is called inside non-`async` method

2016-02-02 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Related discussions about providing more helpful syntax error messages:

* http://bugs.python.org/issue1634034
* http://bugs.python.org/issue400734
* http://bugs.python.org/issue20608

From the discussion on issue1634034, it looks like providing better messages 
in the general case of a syntax error is quite difficult. But perhaps in 
limited cases like this one we can do better.

Parsers are a bit over my head. Martin, is it difficult to distinguish between 
`await` as a regular name and `await` as a special token?
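For context, on 3.5 `await` is only treated specially inside an `async def` 
body; everywhere else it parses as an ordinary name, e.g.:

```python
# Valid on Python 3.5, where await is not yet a reserved keyword:
await = 42
print(await + 1)  # 43

async def demo():
    import asyncio
    await asyncio.sleep(0)  # only here does the tokenizer emit an AWAIT token
```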

--




[issue7850] platform.system() should be "macosx" instead of "Darwin" on OSX

2016-01-30 Thread Nicholas Chammas

Nicholas Chammas added the comment:

As of Python 3.5.1 [0], it looks like

1) the `aliased` and `terse` parameters of `platform.platform()` are documented 
to take integers instead of booleans (contrary to what Marc-Andre requested), 
and 

2) calling `platform.platform()` with `aliased` set to 1 or True still returns 
"Darwin" on OS X.

Is this by design?

[0] https://docs.python.org/3.5/library/platform.html#platform.platform
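For reference, the calls in question (the version string below is illustrative 
of what OS X reports):

```python
>>> import platform
>>> platform.system()
'Darwin'
>>> platform.platform(aliased=1)
'Darwin-15.2.0-x86_64-i386-64bit'   # still "Darwin", no "macosx" alias
```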

--
nosy: +Nicholas Chammas




Re: [PyInstaller] Examples of projects using PyInstaller

2016-01-28 Thread Nicholas Chammas


> You may have a look at borg backup: https://github.com/borgbackup/borg
>
Thanks for the reference! Looks like this is the money shot 

.

As long as you can pass all required parameters on the command line, I'd 
> suggest *not* using a .spec-file. This eases updating for the case the 
> .spec-file would change or get new features.
>
Makes sense. I’ll report back if I find there is something I need that I 
can’t do from the command line interface. 


>- Do they restructure their project in any way to better support 
>PyInstaller?
>
> If you are using virtual environments, this should not be necessary. Just 
> setup the virtual env, install everything required into it, an run 
> pyinstaller.
>
Ah, in my case I had to make a small change to add a script that can be 
called directly, like python3 my-program.py. That’s because the normal way 
to invoke my program is via a console_scripts entry point, and PyInstaller 
doesn’t know how to interface with that yet 
 (as you know). 
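In case it helps anyone else, the wrapper is just a few lines (the import path 
below is hypothetical):

    # standalone.py -- a plain script for PyInstaller to analyze, since it
    # can't (yet) discover a console_scripts entry point on its own.
    from flintrock.flintrock import main  # hypothetical import path

    if __name__ == '__main__':
        main()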


>- Do they use CI services like Travis and AppVeyor to automatically 
>test their packaging process and generate release artifacts for the 
> various 
>OSes?
>
> borg backup uses travis and Vagrant. (I personally find the later one very 
> interesting and adopted it for building PyInstallers bootloader.)
>
Interesting. I’ll take a look at Travis first since it supports uploading 
artifacts to S3 . So 
I can have PyInstaller run as one of my “tests” and publish the result to 
S3.

Nick
​



Re: Is spark-ec2 going away?

2016-01-27 Thread Nicholas Chammas
I noticed that in the main branch, the ec2 directory along with the
spark-ec2 script is no longer present.

It’s been moved out of the main repo to its own location:
https://github.com/amplab/spark-ec2/pull/21

Is spark-ec2 going away in the next release? If so, what would be the best
alternative at that time?

It’s not going away. It’s just being removed from the main Spark repo and
maintained separately.

There are many alternatives like EMR, which was already mentioned, as well
as more full-service solutions like Databricks. It depends on what you’re
looking for.

If you want something as close to spark-ec2 as possible but more actively
developed, you might be interested in checking out Flintrock
, which I built.

Is there any way to add/remove additional workers while the cluster is
running without stopping/starting the EC2 cluster?

Not currently possible with spark-ec2 and a bit difficult to add. See:
https://issues.apache.org/jira/browse/SPARK-2008

For 1, if no such capability is provided with the current script., do we
have to write it ourselves? Or is there any plan in the future to add such
functions?

No "official" plans to add this to spark-ec2. It’s up to a contributor to
step up and implement this feature, basically. Otherwise it won’t happen.

Nick

On Wed, Jan 27, 2016 at 5:13 PM Alexander Pivovarov 
wrote:

you can use EMR-4.3.0 run on spot instances to control the price
>
> yes, you can add/remove instances to the cluster on fly  (CORE instances
> support add only, TASK instances - add and remove)
>
>
>
> On Wed, Jan 27, 2016 at 2:07 PM, Sung Hwan Chung  > wrote:
>
>> I noticed that in the main branch, the ec2 directory along with the
>> spark-ec2 script is no longer present.
>>
>> Is spark-ec2 going away in the next release? If so, what would be the
>> best alternative at that time?
>>
>> A couple more additional questions:
>> 1. Is there any way to add/remove additional workers while the cluster is
>> running without stopping/starting the EC2 cluster?
>> 2. For 1, if no such capability is provided with the current script., do
>> we have to write it ourselves? Or is there any plan in the future to add
>> such functions?
>> 2. In PySpark, is it possible to dynamically change driver/executor
>> memory, number of cores per executor without having to restart it? (e.g.
>> via changing sc configuration or recreating sc?)
>>
>> Our ideal scenario is to keep running PySpark (in our case, as a
>> notebook) and connect/disconnect to any spark clusters on demand.
>>
>
> ​


Re: Multiple spark contexts

2016-01-27 Thread Nicholas Chammas
There is a lengthy discussion about this on the JIRA:
https://issues.apache.org/jira/browse/SPARK-2243

On Wed, Jan 27, 2016 at 1:43 PM Herman van Hövell tot Westerflier <
hvanhov...@questtec.nl> wrote:

> Just out of curiousity. What is the use case for having multiple active
> contexts in a single JVM?
>
> Kind regards,
>
> Herman van Hövell
>
> 2016-01-27 19:41 GMT+01:00 Ashish Soni :
>
>> There is a property you need to set which is
>> spark.driver.allowMultipleContexts=true
>>
>> Ashish
>>
>> On Wed, Jan 27, 2016 at 1:39 PM, Jakob Odersky  wrote:
>>
>>> A while ago, I remember reading that multiple active Spark contexts
>>> per JVM was a possible future enhancement.
>>> I was wondering if this is still on the roadmap, what the major
>>> obstacles are and if I can be of any help in adding this feature?
>>>
>>> regards,
>>> --Jakob
>>>
>>> -
>>> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: dev-h...@spark.apache.org
>>>
>>>
>>
>


[jira] [Commented] (SPARK-5189) Reorganize EC2 scripts so that nodes can be provisioned independent of Spark master

2016-01-27 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-5189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15119220#comment-15119220
 ] 

Nicholas Chammas commented on SPARK-5189:
-

FWIW, I found this issue to be practically unsolvable without rewriting most of 
spark-ec2, so I started a new project that aims to replace spark-ec2 for most 
of its use cases: [Flintrock|https://github.com/nchammas/flintrock]

> Reorganize EC2 scripts so that nodes can be provisioned independent of Spark 
> master
> ---
>
> Key: SPARK-5189
> URL: https://issues.apache.org/jira/browse/SPARK-5189
> Project: Spark
>  Issue Type: Improvement
>  Components: EC2
>    Reporter: Nicholas Chammas
>
> As of 1.2.0, we launch Spark clusters on EC2 by setting up the master first, 
> then setting up all the slaves together. This includes broadcasting files 
> from the lonely master to potentially hundreds of slaves.
> There are 2 main problems with this approach:
> # Broadcasting files from the master to all slaves using 
> [{{copy-dir}}|https://github.com/mesos/spark-ec2/blob/branch-1.3/copy-dir.sh] 
> (e.g. during [ephemeral-hdfs 
> init|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/ephemeral-hdfs/init.sh#L36],
>  or during [Spark 
> setup|https://github.com/mesos/spark-ec2/blob/3a95101c70e6892a8a48cc54094adaed1458487a/spark/setup.sh#L3])
>  takes a long time. This time increases as the number of slaves increases.
>  I did some testing in {{us-east-1}}. This is, concretely, what the problem 
> looks like:
>  || number of slaves ({{m3.large}}) || launch time (best of 6 tries) ||
> | 1 | 8m 44s |
> | 10 | 13m 45s |
> | 25 | 22m 50s |
> | 50 | 37m 30s |
> | 75 | 51m 30s |
> | 99 | 1h 5m 30s |
>  Unfortunately, I couldn't report on 100 slaves or more due to SPARK-6246, 
> but I think the point is clear enough.
>  We can extrapolate from this data that *every additional slave adds roughly 
> 35 seconds to the launch time* (so a cluster with 100 slaves would take 1h 6m 
> 5s to launch).
> # It's more complicated to add slaves to an existing cluster (a la 
> [SPARK-2008]), since slaves are only configured through the master during the 
> setup of the master itself.
> Logically, the operations we want to implement are:
> * Provision a Spark node
> * Join a node to a cluster (including an empty cluster) as either a master or 
> a slave
> * Remove a node from a cluster
> We need our scripts to roughly be organized to match the above operations. 
> The goals would be:
> # When launching a cluster, enable all cluster nodes to be provisioned in 
> parallel, removing the master-to-slave file broadcast bottleneck.
> # Facilitate cluster modifications like adding or removing nodes.
> # Enable exploration of infrastructure tools like 
> [Terraform|https://www.terraform.io/] that might simplify {{spark-ec2}} 
> internals and perhaps even allow us to build [one tool that launches Spark 
> clusters on several different cloud 
> platforms|https://groups.google.com/forum/#!topic/terraform-tool/eD23GLLkfDw].
> More concretely, the modifications we need to make are:
> * Replace all occurrences of {{copy-dir}} or {{rsync}}-to-slaves with 
> equivalent, slave-side operations.
> * Repurpose {{setup-slave.sh}} as {{provision-spark-node.sh}} and make sure 
> it fully creates a node that can be used as either a master or slave.
> * Create a new script, {{join-to-cluster.sh}}, that takes a provisioned node, 
> configures it as a master or slave, and joins it to a cluster.
> * Move any remaining logic in {{setup.sh}} up to {{spark_ec2.py}} and delete 
> that script.






[PyInstaller] Examples of projects using PyInstaller

2016-01-26 Thread Nicholas Chammas


Howdy,

I’m looking for examples of projects using PyInstaller on a regular basis 
to package and release their work. I’m getting ready to make my first 
release using PyInstaller , 
and as a Python newbie I think it would be instructive for me to review how 
other people use PyInstaller as I dive in myself.

The specific things I’m looking to see examples of in other people’s 
projects are:

   - How do they invoke PyInstaller? If they use a .spec file, what does it 
   look like and do they check it in to their VCS? 
   - Do they restructure their project in any way to better support 
   PyInstaller? 
   - Do they use CI services like Travis and AppVeyor to automatically test 
   their packaging process and generate release artifacts for the various OSes? 
   - How do they instruct their users to install and invoke their 
   PyInstaller packages? 
   - What naming conventions do they use for their PyInstaller packages? 

I’m not looking for anyone to answer all these questions directly; a link 
to a project that uses PyInstaller to make releases is all I need.

Of course, if you contribute to a project that uses PyInstaller and would 
like to chime in with advice, I would welcome that too.

Nick
​



[issue26188] Provide more helpful error message when `await` is called inside non-`async` method

2016-01-23 Thread Nicholas Chammas

New submission from Nicholas Chammas:

Here is the user interaction:

```python
$ python3
Python 3.5.1 (default, Dec  7 2015, 21:59:10) 
[GCC 4.2.1 Compatible Apple LLVM 7.0.0 (clang-700.1.76)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> def oh_hai():
... await something()
  File "", line 2
await something()
  ^
SyntaxError: invalid syntax
```

It would be helpful if Python could tell the user something more specific about 
_why_ the syntax is invalid. Is that possible?

For example, in the case above, an error message along the following lines 
would be much more helpful:

```
SyntaxError: Cannot call `await` inside non-`async` method.
```

Without a hint like this, it's too easy to miss the obvious and waste time 
eye-balling the code, like I did. :-)

--
components: Interpreter Core
messages: 258879
nosy: Nicholas Chammas
priority: normal
severity: normal
status: open
title: Provide more helpful error message when `await` is called inside 
non-`async` method
versions: Python 3.5, Python 3.6




[jira] [Commented] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098887#comment-15098887
 ] 

Nicholas Chammas commented on SPARK-12824:
--

Ah, good catch. This appears to be a known behavior of lambdas in Python: 
http://docs.python-guide.org/en/latest/writing/gotchas/#late-binding-closures
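For reference, the usual workaround is to bind the loop variable at 
lambda-definition time, e.g. via a default argument (sketch of the reporter's 
function, untested):

{code}
def split_RDD_by_key(rdd, key_field, key_values):
    d = dict()
    for key_value in key_values:
        # kv=key_value freezes the current value; otherwise every lambda
        # closes over the same loop variable and sees only its final value.
        d[key_value] = rdd.filter(lambda row, kv=key_value: row[key_field] == kv)
    return d
{code}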

> Failure to maintain consistent RDD references in pyspark
> 
>
> Key: SPARK-12824
> URL: https://issues.apache.org/jira/browse/SPARK-12824
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
>Reporter: Paul Shearer
>
> Below is a simple {{pyspark}} script that tries to split an RDD into a 
> dictionary containing several RDDs. 
> As the *sample run* shows, the script only works if we do a {{collect()}} on 
> the intermediate RDDs as they are created. Of course I would not want to do 
> that in practice, since it doesn't scale.
> What's really strange is, I'm not assigning the intermediate {{collect()}} 
> results to any variable. So the difference in behavior is due solely to a 
> hidden side-effect of the computation triggered by the {{collect()}} call. 
> Spark is supposed to be a very functional framework with minimal side 
> effects. Why is it only possible to get the desired behavior by triggering 
> some mysterious side effect using {{collect()}}? 
> It seems that all the keys in the dictionary are referencing the same object 
> even though in the code they are clearly supposed to be different objects.
> The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
> h3. spark_script.py
> {noformat}
> from pprint import PrettyPrinter
> pp = PrettyPrinter(indent=4).pprint
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )
> 
> def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
> d = dict()
> for key_value in key_values:
> d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
> if collect_in_loop:
> d[key_value].collect()
> return d
> def print_results(d):
> for k in d:
> print k
> pp(d[k].collect())
> 
> rdd = sc.parallelize([
> {'color':'red','size':3},
> {'color':'red', 'size':7},
> {'color':'red', 'size':8},
> {'color':'red', 'size':10},
> {'color':'green', 'size':9},
> {'color':'green', 'size':5},
> {'color':'green', 'size':50},
> {'color':'blue', 'size':4},
> {'color':'purple', 'size':6}])
> key_field = 'color'
> key_values = ['red', 'green', 'blue', 'purple']
> 
> print '### run WITH collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=True)
> print_results(d)
> print '### run WITHOUT collect in loop: '
> d = split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False)
> print_results(d)
> {noformat}
> h3. Sample run in IPython shell
> {noformat}
> In [1]: execfile('spark_script.py')
> ### run WITH collect in loop: 
> blue
> [{   'color': 'blue', 'size': 4}]
> purple
> [{   'color': 'purple', 'size': 6}]
> green
> [   {   'color': 'green', 'size': 9},
> {   'color': 'green', 'size': 5},
> {   'color': 'green', 'size': 50}]
> red
> [   {   'color': 'red', 'size': 3},
> {   'color': 'red', 'size': 7},
> {   'color': 'red', 'size': 8},
> {   'color': 'red', 'size': 10}]
> ### run WITHOUT collect in loop: 
> blue
> [{   'color': 'purple', 'size': 6}]
> purple
> [{   'color': 'purple', 'size': 6}]
> green
> [{   'color': 'purple', 'size': 6}]
> red
> [{   'color': 'purple', 'size': 6}]
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12824) Failure to maintain consistent RDD references in pyspark

2016-01-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12824?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15098336#comment-15098336
 ] 

Nicholas Chammas commented on SPARK-12824:
--

I can reproduce this issue. Here's a more concise reproduction:

{code}
from __future__ import print_function

rdd = sc.parallelize([
{'color':'red','size':3},
{'color':'red', 'size':7},
{'color':'red', 'size':8},
{'color':'red', 'size':10},
{'color':'green', 'size':9},
{'color':'green', 'size':5},
{'color':'green', 'size':50},
{'color':'blue', 'size':4},
{'color':'purple', 'size':6}])


colors = ['purple', 'red', 'green', 'blue']

# Defer collect() till print
color_rdds = {
color: rdd.filter(lambda x: x['color'] == color)
for color in colors
}
for k, v in color_rdds.items():
print(k, v.collect())


# collect() upfront
color_rdds = {
color: rdd.filter(lambda x: x['color'] == color).collect()
for color in colors
}
for k, v in color_rdds.items():
print(k, v)
{code}

Output:

{code}
# Defer collect() till print
purple [{'color': 'blue', 'size': 4}]
blue [{'color': 'blue', 'size': 4}]
green [{'color': 'blue', 'size': 4}]
red [{'color': 'blue', 'size': 4}]

---

# collect() upfront
purple [{'color': 'purple', 'size': 6}]
blue [{'color': 'blue', 'size': 4}]
green [{'color': 'green', 'size': 9}, {'color': 'green', 'size': 5}, {'color': 
'green', 'size': 50}]
red [{'color': 'red', 'size': 3}, {'color': 'red', 'size': 7}, {'color': 'red', 
'size': 8}, {'color': 'red', 'size': 10}]
{code}

Observations:
* The color that gets repeated in the first block of output is always the last 
color in {{colors}}.
* This happens on Python 2 and 3, and with both {{items()}} and {{iteritems()}}.

This smells like an RDD naming issue, or something related to lazy evaluation. 
The filtered RDDs that get generated in the first block under {{color_rdds}} 
don't have names. Then, when they all get {{collect()}}-ed at once, they all 
evaluate to the last filtered RDD.

cc [~davies] / [~joshrosen]

> Failure to maintain consistent RDD references in pyspark
> 
>
> Key: SPARK-12824
> URL: https://issues.apache.org/jira/browse/SPARK-12824
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
> Environment: Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
>Reporter: Paul Shearer
>
> Below is a simple {{pyspark}} script that tries to split an RDD into a 
> dictionary containing several RDDs. 
> As the *sample run* shows, the script only works if we do a {{collect()}} on 
> the intermediate RDDs as they are created. Of course I would not want to do 
> that in practice, since it doesn't scale.
> What's really strange is, I'm not assigning the intermediate {{collect()}} 
> results to any variable. So the difference in behavior is due solely to a 
> hidden side-effect of the computation triggered by the {{collect()}} call. 
> Spark is supposed to be a very functional framework with minimal side 
> effects. Why is it only possible to get the desired behavior by triggering 
> some mysterious side effect using {{collect()}}? 
> It seems that all the keys in the dictionary are referencing the same object 
> even though in the code they are clearly supposed to be different objects.
> The run below is with Spark 1.5.2, Python 2.7.10, and IPython 4.0.0.
> h3. spark_script.py
> {noformat}
> from pprint import PrettyPrinter
> pp = PrettyPrinter(indent=4).pprint
> logger = sc._jvm.org.apache.log4j
> logger.LogManager.getLogger("org"). setLevel( logger.Level.ERROR )
> logger.LogManager.getLogger("akka").setLevel( logger.Level.ERROR )
> 
> def split_RDD_by_key(rdd, key_field, key_values, collect_in_loop=False):
> d = dict()
> for key_value in key_values:
> d[key_value] = rdd.filter(lambda row: row[key_field] == key_value)
> if collect_in_loop:
> d[key_value].collect()
> return d
> def print_results(d):
> for k in d:
> print k
> pp(d[k].collect())
> 
> rdd = sc.parallelize([
> {'color':'red','size':3},
> {'color':'red', 'size':7},
> {'color':'red', 'size':8},
> {'color':'red', 'size':10},
> {'color':'green', 'size':9},
> {'color':'green', 'size':5},
> {'color':'green', 'size':50},
> {'color':'blue', 'size':4},
> {'color':'purple', 'size':6}])
> key_field = 'color'
> key_values = ['red', 'green', 'blue', 'purple']
> 
> print '### run WITH collect in loop: '

[issue26035] traceback.print_tb() takes `tb`, not `traceback` as a keyword argument

2016-01-06 Thread Nicholas Chammas

New submission from Nicholas Chammas:

Here is traceback.print_tb()'s signature [0]:

```
def print_tb(tb, limit=None, file=None):
```

However, its documentation reads [1]:

```
.. function:: print_tb(traceback, limit=None, file=None)
```

Did the keyword argument change recently, or was this particular doc always 
wrong?

[0] 
https://github.com/python/cpython/blob/1fe0fd9feb6a4472a9a1b186502eb9c0b2366326/Lib/traceback.py#L43
[1] 
https://raw.githubusercontent.com/python/cpython/1fe0fd9feb6a4472a9a1b186502eb9c0b2366326/Doc/library/traceback.rst

--
assignee: docs@python
components: Documentation
messages: 257670
nosy: Nicholas Chammas, docs@python
priority: normal
severity: normal
status: open
title: traceback.print_tb() takes `tb`, not `traceback` as a keyword argument
versions: Python 3.5, Python 3.6




Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
+1

Red Hat supports Python 2.6 on RHEL 5 until 2020
, but
otherwise yes, Python 2.6 is ancient history and the core Python developers
stopped supporting it in 2013. RHEL 5 is not a good enough reason to
continue support for Python 2.6 IMO.

We should aim to support Python 2.7 and Python 3.3+ (which I believe we
currently do).

Nick

On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang  wrote:

> plus 1,
>
> we are currently using python 2.7.2 in production environment.
>
>
>
>
>
> On 2016-01-05 18:11:45, "Meethu Mathew"  wrote:
>
> +1
> We use Python 2.7
>
> Regards,
>
> Meethu Mathew
>
> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin  wrote:
>
>> Does anybody here care about us dropping support for Python 2.6 in Spark
>> 2.0?
>>
>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>> parsing) when compared with Python 2.7. Some libraries that Spark depend on
>> stopped supporting 2.6. We can still convince the library maintainers to
>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>> Python 2.6 to run Spark.
>>
>> Thanks.
>>
>>
>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
As I pointed out in my earlier email, RHEL will support Python 2.6 until
2020. So I'm assuming these large companies will have the option of riding
out Python 2.6 until then.

Are we seriously saying that Spark should likewise support Python 2.6 for
the next several years? Even though the core Python devs stopped supporting
it in 2013?

If that's not what we're suggesting, then when, roughly, can we drop
support? What are the criteria?

I understand the practical concern here. If companies are stuck using 2.6,
it doesn't matter to them that it is deprecated. But balancing that concern
against the maintenance burden on this project, I would say that "upgrade
to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to take.
There are many tiny annoyances one has to put up with to support 2.6.
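A couple of small examples of what I mean:

    "{0} {1}".format("Hello", "world")   # 2.6 needs explicit field indexes
    "{} {}".format("Hello", "world")     # fine on 2.7+, a ValueError on 2.6

    dict((k, k * 2) for k in range(3))   # 2.6 has no dict comprehensions
    {k: k * 2 for k in range(3)}         # 2.7+ only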

I suppose if our main PySpark contributors are fine putting up with those
annoyances, then maybe we don't need to drop support just yet...

Nick
On Tue, Jan 5, 2016 at 2:27 PM, Julio Antonio Soto de Vicente <ju...@esbet.es>
wrote:

> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6 is
> old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> El 5 ene 2016, a las 20:07, Koert Kuipers <ko...@tresata.com> escribió:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the only
> option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <juliet.hougl...@gmail.com
> > wrote:
>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
>> point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge the 2.7 is common, but lagging behind the
>> version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Red Hat supports Python 2.6 on RHEL 5 until 2020
>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>,
>>> but otherwise yes, Python 2.6 is ancient history and the core Python
>>> developers stopped supporting it in 2013. RHEL 5 is not a good enough
>>> reason to continue support for Python 2.6 IMO.
>>>
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>>> currently do).
>>>
>>> Nick
>>>
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <allenzhang...@126.com>
>>> wrote:
>>>
>>>> plus 1,
>>>>
>>>> we are currently using python 2.7.2 in production environment.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 2016-01-05 18:11:45, "Meethu Mathew" <meethu.mat...@flytxt.com> wrote:
>>>>
>>>> +1
>>>> We use Python 2.7
>>>>
>>>> Regards,
>>>>
>>>> Meethu Mathew
>>>>
>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Does anybody here care about us dropping support for Python 2.6 in
>>>>> Spark 2.0?
>>>>>
>>>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>>>> parsing) when compared with Python 2.7. Some libraries that Spark depend 
>>>>> on
>>>>> stopped supporting 2.6. We can still convince the library maintainers to
>>>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>>>> Python 2.6 to run Spark.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
As I pointed out in my earlier email, RHEL will support Python 2.6 until
2020. So I'm assuming these large companies will have the option of riding
out Python 2.6 until then.

Are we seriously saying that Spark should likewise support Python 2.6 for
the next several years? Even though the core Python devs stopped supporting
it in 2013?

If that's not what we're suggesting, then when, roughly, can we drop
support? What are the criteria?

I understand the practical concern here. If companies are stuck using 2.6,
it doesn't matter to them that it is deprecated. But balancing that concern
against the maintenance burden on this project, I would say that "upgrade
to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to take.
There are many tiny annoyances one has to put up with to support 2.6.

I suppose if our main PySpark contributors are fine putting up with those
annoyances, then maybe we don't need to drop support just yet...

Nick
2016년 1월 5일 (화) 오후 2:27, Julio Antonio Soto de Vicente <ju...@esbet.es>님이
작성:

> Unfortunately, Koert is right.
>
> I've been in a couple of projects using Spark (banking industry) where
> CentOS + Python 2.6 is the toolbox available.
>
> That said, I believe it should not be a concern for Spark. Python 2.6 is
> old and busted, which is totally opposite to the Spark philosophy IMO.
>
>
> El 5 ene 2016, a las 20:07, Koert Kuipers <ko...@tresata.com> escribió:
>
> rhel/centos 6 ships with python 2.6, doesnt it?
>
> if so, i still know plenty of large companies where python 2.6 is the only
> option. asking them for python 2.7 is not going to work
>
> so i think its a bad idea
>
> On Tue, Jan 5, 2016 at 1:52 PM, Juliet Hougland <juliet.hougl...@gmail.com
> > wrote:
>
>> I don't see a reason Spark 2.0 would need to support Python 2.6. At this
>> point, Python 3 should be the default that is encouraged.
>> Most organizations acknowledge that 2.7 is common, but lagging behind the
>> version they should theoretically use. Dropping python 2.6
>> support sounds very reasonable to me.
>>
>> On Tue, Jan 5, 2016 at 5:45 AM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> +1
>>>
>>> Red Hat supports Python 2.6 on RHEL 5 until 2020
>>> <https://alexgaynor.net/2015/mar/30/red-hat-open-source-community/>,
>>> but otherwise yes, Python 2.6 is ancient history and the core Python
>>> developers stopped supporting it in 2013. RHEL 5 is not a good enough
>>> reason to continue support for Python 2.6 IMO.
>>>
>>> We should aim to support Python 2.7 and Python 3.3+ (which I believe we
>>> currently do).
>>>
>>> Nick
>>>
>>> On Tue, Jan 5, 2016 at 8:01 AM Allen Zhang <allenzhang...@126.com>
>>> wrote:
>>>
>>>> plus 1,
>>>>
>>>> we are currently using python 2.7.2 in production environment.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> 在 2016-01-05 18:11:45,"Meethu Mathew" <meethu.mat...@flytxt.com> 写道:
>>>>
>>>> +1
>>>> We use Python 2.7
>>>>
>>>> Regards,
>>>>
>>>> Meethu Mathew
>>>>
>>>> On Tue, Jan 5, 2016 at 12:47 PM, Reynold Xin <r...@databricks.com>
>>>> wrote:
>>>>
>>>>> Does anybody here care about us dropping support for Python 2.6 in
>>>>> Spark 2.0?
>>>>>
>>>>> Python 2.6 is ancient, and is pretty slow in many aspects (e.g. json
>>>>> parsing) when compared with Python 2.7. Some libraries that Spark depends 
>>>>> on
>>>>> stopped supporting 2.6. We can still convince the library maintainers to
>>>>> support 2.6, but it will be extra work. I'm curious if anybody still uses
>>>>> Python 2.6 to run Spark.
>>>>>
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>
>>
>


Re: [core-workflow] My initial thoughts on the steps/blockers of the transition

2016-01-05 Thread Nicholas Chammas
We can set a commit status that will show red if the user hasn’t signed the
CLA (just like if Travis tests failed or so). No need to use a banner or
anything.

This is a great idea. Almost any automated check we want to run against PRs
can be captured as a Travis/CI test that shows up on the PR with its own
status.
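
For illustration, a rough sketch of what posting such a status looks like
against GitHub's commit-status REST API (the repo, commit SHA, token, target
URL, and the "cla-check" context name below are all placeholder values):

import requests  # third-party HTTP library, assumed to be installed

OWNER, REPO = "example-org", "example-repo"  # placeholders
SHA = "abc123"                               # head commit of the PR (placeholder)
TOKEN = "placeholder-token"                  # a token with repo:status scope

resp = requests.post(
    "https://api.github.com/repos/%s/%s/statuses/%s" % (OWNER, REPO, SHA),
    headers={"Authorization": "token %s" % TOKEN},
    json={
        "state": "failure",      # flips to "success" once the CLA is signed
        "context": "cla-check",  # the name shown next to the red/green mark
        "description": "Contributor has not signed the CLA",
        "target_url": "https://example.com/cla",  # placeholder link for the contributor
    },
)
resp.raise_for_status()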

Nick
​

On Tue, Jan 5, 2016 at 5:40 PM Brett Cannon  wrote:

> On Tue, 5 Jan 2016 at 14:19 Eric Snow  wrote:
>
>> On Tue, Jan 5, 2016 at 11:13 AM, Brett Cannon  wrote:
>> > Day 1 summary
>> > 
>> >
>> > Decisions made
>> > ---
>> >
>> > Open issues
>> > ---
>>
>> And a couple things that we are punting on:
>>
>> * code review tool (if GH proves undesirable)
>>
>
> Well, that's implicit if we find GitHub doesn't work for us for code
> review. I don't think it requires explicitly calling it out.
>
>
>> * separate (sub)repos for docs/tutorials--they could have a less
>> restricted workflow than the rest of the cpython repo, a la the
>> devguide
>>
>
> Sure, it can be mentioned.
>
>
>>
>> Both of these can wait until later, though they still deserve mention
>> in the PEP.
>
>
> You really don't like GitHub's review tool, huh? ;)
> ___
> core-workflow mailing list
> core-workflow@python.org
> https://mail.python.org/mailman/listinfo/core-workflow
> This list is governed by the PSF Code of Conduct:
> https://www.python.org/psf/codeofconduct
___
core-workflow mailing list
core-workflow@python.org
https://mail.python.org/mailman/listinfo/core-workflow
This list is governed by the PSF Code of Conduct: 
https://www.python.org/psf/codeofconduct

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
I think all the slaves need the same (or a compatible) version of Python
installed since they run Python code in PySpark jobs natively.
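
As a rough illustration (assuming an already-running cluster and a local
PySpark install; nothing here is project-specific), the snippet below just
reports the interpreter version on the driver versus the workers; the
worker-side lambda is exactly the kind of Python code that runs natively on
each slave:

import platform

from pyspark import SparkContext

sc = SparkContext(appName="python-version-check")

print("driver: ", platform.python_version())
# The lambda runs on the executors, not the driver, so the collected values
# reflect the interpreters the workers are actually using.
print("workers:", set(sc.parallelize(range(sc.defaultParallelism))
                      .map(lambda _: platform.python_version())
                      .collect()))

PYSPARK_PYTHON is the usual way to point the whole cluster at one specific
interpreter.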

On Tue, Jan 5, 2016 at 6:02 PM Koert Kuipers <ko...@tresata.com> wrote:

> interesting i didnt know that!
>
> On Tue, Jan 5, 2016 at 5:57 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> even if python 2.7 was needed only on this one machine that launches the
>> app we can not ship it with our software because its gpl licensed
>>
>> Not to nitpick, but maybe this is important. The Python license is 
>> GPL-compatible
>> but not GPL <https://docs.python.org/3/license.html>:
>>
>> Note GPL-compatible doesn’t mean that we’re distributing Python under the
>> GPL. All Python licenses, unlike the GPL, let you distribute a modified
>> version without making your changes open source. The GPL-compatible
>> licenses make it possible to combine Python with other software that is
>> released under the GPL; the others don’t.
>>
>> Nick
>> ​
>>
>> On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> i do not think so.
>>>
>>> does the python 2.7 need to be installed on all slaves? if so, we do not
>>> have direct access to those.
>>>
>>> also, spark is easy for us to ship with our software since its apache 2
>>> licensed, and it only needs to be present on the machine that launches the
>>> app (thanks to yarn).
>>> even if python 2.7 was needed only on this one machine that launches the
>>> app we can not ship it with our software because its gpl licensed, so the
>>> client would have to download it and install it themselves, and this would
>>> mean its an independent install which has to be audited and approved and
>>> now you are in for a lot of fun. basically it will never happen.
>>>
>>>
>>> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshro...@databricks.com>
>>> wrote:
>>>
>>>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>>>> imagine that they're also capable of installing a standalone Python
>>>> alongside that Spark version (without changing Python systemwide). For
>>>> instance, Anaconda/Miniconda make it really easy to install Python
>>>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>>>> require any special permissions to install (you don't need root / sudo
>>>> access). Does this address the Python versioning concerns for RHEL users?
>>>>
>>>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com>
>>>> wrote:
>>>>
>>>>> yeah, the practical concern is that we have no control over java or
>>>>> python version on large company clusters. our current reality for the vast
>>>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>>>
>>>>> i dont like it either, but i cannot change it.
>>>>>
>>>>> we currently don't use pyspark so i have no stake in this, but if we
>>>>> did i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>>>> dropped. no point in developing something that doesnt run for majority of
>>>>> customers.
>>>>>
>>>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>>>> nicholas.cham...@gmail.com> wrote:
>>>>>
>>>>>> As I pointed out in my earlier email, RHEL will support Python 2.6
>>>>>> until 2020. So I'm assuming these large companies will have the option of
>>>>>> riding out Python 2.6 until then.
>>>>>>
>>>>>> Are we seriously saying that Spark should likewise support Python 2.6
>>>>>> for the next several years? Even though the core Python devs stopped
>>>>>> supporting it in 2013?
>>>>>>
>>>>>> If that's not what we're suggesting, then when, roughly, can we drop
>>>>>> support? What are the criteria?
>>>>>>
>>>>>> I understand the practical concern here. If companies are stuck using
>>>>>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>>>>>> concern against the maintenance burden on this project, I would say that
>>>>>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position 
>>>>>> to
>>>>>> take. T

Re: [discuss] dropping Python 2.6 support

2016-01-05 Thread Nicholas Chammas
even if python 2.7 was needed only on this one machine that launches the
app we can not ship it with our software because its gpl licensed

Not to nitpick, but maybe this is important. The Python license is
GPL-compatible
but not GPL <https://docs.python.org/3/license.html>:

Note GPL-compatible doesn’t mean that we’re distributing Python under the
GPL. All Python licenses, unlike the GPL, let you distribute a modified
version without making your changes open source. The GPL-compatible
licenses make it possible to combine Python with other software that is
released under the GPL; the others don’t.

Nick
​

On Tue, Jan 5, 2016 at 5:49 PM Koert Kuipers <ko...@tresata.com> wrote:

> i do not think so.
>
> does the python 2.7 need to be installed on all slaves? if so, we do not
> have direct access to those.
>
> also, spark is easy for us to ship with our software since its apache 2
> licensed, and it only needs to be present on the machine that launches the
> app (thanks to yarn).
> even if python 2.7 was needed only on this one machine that launches the
> app we can not ship it with our software because its gpl licensed, so the
> client would have to download it and install it themselves, and this would
> mean its an independent install which has to be audited and approved and
> now you are in for a lot of fun. basically it will never happen.
>
>
> On Tue, Jan 5, 2016 at 5:35 PM, Josh Rosen <joshro...@databricks.com>
> wrote:
>
>> If users are able to install Spark 2.0 on their RHEL clusters, then I
>> imagine that they're also capable of installing a standalone Python
>> alongside that Spark version (without changing Python systemwide). For
>> instance, Anaconda/Miniconda make it really easy to install Python
>> 2.7.x/3.x without impacting / changing the system Python and doesn't
>> require any special permissions to install (you don't need root / sudo
>> access). Does this address the Python versioning concerns for RHEL users?
>>
>> On Tue, Jan 5, 2016 at 2:33 PM, Koert Kuipers <ko...@tresata.com> wrote:
>>
>>> yeah, the practical concern is that we have no control over java or
>>> python version on large company clusters. our current reality for the vast
>>> majority of them is java 7 and python 2.6, no matter how outdated that is.
>>>
>>> i dont like it either, but i cannot change it.
>>>
>>> we currently don't use pyspark so i have no stake in this, but if we did
>>> i can assure you we would not upgrade to spark 2.x if python 2.6 was
>>> dropped. no point in developing something that doesnt run for majority of
>>> customers.
>>>
>>> On Tue, Jan 5, 2016 at 5:19 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> As I pointed out in my earlier email, RHEL will support Python 2.6
>>>> until 2020. So I'm assuming these large companies will have the option of
>>>> riding out Python 2.6 until then.
>>>>
>>>> Are we seriously saying that Spark should likewise support Python 2.6
>>>> for the next several years? Even though the core Python devs stopped
>>>> supporting it in 2013?
>>>>
>>>> If that's not what we're suggesting, then when, roughly, can we drop
>>>> support? What are the criteria?
>>>>
>>>> I understand the practical concern here. If companies are stuck using
>>>> 2.6, it doesn't matter to them that it is deprecated. But balancing that
>>>> concern against the maintenance burden on this project, I would say that
>>>> "upgrade to Python 2.7 or stay on Spark 1.6.x" is a reasonable position to
>>>> take. There are many tiny annoyances one has to put up with to support 2.6.
>>>>
>>>> I suppose if our main PySpark contributors are fine putting up with
>>>> those annoyances, then maybe we don't need to drop support just yet...
>>>>
>>>> Nick
>>>> 2016년 1월 5일 (화) 오후 2:27, Julio Antonio Soto de Vicente <ju...@esbet.es>님이
>>>> 작성:
>>>>
>>>>> Unfortunately, Koert is right.
>>>>>
>>>>> I've been in a couple of projects using Spark (banking industry) where
>>>>> CentOS + Python 2.6 is the toolbox available.
>>>>>
>>>>> That said, I believe it should not be a concern for Spark. Python 2.6
>>>>> is old and busted, which is totally opposite to the Spark philosophy IMO.
>>>>>
>>>>>
>>>>> El 5 ene 2016, a las 20:07, Koert Kuipers <ko...@tresata.com>
>>>>> 

Re: [core-workflow] Standard library separation from core (was Re: My initial thoughts on the steps/blockers of the transition)

2016-01-04 Thread Nicholas Chammas
Thanks for sharing that background, Nick.

Instead, the main step which has been taken (driven in no small part
by the Python 3 transition) is the creation of PyPI counterparts for
modules that see substantial updates that are backwards compatible
with earlier versions (importlib2, for example, lets you use the
Python 3 import system in Python 2).

So is the intention that, over the long term, these PyPI counterparts would
cannibalize their standard library equivalents in terms of usage?

Nick
​

On Mon, Jan 4, 2016 at 10:38 PM Nick Coghlan <ncogh...@gmail.com> wrote:

> On 5 January 2016 at 12:50, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
> > Something else to consider. We’ve long talked about splitting out the
> stdlib
> > to make it easier for the alternative implementations to import. If some
> or
> > all of them also switch to git, we could do that pretty easily with git
> > submodules.
> >
> > Not to derail here, but wasn’t there a discussion (perhaps on
> python-ideas)
> > about slowly moving to a model where we distribute a barebones Python
> > “core”, allowing the standard modules to be updated and released on a
> more
> > frequent cycle? Would this be one small step towards such a model?
>
> That discussion has been going on for years :)
>
> The most extensive elaboration is in the related PEPs:
>
> PEP 407 considered the idea of distinguishing normal releases and LTS
> releases: https://www.python.org/dev/peps/pep-0407/
> PEP 413 considered decoupling standard library versions from language
> versions: https://www.python.org/dev/peps/pep-0413/
>
> The ripple effect of either proposal on the wider community would have
> been huge though, hence why 407 is Deferred and 413 Withdrawn.
>
> Instead, the main step which has been taken (driven in no small part
> by the Python 3 transition) is the creation of PyPI counterparts for
> modules that see substantial updates that are backwards compatible
> with earlier versions (importlib2, for example, lets you use the
> Python 3 import system in Python 2). Shipping pip by default with the
> interpreter runtime is also pushing people more towards the notion
> that "if you're limiting yourself to the standard library, you're
> experiencing only a fraction of what the Python ecosystem has to offer
> you".
>
> We don't currently do a great job of making those libraries
> *discoverable* by end users, but they're available if you know to look
> for them (there's an incomplete list at
>
> https://wiki.python.org/moin/Python2orPython3#Supporting_Python_2_and_Python_3_in_a_common_code_base
> )
>
> pip's inclusion was also the first instance of CPython shipping a
> *bundled* library that isn't maintained through the CPython
> development process - each new maintenance release of CPython ships
> the latest upstream version of pip, rather than being locked to the
> version of pip that shipped with the corresponding x.y.0 release.
>
> Cheers,
> Nick.
>
> --
> Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
>
___
core-workflow mailing list
core-workflow@python.org
https://mail.python.org/mailman/listinfo/core-workflow
This list is governed by the PSF Code of Conduct: 
https://www.python.org/psf/codeofconduct

Re: Downloading Hadoop from s3://spark-related-packages/

2015-12-24 Thread Nicholas Chammas
not that likely to get an answer as it’s really a support call, not a
bug/task.

The first question is about proper documentation of all the stuff we’ve
been discussing in this thread, so one would think that’s a valid task. It
doesn’t seem right that closer.lua, for example, is undocumented. Either
it’s not meant for public use (and I am not an intended user), or there
should be something out there that explains how to use it.

I’m not looking for much; just some basic info that covers the various
things I’ve had to piece together from mailing lists and Google.

there’s no mirroring, if you install to lots of machines your download time
will be slow. You could automate it though, do something like D/L, upload
to your own bucket, do an s3 GET.

Yeah, this is what I’m probably going to do eventually—just use my own S3
bucket.
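
As a concrete sketch of that do-it-yourself approach (the bucket name and
artifact path below are hypothetical, and it assumes boto3 plus AWS
credentials are already set up):

import json
import urllib.request

import boto3  # AWS SDK for Python

ARTIFACT = "hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz"
BUCKET = "my-provisioning-bucket"  # hypothetical bucket you control

# Ask the mirror-resolution script for the closest ("preferred") mirror.
with urllib.request.urlopen(
        "https://www.apache.org/dyn/closer.cgi/%s?as_json=1" % ARTIFACT) as resp:
    preferred = json.loads(resp.read().decode("utf-8"))["preferred"]

# Download once from the preferred mirror, then push the file to your own
# bucket so automated provisioning pulls from S3 instead of the ASF mirrors.
local_name = ARTIFACT.rsplit("/", 1)[-1]
urllib.request.urlretrieve(preferred.rstrip("/") + "/" + ARTIFACT, local_name)
boto3.client("s3").upload_file(local_name, BUCKET, local_name)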

It’s disappointing that, at least as far as I can tell, the Apache
foundation doesn’t have a fast CDN or something like that to serve its
files. So users like me are left needing to come up with their own solution
if they regularly download Apache software to many machines in an automated
fashion.

Now, perhaps Apache mirrors are not meant to be used in this way. Perhaps
they’re just meant for people to do the one-off download to their personal
machines and that’s it. That’s totally fine! But that goes back to my first
question from the ticket—there should be a simple doc that spells this out
for us if that’s the case: “Don’t use the mirror network for automated
provisioning/deployments.” That would suffice. But as things stand now, I
have to guess and wonder at this stuff.

Nick
​

On Thu, Dec 24, 2015 at 5:43 AM Steve Loughran <ste...@hortonworks.com>
wrote:

>
> On 24 Dec 2015, at 05:59, Nicholas Chammas <nicholas.cham...@gmail.com>
> wrote:
>
> FYI: I opened an INFRA ticket with questions about how best to use the
> Apache mirror network.
>
> https://issues.apache.org/jira/browse/INFRA-10999
>
> Nick
>
>
>
> not that likely to get an answer as it's really a support call, not a
> bug/task. You never know though.
>
> There's another way to get at binaries, which is check them out direct
> from SVN
>
> https://dist.apache.org/repos/dist/release/
>
> This is a direct view into how you release things in the ASF (you just
> create a new dir under your project, copy the files and then do an svn
> commit; I believe the replicated servers may just do svn update on their
> local cache.
>
> there's no mirroring, if you install to lots of machines your download
> time will be slow. You could automate it though, do something like D/L,
> upload to your own bucket, do an s3 GET.
>


Re: A proposal for Spark 2.0

2015-12-23 Thread Nicholas Chammas
Yeah, I'd also favor maintaining docs with strictly temporary relevance on
JIRA when possible. The wiki is like this weird backwater I only rarely
visit.

Don't we typically do this kind of stuff with an umbrella issue on JIRA?
Tom, wouldn't that work well for you?

Nick

On Wed, Dec 23, 2015 at 5:06 AM Sean Owen  wrote:

> I think this will be hard to maintain; we already have JIRA as the de
> facto central place to store discussions and prioritize work, and the
> 2.x stuff is already a JIRA. The wiki doesn't really hurt, just
> probably will never be looked at again. Let's point people in all
> cases to JIRA.
>
> On Tue, Dec 22, 2015 at 11:52 PM, Reynold Xin  wrote:
> > I started a wiki page:
> >
> https://cwiki.apache.org/confluence/display/SPARK/Development+Discussions
> >
> >
> > On Tue, Dec 22, 2015 at 6:27 AM, Tom Graves 
> wrote:
> >>
> >> Do we have a summary of all the discussions and what is planned for 2.0
> >> then?  Perhaps we should put on the wiki for reference.
> >>
> >> Tom
> >>
> >>
> >> On Tuesday, December 22, 2015 12:12 AM, Reynold Xin <
> r...@databricks.com>
> >> wrote:
> >>
> >>
> >> FYI I updated the master branch's Spark version to 2.0.0-SNAPSHOT.
> >>
> >> On Tue, Nov 10, 2015 at 3:10 PM, Reynold Xin 
> wrote:
> >>
> >> I’m starting a new thread since the other one got intermixed with
> feature
> >> requests. Please refrain from making feature request in this thread. Not
> >> that we shouldn’t be adding features, but we can always add features in
> 1.7,
> >> 2.1, 2.2, ...
> >>
> >> First - I want to propose a premise for how to think about Spark 2.0 and
> >> major releases in Spark, based on discussion with several members of the
> >> community: a major release should be low overhead and minimally
> disruptive
> >> to the Spark community. A major release should not be very different
> from a
> >> minor release and should not be gated based on new features. The main
> >> purpose of a major release is an opportunity to fix things that are
> broken
> >> in the current API and remove certain deprecated APIs (examples follow).
> >>
> >> For this reason, I would *not* propose doing major releases to break
> >> substantial API's or perform large re-architecting that prevent users
> from
> >> upgrading. Spark has always had a culture of evolving architecture
> >> incrementally and making changes - and I don't think we want to change
> this
> >> model. In fact, we’ve released many architectural changes on the 1.X
> line.
> >>
> >> If the community likes the above model, then to me it seems reasonable
> to
> >> do Spark 2.0 either after Spark 1.6 (in lieu of Spark 1.7) or
> immediately
> >> after Spark 1.7. It will be 18 or 21 months since Spark 1.0. A cadence
> of
> >> major releases every 2 years seems doable within the above model.
> >>
> >> Under this model, here is a list of example things I would propose doing
> >> in Spark 2.0, separated into APIs and Operation/Deployment:
> >>
> >>
> >> APIs
> >>
> >> 1. Remove interfaces, configs, and modules (e.g. Bagel) deprecated in
> >> Spark 1.x.
> >>
> >> 2. Remove Akka from Spark’s API dependency (in streaming), so user
> >> applications can use Akka (SPARK-5293). We have gotten a lot of
> complaints
> >> about user applications being unable to use Akka due to Spark’s
> dependency
> >> on Akka.
> >>
> >> 3. Remove Guava from Spark’s public API (JavaRDD Optional).
> >>
> >> 4. Better class package structure for low level developer API’s. In
> >> particular, we have some DeveloperApi (mostly various listener-related
> >> classes) added over the years. Some packages include only one or two
> public
> >> classes but a lot of private classes. A better structure is to have
> public
> >> classes isolated to a few public packages, and these public packages
> should
> >> have minimal private classes for low level developer APIs.
> >>
> >> 5. Consolidate task metric and accumulator API. Although having some
> >> subtle differences, these two are very similar but have completely
> different
> >> code path.
> >>
> >> 6. Possibly making Catalyst, Dataset, and DataFrame more general by
> moving
> >> them to other package(s). They are already used beyond SQL, e.g. in ML
> >> pipelines, and will be used by streaming also.
> >>
> >>
> >> Operation/Deployment
> >>
> >> 1. Scala 2.11 as the default build. We should still support Scala 2.10,
> >> but it has been end-of-life.
> >>
> >> 2. Remove Hadoop 1 support.
> >>
> >> 3. Assembly-free distribution of Spark: don’t require building an
> enormous
> >> assembly jar in order to run Spark.
> >>
> >>
> >>
> >>
> >
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>
>


Re: Downloading Hadoop from s3://spark-related-packages/

2015-12-23 Thread Nicholas Chammas
FYI: I opened an INFRA ticket with questions about how best to use the
Apache mirror network.

https://issues.apache.org/jira/browse/INFRA-10999

Nick

On Mon, Nov 2, 2015 at 8:00 AM Luciano Resende <luckbr1...@gmail.com> wrote:

> I am getting the same results using closer.lua versus closer.cgi, which
> seems to be downloading a page where the user can choose the closest
> mirror. I tried to add parameters to follow redirect without much success.
> There seems to be already a jira for a similar request with infra:
> https://issues.apache.org/jira/browse/INFRA-10240.
>
> A workaround is to use a url pointing to the mirror directly.
>
> curl -O -L
> http://ftp.unicamp.br/pub/apache/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz
>
> I second the lack of documentation on what is available with these
> scripts, I'll see if I can find the source and try to see other options.
>
>
> On Sun, Nov 1, 2015 at 8:40 PM, Shivaram Venkataraman <
> shiva...@eecs.berkeley.edu> wrote:
>
>> I think the lua one at
>>
>> https://svn.apache.org/repos/asf/infrastructure/site/trunk/content/dyn/closer.lua
>> has replaced the cgi one from before. Also it looks like the lua one
>> also supports `action=download` with a filename argument. So you could
>> just do something like
>>
>> wget
>> http://www.apache.org/dyn/closer.lua?filename=hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz&action=download
>>
>> Thanks
>> Shivaram
>>
>> On Sun, Nov 1, 2015 at 3:18 PM, Nicholas Chammas
>> <nicholas.cham...@gmail.com> wrote:
>> > Oh, sweet! For example:
>> >
>> >
>> http://www.apache.org/dyn/closer.cgi/hadoop/common/hadoop-2.7.1/hadoop-2.7.1.tar.gz?asjson=1
>> >
>> > Thanks for sharing that tip. Looks like you can also use as_json (vs.
>> > asjson).
>> >
>> > Nick
>> >
>> >
>> > On Sun, Nov 1, 2015 at 5:32 PM Shivaram Venkataraman
>> > <shiva...@eecs.berkeley.edu> wrote:
>> >>
>> >> On Sun, Nov 1, 2015 at 2:16 PM, Nicholas Chammas
>> >> <nicholas.cham...@gmail.com> wrote:
>> >> > OK, I’ll focus on the Apache mirrors going forward.
>> >> >
>> >> > The problem with the Apache mirrors, if I am not mistaken, is that
>> you
>> >> > cannot use a single URL that automatically redirects you to a working
>> >> > mirror
>> >> > to download Hadoop. You have to pick a specific mirror and pray it
>> >> > doesn’t
>> >> > disappear tomorrow.
>> >> >
>> >> > They don’t go away, especially http://mirror.ox.ac.uk , and in the
>> us
>> >> > the
>> >> > apache.osuosl.org, osu being a where a lot of the ASF servers are
>> kept.
>> >> >
>> >> > So does Apache offer no way to query a URL and automatically get the
>> >> > closest
>> >> > working mirror? If I’m installing HDFS onto servers in various EC2
>> >> > regions,
>> >> > the best mirror will vary depending on my location.
>> >> >
>> >> Not sure if this is officially documented somewhere but if you pass
>> >> 'asjson=1' you will get back a JSON which has a 'preferred' field set
>> >> to the closest mirror.
>> >>
>> >> Shivaram
>> >> > Nick
>> >> >
>> >> >
>> >> > On Sun, Nov 1, 2015 at 12:25 PM Shivaram Venkataraman
>> >> > <shiva...@eecs.berkeley.edu> wrote:
>> >> >>
>> >> >> I think that getting them from the ASF mirrors is a better strategy
>> in
>> >> >> general as it'll remove the overhead of keeping the S3 bucket up to
>> >> >> date. It works in the spark-ec2 case because we only support a
>> limited
>> >> >> number of Hadoop versions from the tool. FWIW I don't have write
>> >> >> access to the bucket and also haven't heard of any plans to support
>> >> >> newer versions in spark-ec2.
>> >> >>
>> >> >> Thanks
>> >> >> Shivaram
>> >> >>
>> >> >> On Sun, Nov 1, 2015 at 2:30 AM, Steve Loughran <
>> ste...@hortonworks.com>
>> >> >> wrote:
>> >> >> >
>> >> >> > On 1 Nov 2015, at 03:17, Nicholas Chammas
>> >> >> > <nicholas.cham...@gmail.com>
>> >> >> > wrote:
>> >> >> >
>> >> >

[issue25768] compileall functions do not document or test return values

2015-12-20 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Alright, sounds good to me. Thank you for guiding me through the process!

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25768>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25768] compileall functions do not document or test return values

2015-12-19 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Ah, I see. The setup/teardown stuff runs for each test.

So this is what I did:
* Added a method to add a "bad" source file to the source directory. It gets 
cleaned up with the existing teardown method.
* Used test_importlib to temporarily mutate sys.path as you recommended.

I think this is much closer to what we want. Let me know what you think.
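
In case it helps review, here is a minimal standalone sketch (not the actual
patch, and independent of the test_compileall scaffolding) of the
return-value behavior being tested: a false value when any file fails to
compile, a true value otherwise.

import compileall
import os
import tempfile
import unittest


class CompileallReturnValues(unittest.TestCase):
    def test_compile_dir_reports_failure(self):
        with tempfile.TemporaryDirectory() as source_dir:
            good = os.path.join(source_dir, "good.py")
            bad = os.path.join(source_dir, "bad.py")
            with open(good, "w") as f:
                f.write("x = 1\n")
            with open(bad, "w") as f:
                f.write("def broken(:\n")  # deliberate syntax error
            # A false value while a file in the tree fails to compile...
            self.assertFalse(compileall.compile_dir(source_dir, quiet=True))
            # ...and a true value once everything compiles cleanly.
            os.remove(bad)
            self.assertTrue(compileall.compile_dir(source_dir, quiet=True))


if __name__ == "__main__":
    unittest.main()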

By the way, are there any docs on test_importlib? I couldn't find any.

--
Added file: http://bugs.python.org/file41364/compileall.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25768>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[jira] [Comment Edited] (SPARK-3821) Develop an automated way of creating Spark images (AMI, Docker, and others)

2015-12-18 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3821?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14203280#comment-14203280
 ] 

Nicholas Chammas edited comment on SPARK-3821 at 12/18/15 9:08 PM:
---

After much dilly-dallying, I am happy to present:
* A brief proposal / design doc ([fixed JIRA attachment | 
https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html],
 [md file on GitHub | 
https://github.com/nchammas/spark-ec2/blob/packer/image-build/proposal.md])
* [Initial implementation | 
https://github.com/nchammas/spark-ec2/tree/packer/image-build] and [README | 
https://github.com/nchammas/spark-ec2/blob/packer/image-build/README.md]
* New AMIs generated by this implementation: [Base AMIs | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 
Pre-Installed | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0]

To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47]
 [two | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593]
 lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on 
the {{packer}} branch | 
https://github.com/nchammas/spark-ec2/tree/packer/image-build].

Your candid feedback and/or improvements are most welcome!


was (Author: nchammas):
After much dilly-dallying, I am happy to present:
* A brief proposal / design doc ([fixed JIRA attachment | 
https://issues.apache.org/jira/secure/attachment/12680371/packer-proposal.html],
 [md file on GitHub | 
https://github.com/nchammas/spark-ec2/blob/packer/packer/proposal.md])
* [Initial implementation | 
https://github.com/nchammas/spark-ec2/tree/packer/packer] and [README | 
https://github.com/nchammas/spark-ec2/blob/packer/packer/README.md]
* New AMIs generated by this implementation: [Base AMIs | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/base], [Spark 1.1.0 
Pre-Installed | 
https://github.com/nchammas/spark-ec2/tree/packer/ami-list/1.1.0]

To try out the new AMIs with {{spark-ec2}}, you'll need to update [these | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L47]
 [two | 
https://github.com/apache/spark/blob/7e9d975676d56ace0e84c2200137e4cd4eba074a/ec2/spark_ec2.py#L593]
 lines (well, really, just the first one) to point to [my {{spark-ec2}} repo on 
the {{packer}} branch | 
https://github.com/nchammas/spark-ec2/tree/packer/packer].

Your candid feedback and/or improvements are most welcome!

> Develop an automated way of creating Spark images (AMI, Docker, and others)
> ---
>
> Key: SPARK-3821
> URL: https://issues.apache.org/jira/browse/SPARK-3821
> Project: Spark
>  Issue Type: Improvement
>  Components: Build, EC2
>        Reporter: Nicholas Chammas
>    Assignee: Nicholas Chammas
> Attachments: packer-proposal.html
>
>
> Right now the creation of Spark AMIs or Docker containers is done manually. 
> With tools like [Packer|http://www.packer.io/], we should be able to automate 
> this work, and do so in such a way that multiple types of machine images can 
> be created from a single template.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053977#comment-15053977
 ] 

Nicholas Chammas commented on SPARK-2870:
-

> Do you think its OK to close this issue?

I haven't tested 1.6 yet, but yeah if there is a way to get the functional 
equivalent of 

{code}
SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
{code}

without the waste, as explained in the issue description, then I think we're 
good.

But it looks like from your most recent comment that this is not the case.
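
For reference, a small runnable sketch of that workaround (hypothetical data, 
Spark 1.x SQLContext API), showing the dumps-and-reparse round trip this issue 
would like to avoid:

{code}
import json

from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext(appName="schema-inference-sketch")
sqlContext = SQLContext(sc)

# Heterogeneous values for "a", as in the example from the issue description.
records = sc.parallelize([{"a": 5}, {"a": "cow"}])

# Workaround: serialize back to JSON text so jsonRDD() infers a schema that
# covers the whole data set.
df = sqlContext.jsonRDD(records.map(json.dumps))
df.printSchema()  # "a" comes back as a string, wide enough for both records
{code}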

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>    Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQlContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-2870) Thorough schema inference directly on RDDs of Python dictionaries

2015-12-11 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15053131#comment-15053131
 ] 

Nicholas Chammas commented on SPARK-2870:
-

Go for it. I don't think anyone else is.

> Thorough schema inference directly on RDDs of Python dictionaries
> -
>
> Key: SPARK-2870
> URL: https://issues.apache.org/jira/browse/SPARK-2870
> Project: Spark
>  Issue Type: Sub-task
>  Components: PySpark, SQL
>        Reporter: Nicholas Chammas
>
> h4. Background
> I love the {{SQLContext.jsonRDD()}} and {{SQLContext.jsonFile()}} methods. 
> They process JSON text directly and infer a schema that covers the entire 
> source data set. 
> This is very important with semi-structured data like JSON since individual 
> elements in the data set are free to have different structures. Matching 
> fields across elements may even have different value types.
> For example:
> {code}
> {"a": 5}
> {"a": "cow"}
> {code}
> To get a queryable schema that covers the whole data set, you need to infer a 
> schema by looking at the whole data set. The aforementioned 
> {{SQLContext.json...()}} methods do this very well. 
> h4. Feature Request
> What we need is for {{SQlContext.inferSchema()}} to do this, too. 
> Alternatively, we need a new {{SQLContext}} method that works on RDDs of 
> Python dictionaries and does something functionally equivalent to this:
> {code}
> SQLContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
> {code}
> As of 1.0.2, 
> [{{inferSchema()}}|http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema]
>  just looks at the first element in the data set. This won't help much when 
> the structure of the elements in the target RDD is variable.
> h4. Example Use Case
> * You have some JSON text data that you want to analyze using Spark SQL. 
> * You would use one of the {{SQLContext.json...()}} methods, but you need to 
> do some filtering on the data first to remove bad elements--basically, some 
> minimal schema validation.
> * You deserialize the JSON objects to Python {{dict}} s and filter out the 
> bad ones. You now have an RDD of dictionaries.
> * From this RDD, you want a SchemaRDD that captures the schema for the whole 
> data set.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[issue25768] compileall functions do not document or test return values

2015-12-09 Thread Nicholas Chammas

Nicholas Chammas added the comment:

I've added the tests as we discussed. A couple of comments:

* I found it difficult to reuse the existing setUp() code so had to essentially 
repeat a bunch of very similar code to create "bad" files. Let me know if you 
think there is a better way to do this.
* I'm having trouble with the test for compile_path(). Specifically, it doesn't 
seem to actually use the value for skip_curdir. Do you understand why?

--
Added file: http://bugs.python.org/file41277/compileall.patch

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25768>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



Re: Fastest way to build Spark from scratch

2015-12-08 Thread Nicholas Chammas
Interesting. As long as Spark's dependencies don't change that often, the
same caches could save "from scratch" build time over many months of Spark
development. Is that right?

On Tue, Dec 8, 2015 at 12:33 PM Josh Rosen <joshro...@databricks.com> wrote:

> @Nick, on a fresh EC2 instance a significant chunk of the initial build
> time might be due to artifact resolution + downloading. Putting
> pre-populated Ivy and Maven caches onto your EC2 machine could shave a
> decent chunk of time off that first build.
>
> On Tue, Dec 8, 2015 at 9:16 AM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Thanks for the tips, Jakob and Steve.
>>
>> It looks like my original approach is the best for me since I'm
>> installing Spark on newly launched EC2 instances and can't take advantage
>> of incremental compilation.
>>
>> Nick
>>
>> On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran <ste...@hortonworks.com>
>> wrote:
>>
>>> On 7 Dec 2015, at 19:07, Jakob Odersky <joder...@gmail.com> wrote:
>>>
>>> make-distribution and the second code snippet both create a distribution
>>> from a clean state. They therefore require that every source file be
>>> compiled and that takes time (you can maybe tweak some settings or use a
>>> newer compiler to gain some speed).
>>>
>>> I'm inferring from your question that for your use-case deployment speed
>>> is a critical issue, furthermore you'd like to build Spark for lots of
>>> (every?) commit in a systematic way. In that case I would suggest you try
>>> using the second code snippet without the `clean` task and only resort to
>>> it if the build fails.
>>>
>>> On my local machine, an assembly without a clean drops from 6 minutes to
>>> 2.
>>>
>>> regards,
>>> --Jakob
>>>
>>>
>>> 1. you can use zinc -where possible- to speed up scala compilations
>>> 2. you might also consider setting up a local jenkins VM, hooked to
>>> whatever git repo & branch you are working off, and have it do the builds
>>> and tests for you. Not so great for interactive dev,
>>>
>>> finally, on the mac, the "say" command is pretty handy at letting you
>>> know when some work in a terminal is ready, so you can do the
>>> first-thing-in-the morning build-of-the-SNAPSHOTS
>>>
>>> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>>>
>>> After that you can work on the modules you care about (via the -pl)
>>> option). That doesn't work if you are running on an EC2 instance though
>>>
>>>
>>>
>>>
>>> On 23 November 2015 at 20:18, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
>>>> Say I want to build a complete Spark distribution against Hadoop 2.6+
>>>> as fast as possible from scratch.
>>>>
>>>> This is what I’m doing at the moment:
>>>>
>>>> ./make-distribution.sh -T 1C -Phadoop-2.6
>>>>
>>>> -T 1C instructs Maven to spin up 1 thread per available core. This
>>>> takes around 20 minutes on an m3.large instance.
>>>>
>>>> I see that spark-ec2, on the other hand, builds Spark as follows
>>>> <https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/spark/init.sh#L21-L22>
>>>> when you deploy Spark at a specific git commit:
>>>>
>>>> sbt/sbt clean assembly
>>>> sbt/sbt publish-local
>>>>
>>>> This seems slower than using make-distribution.sh, actually.
>>>>
>>>> Is there a faster way to do this?
>>>>
>>>> Nick
>>>> ​
>>>>
>>>
>>>
>>>
>


Re: Fastest way to build Spark from scratch

2015-12-08 Thread Nicholas Chammas
Thanks for the tips, Jakob and Steve.

It looks like my original approach is the best for me since I'm installing
Spark on newly launched EC2 instances and can't take advantage of
incremental compilation.

Nick

On Tue, Dec 8, 2015 at 7:01 AM Steve Loughran <ste...@hortonworks.com>
wrote:

> On 7 Dec 2015, at 19:07, Jakob Odersky <joder...@gmail.com> wrote:
>
> make-distribution and the second code snippet both create a distribution
> from a clean state. They therefore require that every source file be
> compiled and that takes time (you can maybe tweak some settings or use a
> newer compiler to gain some speed).
>
> I'm inferring from your question that for your use-case deployment speed
> is a critical issue, furthermore you'd like to build Spark for lots of
> (every?) commit in a systematic way. In that case I would suggest you try
> using the second code snippet without the `clean` task and only resort to
> it if the build fails.
>
> On my local machine, an assembly without a clean drops from 6 minutes to 2.
>
> regards,
> --Jakob
>
>
> 1. you can use zinc -where possible- to speed up scala compilations
> 2. you might also consider setting up a local jenkins VM, hooked to
> whatever git repo & branch you are working off, and have it do the builds
> and tests for you. Not so great for interactive dev,
>
> finally, on the mac, the "say" command is pretty handy at letting you know
> when some work in a terminal is ready, so you can do the first-thing-in-the
> morning build-of-the-SNAPSHOTS
>
> mvn install -DskipTests -Pyarn,hadoop-2.6 -Dhadoop.version=2.7.1; say moo
>
> After that you can work on the modules you care about (via the -pl)
> option). That doesn't work if you are running on an EC2 instance though
>
>
>
>
> On 23 November 2015 at 20:18, Nicholas Chammas <nicholas.cham...@gmail.com
> > wrote:
>
>> Say I want to build a complete Spark distribution against Hadoop 2.6+ as
>> fast as possible from scratch.
>>
>> This is what I’m doing at the moment:
>>
>> ./make-distribution.sh -T 1C -Phadoop-2.6
>>
>> -T 1C instructs Maven to spin up 1 thread per available core. This takes
>> around 20 minutes on an m3.large instance.
>>
>> I see that spark-ec2, on the other hand, builds Spark as follows
>> <https://github.com/amplab/spark-ec2/blob/a990752575cd8b0ab25731d7820a55c714798ec3/spark/init.sh#L21-L22>
>> when you deploy Spark at a specific git commit:
>>
>> sbt/sbt clean assembly
>> sbt/sbt publish-local
>>
>> This seems slower than using make-distribution.sh, actually.
>>
>> Is there a faster way to do this?
>>
>> Nick
>> ​
>>
>
>
>


[issue24931] _asdict breaks when inheriting from a namedtuple

2015-12-08 Thread Nicholas Chammas

Nicholas Chammas added the comment:

I know. I came across this issue after upgrading to the 3.5.1 release and 
seeing that vars(namedtuple) didn't work anymore.
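
For concreteness, the change boils down to this tiny example (standalone, 
nothing project-specific):

from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])
p = Point(1, 2)

print(p._asdict())  # supported everywhere; an OrderedDict on 3.5
print(vars(p))      # worked on 3.5.0; raises TypeError on 3.5.1, because the
                    # __dict__ property was removed from namedtuple instances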

I looked through the changelog [0] for an explanation of why that might be and 
couldn't find one, so I posted that question on Stack Overflow.

I'm guessing others will go through the same flow after they upgrade to 3.5.1 
and wonder why their vars(namedtuple) code broke, so I posted here asking if we 
should amend the changelog to call this change out.

But I gather from your comment that the changelog cannot be updated after the 
release, so I guess there is nothing to do here. (Sorry about the distraction. 
I'm new to the Python dev community.)

[0] https://docs.python.org/3.5/whatsnew/changelog.html#python-3-5-1-final

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24931>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24931] _asdict breaks when inheriting from a namedtuple

2015-12-08 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Should this change be called out in the 3.5.1 release docs? It makes some code 
that works on 3.5.0 break in 3.5.1.

See: http://stackoverflow.com/q/34166469/877069

--
nosy: +Nicholas Chammas

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue24931>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue25768] compileall functions do not document or test return values

2015-12-05 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Absolutely. I'll add a "bad source file" to `setUp()` [0] and check return 
values as part of the existing checks in `test_compile_files()` [1].

Does that sound like a good plan to you?

Also, I noticed that `compile_path()` has no tests. Should I test it as part of 
`test_compile_files()` or as part of a different test function?

[0] https://hg.python.org/cpython/file/tip/Lib/test/test_compileall.py#l14
[1] https://hg.python.org/cpython/file/tip/Lib/test/test_compileall.py#l57

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue25768>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



Re: Not all workers seem to run in a standalone cluster setup by spark-ec2 script

2015-12-04 Thread Nicholas Chammas
Quick question: Are you processing gzipped files by any chance? It's a
common stumbling block people hit.

See: http://stackoverflow.com/q/27531816/877069
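
The short version, as a sketch (the input path is a hypothetical example): a 
gzip file is not splittable, so a single .gz input becomes a single partition 
and only one worker gets any tasks. Repartitioning right after the read is the 
usual fix:

from pyspark import SparkContext

sc = SparkContext(appName="gzip-partitions")

lines = sc.textFile("s3n://some-bucket/big-log.gz")  # hypothetical input
print(lines.getNumPartitions())   # 1, i.e. a single task and a single busy worker

spread = lines.repartition(sc.defaultParallelism * 4)
print(spread.getNumPartitions())  # downstream stages can now use the whole cluster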

Nick

On Fri, Dec 4, 2015 at 2:28 PM Kyohey Hamaguchi 
wrote:

> Hi,
>
> I have setup a Spark standalone-cluster, which involves 5 workers,
> using spark-ec2 script.
>
> After submitting my Spark application, I had noticed that just one
> worker seemed to run the application and other 4 workers were doing
> nothing. I had confirmed this by checking CPU and memory usage on the
> Spark Web UI (CPU usage indicates zero and memory is almost fully
> availabile.)
>
> This is the command used to launch:
>
> $ ~/spark/ec2/spark-ec2 -k awesome-keypair-name -i
> /path/to/.ssh/awesome-private-key.pem --region ap-northeast-1
> --zone=ap-northeast-1a --slaves 5 --instance-type m1.large
> --hadoop-major-version yarn launch awesome-spark-cluster
>
> And the command to run application:
>
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "mkdir ~/awesome"
> $ scp -i ~/path/to/awesome-private-key.pem spark.jar
> root@ec2-master-host-name:~/awesome && ssh -i
> ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark-ec2/copy-dir ~/awesome"
> $ ssh -i ~/path/to/awesome-private-key.pem root@ec2-master-host-name
> "~/spark/bin/spark-submit --num-executors 5 --executor-cores 2
> --executor-memory 5G --total-executor-cores 10 --driver-cores 2
> --driver-memory 5G --class com.example.SparkIsAwesome
> awesome/spark.jar"
>
> How do I let the all of the workers execute the app?
>
> Or do I have wrong understanding on what workers, slaves and executors are?
>
> My understanding is: Spark driver(or maybe master?) sends a part of
> jobs to each worker (== executor == slave), so a Spark cluster
> automatically exploits all resources available in the cluster. Is this
> some sort of misconception?
>
> Thanks,
>
> --
> Kyohey Hamaguchi
> TEL:  080-6918-1708
> Mail: tnzk.ma...@gmail.com
> Blog: http://blog.tnzk.org/
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: [VOTE] Release Apache Spark 1.6.0 (RC1)

2015-12-02 Thread Nicholas Chammas
-0

If spark-ec2 is still a supported part of the project, then we should
update its version lists as new releases are made. 1.5.2 had the same issue.

https://github.com/apache/spark/blob/v1.6.0-rc1/ec2/spark_ec2.py#L54-L91
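
For anyone not familiar with that file, the gist is a hard-coded version map
that has to be bumped by hand for every release. An illustrative sketch (not
the actual spark_ec2.py code):

# Illustrative only, not the real spark_ec2.py contents.
VALID_SPARK_VERSIONS = {
    "1.5.0",
    "1.5.1",
    "1.5.2",
    # 1.6.0 has to be added here (and to the related Hadoop map) for each
    # release; omissions like this are what my -0 vote is about.
}

def validate_spark_version(version):
    if version not in VALID_SPARK_VERSIONS:
        raise SystemExit("Unsupported Spark version: %s" % version)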

(I guess as part of the 2.0 discussions we should continue to discuss
whether spark-ec2 still belongs in the project. I'm starting to feel
awkward reporting spark-ec2 release issues...)

Nick

On Wed, Dec 2, 2015 at 3:27 PM Michael Armbrust 
wrote:

> Please vote on releasing the following candidate as Apache Spark version
> 1.6.0!
>
> The vote is open until Saturday, December 5, 2015 at 21:00 UTC and passes
> if a majority of at least 3 +1 PMC votes are cast.
>
> [ ] +1 Release this package as Apache Spark 1.6.0
> [ ] -1 Do not release this package because ...
>
> To learn more about Apache Spark, please see http://spark.apache.org/
>
> The tag to be voted on is *v1.6.0-rc1
> (bf525845cef159d2d4c9f4d64e158f037179b5c4)
> *
>
> The release files, including signatures, digests, etc. can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-v1.6.0-rc1-bin/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1165/
>
> The test repository (versioned as v1.6.0-rc1) for this release can be
> found at:
> https://repository.apache.org/content/repositories/orgapachespark-1164/
>
> The documentation corresponding to this release can be found at:
> http://people.apache.org/~pwendell/spark-releases/spark-1.6.0-rc1-docs/
>
>
> ===
> == How can I help test this release? ==
> ===
> If you are a Spark user, you can help us test this release by taking an
> existing Spark workload and running on this release candidate, then
> reporting any regressions.
>
> 
> == What justifies a -1 vote for this release? ==
> 
> This vote is happening towards the end of the 1.6 QA period, so -1 votes
> should only occur for significant regressions from 1.5. Bugs already
> present in 1.5, minor regressions, or bugs related to new features will not
> block this release.
>
> ===
> == What should happen to JIRA tickets still targeting 1.6.0? ==
> ===
> 1. It is OK for documentation patches to target 1.6.0 and still go into
> branch-1.6, since documentations will be published separately from the
> release.
> 2. New features for non-alpha-modules should target 1.7+.
> 3. Non-blocker bug fixes should target 1.6.1 or 1.7.0, or drop the target
> version.
>
>
> ==
> == Major changes to help you focus your testing ==
> ==
>
> Spark SQL
>
>- SPARK-10810 
>Session Management - The ability to create multiple isolated SQL
>Contexts that have their own configuration and default database.  This is
>turned on by default in the thrift server.
>- SPARK-   Dataset
>API - A type-safe API (similar to RDDs) that performs many operations
>on serialized binary data and code generation (i.e. Project Tungsten).
>- SPARK-1  Unified
>Memory Management - Shared memory for execution and caching instead of
>exclusive division of the regions.
>- SPARK-11197  SQL
>Queries on Files - Concise syntax for running SQL queries over files
>of any supported format without registering a table.
>- SPARK-11745  Reading
>non-standard JSON files - Added options to read non-standard JSON
>files (e.g. single-quotes, unquoted attributes)
>- SPARK-10412  
> Per-operator
>Metrics for SQL Execution - Display statistics on a per-operator basis
>for memory usage and spilled data size.
>- SPARK-11329  Star
>(*) expansion for StructTypes - Makes it easier to nest and unnest
>arbitrary numbers of columns
>- SPARK-10917 ,
>SPARK-11149  In-memory
>Columnar Cache Performance - Significant (up to 14x) speed up when
>caching data that contains complex types in DataFrames or SQL.
>- SPARK-1  Fast
>null-safe joins 

[jira] [Created] (SPARK-12107) Update spark-ec2 versions

2015-12-02 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-12107:


 Summary: Update spark-ec2 versions
 Key: SPARK-12107
 URL: https://issues.apache.org/jira/browse/SPARK-12107
 Project: Spark
  Issue Type: Bug
  Components: EC2
Affects Versions: 1.6.0
Reporter: Nicholas Chammas
Priority: Minor


spark-ec2's version strings are out-of-date. The latest versions of Spark need 
to be reflected in its internal version maps.






[issue25768] compileall functions do not document return values

2015-12-01 Thread Nicholas Chammas

Nicholas Chammas added the comment:

OK, here's a patch.

I reviewed the doc style guide [0] but I'm not 100% sure if I'm using the 
appropriate tense. There are also a couple of lines that go a bit over 80 
characters, but the file already had a few of those.

Am happy to make any adjustments, if necessary.

[0] https://docs.python.org/devguide/documenting.html#style-guide

--
keywords: +patch
Added file: http://bugs.python.org/file41201/compileall-doc.patch




[issue25768] compileall functions do not document return values

2015-12-01 Thread Nicholas Chammas

Nicholas Chammas added the comment:

And I just signed the contributor agreement. (Some banner showed up when I 
attached the patch to this issue asking me to do so.)

--




[issue25768] compileall functions do not document return values

2015-12-01 Thread Nicholas Chammas

Nicholas Chammas added the comment:

:thumbsup: Take your time.

--




[issue25775] Bug tracker emails go to spam

2015-12-01 Thread Nicholas Chammas

New submission from Nicholas Chammas:

Not sure where to report this. Is there a component for the bug tracker itself?

Anyway, Gmail sends emails from this bug tracker to spam and flags each one 
with the following message:

> Why is this message in Spam? It is in violation of Google's recommended email 
> sender guidelines.  Learn more
> https://support.google.com/mail/answer/81126?hl=en#authentication

Is this actionable? Is this a known issue?

--
messages: 255676
nosy: Nicholas Chammas
priority: normal
severity: normal
status: open
title: Bug tracker emails go to spam
type: behavior




[issue25768] compileall functions do not document return values

2015-12-01 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Oh derp. It appears this is a dup of issue24386. Apologies.

--
status: open -> closed




[issue25768] compileall functions do not document return values

2015-12-01 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Whoops, wrong issue. Reopening.

--
status: closed -> open




[issue25775] Bug tracker emails go to spam

2015-12-01 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Oh derp. It appears this is a dup of issue24386. Apologies.

--
status: open -> closed




[issue25768] compileall functions do not document return values

2015-12-01 Thread Nicholas Chammas

Nicholas Chammas added the comment:

Exciting! I'm on it.

--




[issue25768] compileall functions do not document return values

2015-11-29 Thread Nicholas Chammas

New submission from Nicholas Chammas:

I'm using the public functions of Python's built-in compileall module.

https://docs.python.org/3/library/compileall.html#public-functions

There doesn't appear to be documentation of what each of these functions 
returns.

I figured out, for example, that compileall.compile_file() returns 1 when the 
file compiles successfully, and 0 if not.

If this is "official" behavior, it would be good to see it documented so that 
we can rely on it.
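
For example (with made-up file names):

import compileall

print(compileall.compile_file('good_module.py'))       # prints 1 if it compiles
print(compileall.compile_file('has_syntax_error.py'))  # prints 0 on failure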

I'd be happy to submit a patch to fix this if a committer is willing to 
shepherd a new contributor (me) through the process. Otherwise, this is 
probably a quick fix for experienced contributors.

--
assignee: docs@python
components: Documentation
messages: 255600
nosy: Nicholas Chammas, docs@python
priority: normal
severity: normal
status: open
title: compileall functions do not document return values
type: behavior
versions: Python 3.5




Re: Adding more slaves to a running cluster

2015-11-25 Thread Nicholas Chammas
spark-ec2 does not directly support adding instances to an existing
cluster, apart from the special case of adding slaves to a cluster with a
master but no slaves. There is an open issue to track adding this support,
SPARK-2008 , but it
doesn't have any momentum at the moment.

Your best bet currently is to do what you did and hack your way through
using spark-ec2's various scripts.

You probably already know this, but to be clear, note that Spark itself
supports adding slaves to a running cluster. It's just that spark-ec2
hasn't implemented a feature to do this work for you.
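
For reference, the manual route boils down to starting a worker on each new
instance and pointing it at the existing master. A rough sketch, with a
hypothetical install path and master URL:

import subprocess

MASTER_URL = "spark://ec2-master-host-name:7077"  # substitute your master host:port
SPARK_HOME = "/root/spark"                        # wherever Spark lives on the new instance

# Start a worker process and register it with the running master.
subprocess.check_call([SPARK_HOME + "/sbin/start-slave.sh", MASTER_URL])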

Nick

On Wed, Nov 25, 2015 at 2:27 PM Dillian Murphey 
wrote:

> It appears start-slave.sh works on a running cluster.  I'm surprised I
> can't find more info on this. Maybe I'm not looking hard enough?
>
> Using AWS and spot instances is much more cost-efficient, which creates a
> real need for dynamically adding more nodes while the cluster is up, yet
> everything I've found so far seems to indicate this isn't supported yet.
>
> But yet here I am with 1.5 and it at least appears to be working. Am I
> missing something?
>
> On Tue, Nov 24, 2015 at 4:40 PM, Dillian Murphey 
> wrote:
>
>> What's the current status on adding slaves to a running cluster?  I want
>> to leverage spark-ec2 and autoscaling groups.  I want to launch slaves as
>> spot instances when I need to do some heavy lifting, but I don't want to
>> bring down my cluster in order to add nodes.
>>
>> Can this be done by just running start-slave.sh??
>>
>> What about using Mesos?
>>
>> I just want to create an AMI for a slave and on some trigger launch it
>> and have it automatically add itself to the cluster.
>>
>> thanks
>>
>
>


[jira] [Comment Edited] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2015-11-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022735#comment-15022735
 ] 

Nicholas Chammas edited comment on SPARK- at 11/23/15 8:06 PM:
---

[~sandyr] - Hmm, so are you saying that, generally speaking, Datasets will 
provide no performance advantages over DataFrames, and that they will just help 
in terms of catching type errors early?

{quote}
Python and R are dynamically typed so can't take advantage of these.
{quote}

I can't speak for R, but Python has supported type hints since 3.0. More 
recently, Python 3.5 introduced a [typing 
module|https://docs.python.org/3/library/typing.html#module-typing] to 
standardize how type hints are specified, which facilitates the use of static 
type checkers like [mypy|http://mypy-lang.org/]. PySpark could definitely offer 
a statically type checked API, but practically speaking it would have to be 
limited to Python 3+.

I suppose people don't generally expect static type checking when they use 
Python, so perhaps it makes sense not to support Datasets in PySpark.


was (Author: nchammas):
[~sandyr] - Hmm, so are you saying that, generally speaking, Datasets will 
provide no performance advantages over DataFrames, and that they will just help 
in terms of catching type errors early?

{quote}
Python and R are dynamically typed so can't take advantage of these.
{quote}

I can't speak for R, but Python as supported type hints since 3.0. More 
recently, Python 3.5 introduced a [typing 
module|https://docs.python.org/3/library/typing.html#module-typing] to 
standardize how type hints are specified, which facilitates the use of static 
type checkers like [mypy|http://mypy-lang.org/]. PySpark could definitely offer 
a statically type checked API, but practically speaking it would have to be 
limited to Python 3+.

I suppose people don't generally expect static type checking when they use 
Python, so perhaps it makes sense not to support Datasets in PySpark.

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcL

Fastest way to build Spark from scratch

2015-11-23 Thread Nicholas Chammas
Say I want to build a complete Spark distribution against Hadoop 2.6+ as
fast as possible from scratch.

This is what I’m doing at the moment:

./make-distribution.sh -T 1C -Phadoop-2.6

-T 1C instructs Maven to spin up 1 thread per available core. This takes
around 20 minutes on an m3.large instance.

I see that spark-ec2, on the other hand, builds Spark as follows when you
deploy Spark at a specific git commit:

sbt/sbt clean assembly
sbt/sbt publish-local

This seems slower than using make-distribution.sh, actually.

Is there a faster way to do this?

Nick
​


Re: spark-ec2 script to launch cluster running Spark 1.5.2 built with HIVE?

2015-11-23 Thread Nicholas Chammas
Don't the Hadoop builds include Hive already? Like
spark-1.5.2-bin-hadoop2.6.tgz?

On Mon, Nov 23, 2015 at 7:49 PM Jeff Schecter  wrote:

> Hi all,
>
> As far as I can tell, the bundled spark-ec2 script provides no way to
> launch a cluster running Spark 1.5.2 pre-built with HIVE.
>
> That is to say, all of the pre-built versions of Spark 1.5.2 in the S3 bucket
> spark-related-packages are missing HIVE.
>
> aws s3 ls s3://spark-related-packages/ | grep 1.5.2
>
>
> Am I missing something here? I'd rather avoid resorting to whipping up
> hacky patching scripts that might break with the next Spark point release
> if at all possible.
>


[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2015-11-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022735#comment-15022735
 ] 

Nicholas Chammas commented on SPARK-:
-

[~sandyr] - Hmm, so are you saying that, generally speaking, Datasets will 
provide no performance advantages over DataFrames, and that they will just help 
in terms of catching type errors early?

{quote}
Python and R are dynamically typed so can't take advantage of these.
{quote}

I can't speak for R, but Python has supported type hints since 3.0. More 
recently, Python 3.5 introduced a [typing 
module|https://docs.python.org/3/library/typing.html#module-typing] to 
standardize how type hints are specified, which facilitates the use of static 
type checkers like [mypy|http://mypy-lang.org/]. PySpark could definitely offer 
a statically type checked API, but practically speaking it would have to be 
limited to Python 3+.

I suppose people don't generally expect static type checking when they use 
Python, so perhaps it makes sense not to support Datasets in PySpark.
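
As a rough sketch of what I mean (my own illustration, not a proposed PySpark 
API), plain PEP 484 annotations plus the typing module are enough for a checker 
like [mypy|http://mypy-lang.org/] to catch type errors before anything runs:

{code}
from typing import Callable, Iterable, TypeVar

T = TypeVar('T')
U = TypeVar('U')

def transform(f: Callable[[T], U], xs: Iterable[T]) -> Iterable[U]:
    return (f(x) for x in xs)

lengths = transform(len, ["a", "bb", "ccc"])  # fine
broken = transform(len, [1, 2, 3])            # mypy reports an error here
{code}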

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.






[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2015-11-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022957#comment-15022957
 ] 

Nicholas Chammas commented on SPARK-:
-

If you are referring to my comment, note that I am asking about Dataset vs. 
DataFrame, not Dataset vs. RDD.

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.






[jira] [Commented] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-21 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020729#comment-15020729
 ] 

Nicholas Chammas commented on SPARK-11903:
--

Also, we could just leave the option in there and add a warning to let the user 
know that it isn't necessary. That would maintain compatibility while 
communicating the deprecation to the user, if this option is indeed deprecated.

> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than that one.
> If this option is not needed, we should deprecate and eventually remove it.






[jira] [Created] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-21 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-11903:


 Summary: Deprecate make-distribution.sh --skip-java-test
 Key: SPARK-11903
 URL: https://issues.apache.org/jira/browse/SPARK-11903
 Project: Spark
  Issue Type: Improvement
  Components: Build
Reporter: Nicholas Chammas
Priority: Minor


The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not appear 
to be 
used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
 Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
than that one.

If this option is not needed, we should deprecate and eventually remove it.






[jira] [Commented] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-21 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020725#comment-15020725
 ] 

Nicholas Chammas commented on SPARK-11903:
--

cc [~pwendell] and [~srowen] - Y'all probably know best about this. I can open 
a PR if appropriate. Just let me know what the appropriate course of action is 
here.

> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than that one.
> If this option is not needed, we should deprecate and eventually remove it.






[jira] [Commented] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-21 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15020728#comment-15020728
 ] 

Nicholas Chammas commented on SPARK-11903:
--

Oh, could you elaborate a bit? From what I understood of 
{{make-distribution.sh}}, [tests are always 
skipped|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L170].

Do you sometimes want to run {{make-distribution.sh}} with tests?

> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than that one.
> If this option is not needed, we should deprecate and eventually remove it.






[jira] [Updated] (SPARK-11903) Deprecate make-distribution.sh --skip-java-test

2015-11-21 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11903?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-11903:
-
Description: 
The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not appear 
to be 
used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73],
 and tests are [always 
skipped|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L170].
 Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
than [this 
one|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].

If this option is not needed, we should deprecate and eventually remove it.

  was:
The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not appear 
to be 
used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
 Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
than that one.

If this option is not needed, we should deprecate and eventually remove it.


> Deprecate make-distribution.sh --skip-java-test
> ---
>
> Key: SPARK-11903
> URL: https://issues.apache.org/jira/browse/SPARK-11903
> Project: Spark
>  Issue Type: Improvement
>  Components: Build
>        Reporter: Nicholas Chammas
>Priority: Minor
>
> The {{\-\-skip-java-test}} option to {{make-distribution.sh}} [does not 
> appear to be 
> used|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73],
>  and tests are [always 
> skipped|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L170].
>  Searching the Spark codebase for {{SKIP_JAVA_TEST}} yields no results other 
> than [this 
> one|https://github.com/apache/spark/blob/835a79d78ee879a3c36dde85e5b3591243bf3957/make-distribution.sh#L72-L73].
> If this option is not needed, we should deprecate and eventually remove it.






[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2015-11-20 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15019214#comment-15019214
 ] 

Nicholas Chammas commented on SPARK-:
-

Arriving a little late to this discussion. Quick question for Reynold/Michael:

Will Python (and R) get this API in time for 1.6, or is that planned for a 
later release? Once the Scala API is ready, I'm guessing that the Python 
version will mostly be a lightweight wrapper around that API.

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-
> URL: https://issues.apache.org/jira/browse/SPARK-
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, and as a result harder to optimize its 
> execution in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.






[jira] [Commented] (SPARK-11744) bin/pyspark --version doesn't return version and exit

2015-11-14 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005572#comment-15005572
 ] 

Nicholas Chammas commented on SPARK-11744:
--

Not sure who would be the best person to comment on this. Perhaps [~vanzin], 
since this is part of the launcher?

> bin/pyspark --version doesn't return version and exit
> -
>
> Key: SPARK-11744
> URL: https://issues.apache.org/jira/browse/SPARK-11744
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>    Reporter: Nicholas Chammas
>Priority: Minor
>
> {{bin/pyspark \-\-help}} offers a {{\-\-version}} option:
> {code}
> $ ./spark/bin/pyspark --help
> Usage: ./bin/pyspark [options]
> Options:
> ...
>   --version,  Print the version of current Spark
> ...
> {code}
> However, trying to get the version in this way doesn't yield the expected 
> results.
> Instead of printing the version and exiting, we get the version, a stack 
> trace, and then get dropped into a plain Python shell ({{sc}} is not defined).
> {code}
> $ ./spark/bin/pyspark --version
> Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
> Type "help", "copyright", "credits" or "license" for more information.
> Welcome to
>     __
>  / __/__  ___ _/ /__
> _\ \/ _ \/ _ `/ __/  '_/
>/___/ .__/\_,_/_/ /_/\_\   version 1.5.2
>   /_/
> 
> Type --help for more information.
> Traceback (most recent call last):
>   File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in 
> sc = SparkContext(pyFiles=add_files)
>   File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__
> SparkContext._ensure_initialized(self, gateway=gateway)
>   File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in 
> _ensure_initialized
> SparkContext._gateway = gateway or launch_gateway()
>   File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in 
> launch_gateway
> raise Exception("Java gateway process exited before sending the driver 
> its port number")
> Exception: Java gateway process exited before sending the driver its port 
> number
> >>> 
> >>> sc
> Traceback (most recent call last):
>   File "", line 1, in 
> NameError: name 'sc' is not defined
> {code}






[jira] [Updated] (SPARK-11744) bin/pyspark --version doesn't return version and exit

2015-11-14 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-11744:
-
Description: 
{{bin/pyspark \-\-help}} offers a {{\-\-version}} option:

{code}
$ ./spark/bin/pyspark --help
Usage: ./bin/pyspark [options]

Options:
...
  --version,  Print the version of current Spark
...
{code}

However, trying to get the version in this way doesn't yield the expected 
results.

Instead of printing the version and exiting, we get the version, a stack trace, 
and then get dropped into a plain Python shell ({{sc}} is not defined).

{code}
$ ./spark/bin/pyspark --version
Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
  /_/

Type --help for more information.
Traceback (most recent call last):
  File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in 
sc = SparkContext(pyFiles=add_files)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in 
_ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in 
launch_gateway
raise Exception("Java gateway process exited before sending the driver its 
port number")
Exception: Java gateway process exited before sending the driver its port number
>>> 
>>> sc
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'sc' is not defined
{code}

  was:
{{bin/pyspark --help}} offers a {{--version}} option:

{code}
$ ./spark/bin/pyspark --help
Usage: ./bin/pyspark [options]

Options:
...
  --version,  Print the version of current Spark
...
{code}

However, trying to get the version in this way doesn't yield the expected 
results.

Instead of printing the version and exiting, we get the version, a stack trace, 
and then get dropped into a plain Python shell ({{sc}} is not defined).

{code}
$ ./spark/bin/pyspark --version
Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
  /_/

Type --help for more information.
Traceback (most recent call last):
  File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in 
sc = SparkContext(pyFiles=add_files)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in 
_ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in 
launch_gateway
raise Exception("Java gateway process exited before sending the driver its 
port number")
Exception: Java gateway process exited before sending the driver its port number
>>> 
>>> sc
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'sc' is not defined
{code}


> bin/pyspark --version doesn't return version and exit
> -
>
>         Key: SPARK-11744
> URL: https://issues.apache.org/jira/browse/SPARK-11744
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {{bin/pyspark \-\-help}} offers a {{\-\-version}} option:
> {code}
> $ ./spark/bin/pyspark --help
> Usage: ./bin/pyspark [options]
> Options:
> ...
>   --version,  Print the version of current Spark
> ...
> {code}
> However, trying to get the version in this way doesn't yield the expected 
> results.
> Instead of printing the version and exiting, we get the version, a stack 
> trace, and then get dropped into a plain Python shell ({{sc}} is not defined).
> {code}
> $ ./spark/bin/pyspark --version
> Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
> Type "help", "copyright"

[jira] [Created] (SPARK-11744) bin/pyspark --version doesn't return version and exit

2015-11-14 Thread Nicholas Chammas (JIRA)
Nicholas Chammas created SPARK-11744:


 Summary: bin/pyspark --version doesn't return version and exit
 Key: SPARK-11744
 URL: https://issues.apache.org/jira/browse/SPARK-11744
 Project: Spark
  Issue Type: Bug
  Components: PySpark
Affects Versions: 1.5.2
Reporter: Nicholas Chammas
Priority: Minor


{{bin/pyspark --help}} offers a {{--version}} option:

{code}
$ ./spark/bin/pyspark --help
Usage: ./bin/pyspark [options]

Options:
...
  --version,  Print the version of current Spark
...
{code}

However, trying to get the version in this way doesn't yield the expected 
results.

Instead of printing the version and exiting, we get the version, a stack trace, 
and then get dropped into a plain Python shell ({{sc}} is not defined).

{code}
$ ./spark/bin/pyspark --version
Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
  /_/

Type --help for more information.
Traceback (most recent call last):
  File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in 
sc = SparkContext(pyFiles=add_files)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in 
_ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in 
launch_gateway
raise Exception("Java gateway process exited before sending the driver its 
port number")
Exception: Java gateway process exited before sending the driver its port number
>>> 
>>> sc
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'sc' is not defined
{code}






[jira] [Updated] (SPARK-11744) bin/pyspark --version doesn't return version and exit

2015-11-14 Thread Nicholas Chammas (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11744?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nicholas Chammas updated SPARK-11744:
-
Description: 
{{bin/pyspark \-\-help}} offers a {{\-\-version}} option:

{code}
$ ./spark/bin/pyspark --help
Usage: ./bin/pyspark [options]

Options:
...
  --version,  Print the version of current Spark
...
{code}

However, trying to get the version in this way doesn't yield the expected 
results.

Instead of printing the version and exiting, we get the version, a stack trace, 
and then get dropped into a broken PySpark shell.

{code}
$ ./spark/bin/pyspark --version
Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
  /_/

Type --help for more information.
Traceback (most recent call last):
  File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in 
sc = SparkContext(pyFiles=add_files)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in 
_ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in 
launch_gateway
raise Exception("Java gateway process exited before sending the driver its 
port number")
Exception: Java gateway process exited before sending the driver its port number
>>> 
>>> sc
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'sc' is not defined
{code}

  was:
{{bin/pyspark \-\-help}} offers a {{\-\-version}} option:

{code}
$ ./spark/bin/pyspark --help
Usage: ./bin/pyspark [options]

Options:
...
  --version,  Print the version of current Spark
...
{code}

However, trying to get the version in this way doesn't yield the expected 
results.

Instead of printing the version and exiting, we get the version, a stack trace, 
and then get dropped into a plain Python shell ({{sc}} is not defined).

{code}
$ ./spark/bin/pyspark --version
Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
[GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
    __
 / __/__  ___ _/ /__
_\ \/ _ \/ _ `/ __/  '_/
   /___/ .__/\_,_/_/ /_/\_\   version 1.5.2
  /_/

Type --help for more information.
Traceback (most recent call last):
  File "/home/ec2-user/spark/python/pyspark/shell.py", line 43, in 
sc = SparkContext(pyFiles=add_files)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 110, in __init__
SparkContext._ensure_initialized(self, gateway=gateway)
  File "/home/ec2-user/spark/python/pyspark/context.py", line 234, in 
_ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
  File "/home/ec2-user/spark/python/pyspark/java_gateway.py", line 94, in 
launch_gateway
raise Exception("Java gateway process exited before sending the driver its 
port number")
Exception: Java gateway process exited before sending the driver its port number
>>> 
>>> sc
Traceback (most recent call last):
  File "", line 1, in 
NameError: name 'sc' is not defined
{code}


> bin/pyspark --version doesn't return version and exit
> -
>
>         Key: SPARK-11744
> URL: https://issues.apache.org/jira/browse/SPARK-11744
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark
>Affects Versions: 1.5.2
>Reporter: Nicholas Chammas
>Priority: Minor
>
> {{bin/pyspark \-\-help}} offers a {{\-\-version}} option:
> {code}
> $ ./spark/bin/pyspark --help
> Usage: ./bin/pyspark [options]
> Options:
> ...
>   --version,  Print the version of current Spark
> ...
> {code}
> However, trying to get the version in this way doesn't yield the expected 
> results.
> Instead of printing the version and exiting, we get the version, a stack 
> trace, and then get dropped into a broken PySpark shell.
> {code}
> $ ./spark/bin/pyspark --version
> Python 2.7.10 (default, Aug 11 2015, 23:39:10) 
> [GCC 4.8.3 20140911 (Red Hat 4.8.3-9)] on linux2
> Type "help", "copyright", "credits" or "license&qu
