driver fail-over in Spark streaming 1.2.0

2015-02-11 Thread lin
Hi, all

In Spark Streaming 1.2.0, when the driver fails and a new driver starts
from the most recently checkpointed data, will the former executors
reconnect to the new driver, or will the new driver start its own set of
new executors? In which part of the code is that handled?

Any reply will be appreciated :)

regards,

lin
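
For context, a minimal sketch of the driver-recovery pattern this question is
about, assuming the standard StreamingContext.getOrCreate API and placeholder
paths; it shows how a restarted driver picks up the checkpointed DStream graph,
not where executor re-registration happens in the code.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object CheckpointedDriver {
  val checkpointDir = "hdfs:///tmp/streaming-checkpoint"  // placeholder path

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("checkpointed-app")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)
    // A trivial DStream graph so the context has at least one output operation.
    ssc.textFileStream("/tmp/streaming-input").count().print()
    ssc
  }

  def main(args: Array[String]): Unit = {
    // On a restart, getOrCreate rebuilds the context (and its DStream graph)
    // from the checkpoint instead of calling createContext again.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}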


Re: Re: Sort Shuffle performance issues about using AppendOnlyMap for large data sets

2015-02-11 Thread fightf...@163.com
Hi,

We still haven't found an adequate solution for this issue. Any analysis
rules or hints would be appreciated.

Thanks,
Sun.



fightf...@163.com
 
From: fightf...@163.com
Date: 2015-02-09 11:56
To: user; dev
Subject: Re: Sort Shuffle performance issues about using AppendOnlyMap for 
large data sets
Hi,
The problem still exists. Would any experts take a look at this?

Thanks,
Sun.



fightf...@163.com
 
From: fightf...@163.com
Date: 2015-02-06 17:54
To: user; dev
Subject: Sort Shuffle performance issues about using AppendOnlyMap for large 
data sets
Hi, all
Recently we hit performance issues when using Spark 1.2.0 to read data from
HBase and do some summary work.
Our scenario is: read large data sets from HBase (maybe 100G+ files), form an
hbaseRDD, transform it to a SchemaRDD, group by and aggregate the data into a
much smaller set of summary records, and load the result back into HBase
(Phoenix).

Our main issue: aggregating the large data sets into the summary data sets
takes too long (1 hour+), which seems like unreasonably bad performance. We
attached the dump file and captured the following stacktrace with jstack:

From the stacktrace and dump file we can see that processing large data sets
causes the AppendOnlyMap to grow frequently, leading to a huge map entry size.
We looked at the source code of
org.apache.spark.util.collection.AppendOnlyMap and found that the map is
initialized with a capacity of 64, which seems too small for our use case.

So the question is: has anyone encountered such issues before, and how were
they resolved? I cannot find any JIRA issues for this problem, so if someone
has seen one, please let us know.

More specifically: is there any way for the user to configure the map
capacity in Spark? If so, please tell us how to achieve that.

Best Thanks,
Sun.

Thread 22432: (state = IN_JAVA)
- org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame)
- org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame)
- org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame)
- org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame)
- org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame)
- org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame)
- org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor$Worker.run() @bci=5, line=615 (Interpreted frame)
- java.lang.Thread.run() @bci=11, line=744 (Interpreted frame)


Thread 22431: (state = IN_JAVA)
- org.apache.spark.util.collection.AppendOnlyMap.growTable() @bci=87, line=224 (Compiled frame; information may be imprecise)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.growTable() @bci=1, line=38 (Interpreted frame)
- org.apache.spark.util.collection.AppendOnlyMap.incrementSize() @bci=22, line=198 (Compiled frame)
- org.apache.spark.util.collection.AppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=201, line=145 (Compiled frame)
- org.apache.spark.util.collection.SizeTrackingAppendOnlyMap.changeValue(java.lang.Object, scala.Function2) @bci=3, line=32 (Compiled frame)
- org.apache.spark.util.collection.ExternalSorter.insertAll(scala.collection.Iterator) @bci=141, line=205 (Compiled frame)
- org.apache.spark.shuffle.sort.SortShuffleWriter.write(scala.collection.Iterator) @bci=74, line=58 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=169, line=68 (Interpreted frame)
- org.apache.spark.scheduler.ShuffleMapTask.runTask(org.apache.spark.TaskContext) @bci=2, line=41 (Interpreted frame)
- org.apache.spark.scheduler.Task.run(long) @bci=77, line=56 (Interpreted frame)
- org.apache.spark.executor.Executor$TaskRunner.run() @bci=310, line=196 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor.runWorker(java.util.concurrent.ThreadPoolExecutor$Worker) @bci=95, line=1145 (Interpreted frame)
- java.util.concurrent.ThreadPoolExecutor$W
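
To make the growth cost concrete, here is a small self-contained sketch. It is
not Spark's AppendOnlyMap; it only reproduces the start-at-64-and-double
pattern described above, with an assumed load factor of 0.7 (the real value
may differ), and counts how many grow-and-rehash passes a table goes through
before it can hold a given number of records.

object GrowthSketch {
  // How many times a table that starts at initialCapacity and doubles on each
  // grow must be rebuilt before it can hold `records` entries.
  def growsNeeded(records: Long, initialCapacity: Long = 64L, loadFactor: Double = 0.7): Int = {
    var capacity = initialCapacity
    var grows = 0
    while (capacity * loadFactor < records) {
      capacity *= 2  // each pass allocates a bigger table and rehashes every entry
      grows += 1
    }
    grows
  }

  def main(args: Array[String]): Unit = {
    Seq(100000L, 10000000L, 1000000000L).foreach { n =>
      println(s"$n records -> ${growsNeeded(n)} grow/rehash passes")
    }
  }
}

Every pass rehashes everything inserted so far, which lines up with the
frequent growTable() frames in the traces above.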

[ANNOUNCE] Spark 1.3.0 Snapshot 1

2015-02-11 Thread Patrick Wendell
Hey All,

I've posted Spark 1.3.0 snapshot 1. At this point the 1.3 branch is
ready for community testing and we are strictly merging fixes and
documentation across all components.

The release files, including signatures, digests, etc. can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-snapshot1/

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1068/

The documentation corresponding to this release can be found at:
http://people.apache.org/~pwendell/spark-1.3.0-snapshot1-docs/

Please report any issues with the release to this thread and/or to our
project JIRA. Thanks!

- Patrick
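
For anyone who wants to pull the snapshot into an existing sbt build for
testing, something like the following build.sbt fragment should work. The
artifact version published in the staging repository is not stated in this
announcement, so the version string below is only a placeholder; use whatever
the repository actually lists.

resolvers += "Apache Spark 1.3.0 snapshot 1 staging" at
  "https://repository.apache.org/content/repositories/orgapachespark-1068/"

libraryDependencies += "org.apache.spark" %% "spark-core" % "1.3.0"  // placeholder version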




Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Todd Gao
Oh I see! Thank you very much, Davies. You corrected some of my
misunderstandings.

On Thu, Feb 12, 2015 at 9:50 AM, Davies Liu  wrote:

> Yes.
>
> On Wed, Feb 11, 2015 at 5:44 PM, Todd Gao 
> wrote:
> > Thanks Davies.
> > I am not quite familiar with Spark Streaming. Do you mean that the
> compute
> > routine of DStream is only invoked in the driver node,
> > while only the compute routines of RDD are distributed to the slaves?
> >
> > On Thu, Feb 12, 2015 at 2:38 AM, Davies Liu 
> wrote:
> >>
> >> The CallbackServer is part of Py4j, it's only used in driver, not used
> >> in slaves or workers.
> >>
> >> On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao
> >>  wrote:
> >> > Hi all,
> >> >
> >> > I am reading the code of PySpark and its Streaming module.
> >> >
> >> > In PySpark Streaming, when the `compute` method of the instance of
> >> > PythonTransformedDStream is invoked, a connection to the
> CallbackServer
> >> > is created internally.
> >> > I wonder where is the CallbackServer for each PythonTransformedDStream
> >> > instance on the slave nodes in distributed environment.
> >> > Is there a CallbackServer running on every slave node?
> >> >
> >> > thanks
> >> > Todd
> >
> >
>
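
To make the distinction in this exchange concrete, here is a hedged sketch in
Scala (the split is the same in PySpark, where the driver side additionally
talks to Py4j's CallbackServer): the block passed to transform runs on the
driver once per batch interval when the DStream's compute() is called, while
the function passed to map runs on the executors for each record.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import scala.collection.mutable

object DriverVsExecutorSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("dstream-compute").setMaster("local[2]"))
    val ssc = new StreamingContext(sc, Seconds(1))

    val queue = mutable.Queue(sc.makeRDD(1 to 4))
    val stream = ssc.queueStream(queue)

    val doubled = stream.transform { rdd =>
      // This block runs on the driver, once per batch interval.
      println("[driver] building the batch RDD for this interval")
      rdd.map { x =>
        // This function runs on the executors (local threads here), per record.
        x * 2
      }
    }
    doubled.print()

    ssc.start()
    Thread.sleep(3000)  // let a few (mostly empty) batches run
    ssc.stop(stopSparkContext = true)
  }
}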


Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Davies Liu
Yes.

On Wed, Feb 11, 2015 at 5:44 PM, Todd Gao  wrote:
> Thanks Davies.
> I am not quite familiar with Spark Streaming. Do you mean that the compute
> routine of DStream is only invoked in the driver node,
> while only the compute routines of RDD are distributed to the slaves?
>
> On Thu, Feb 12, 2015 at 2:38 AM, Davies Liu  wrote:
>>
>> The CallbackServer is part of Py4j, it's only used in driver, not used
>> in slaves or workers.
>>
>> On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao
>>  wrote:
>> > Hi all,
>> >
>> > I am reading the code of PySpark and its Streaming module.
>> >
>> > In PySpark Streaming, when the `compute` method of the instance of
>> > PythonTransformedDStream is invoked, a connection to the CallbackServer
>> > is created internally.
>> > I wonder where is the CallbackServer for each PythonTransformedDStream
>> > instance on the slave nodes in distributed environment.
>> > Is there a CallbackServer running on every slave node?
>> >
>> > thanks
>> > Todd
>
>




Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Todd Gao
Thanks Davies.
I am not quite familiar with Spark Streaming. Do you mean that the compute
routine of a DStream is only invoked on the driver node, while the compute
routines of the RDDs are distributed to the slaves?

On Thu, Feb 12, 2015 at 2:38 AM, Davies Liu  wrote:

> The CallbackServer is part of Py4j, it's only used in driver, not used
> in slaves or workers.
>
> On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao
>  wrote:
> > Hi all,
> >
> > I am reading the code of PySpark and its Streaming module.
> >
> > In PySpark Streaming, when the `compute` method of the instance of
> > PythonTransformedDStream is invoked, a connection to the CallbackServer
> > is created internally.
> > I wonder where is the CallbackServer for each PythonTransformedDStream
> > instance on the slave nodes in distributed environment.
> > Is there a CallbackServer running on every slave node?
> >
> > thanks
> > Todd
>


Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
SPARK-5747: Review all Bash scripts for word splitting bugs

I’ll file sub-tasks under this issue. Feel free to pitch in, people!

Nick
​

On Wed Feb 11 2015 at 3:07:51 PM Ted Yu  wrote:

> After some googling / trial and error, I got the following working
> (against a directory with space in its name):
>
> #!/usr/bin/env bash
> OLDIFS="$IFS"  # save it
> IFS="" # don't split on any white space
> dir="$1/*"
> for f in "$dir"; do
>   cat $f
> done
> IFS=$OLDIFS # restore IFS
>
> Cheers
>
> On Wed, Feb 11, 2015 at 2:47 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> The tragic thing here is that I was asked to review the patch that
>> introduced this
>> , and
>> totally missed it... :(
>>
>> On Wed Feb 11 2015 at 2:46:35 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> lol yeah, I changed the path for the email... turned out to be the issue
>>> itself.
>>>
>>>
>>> On Wed Feb 11 2015 at 2:43:09 PM Ted Yu  wrote:
>>>
 I see.
 '/path/to/spark-1.2.1-bin-hadoop2.4' didn't contain space :-)

 On Wed, Feb 11, 2015 at 2:41 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> Found it:
>
> https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-
> 73058f8e51951ec0b4cb3d48ade91a1fR73
>
> GRRR BASH WORD SPLITTING
>
> My path has a space in it...
>
> Nick
>
> On Wed Feb 11 2015 at 2:37:39 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This is what get:
>>
>> spark-1.2.1-bin-hadoop2.4$ ls -1 lib/
>> datanucleus-api-jdo-3.2.6.jar
>> datanucleus-core-3.2.10.jar
>> datanucleus-rdbms-3.2.9.jar
>> spark-1.2.1-yarn-shuffle.jar
>> spark-assembly-1.2.1-hadoop2.4.0.jar
>> spark-examples-1.2.1-hadoop2.4.0.jar
>>
>> So that looks correct… Hmm.
>>
>> Nick
>> ​
>>
>> On Wed Feb 11 2015 at 2:34:51 PM Ted Yu  wrote:
>>
>>> I downloaded 1.2.1 tar ball for hadoop 2.4
>>> I got:
>>>
>>> ls lib/
>>> datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
>>> spark-assembly-1.2.1-hadoop2.4.0.jar
>>> datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
>>>  spark-examples-1.2.1-hadoop2.4.0.jar
>>>
>>> FYI
>>>
>>> On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran
 sbin/start-all.sh
 on my OS X.

 Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoo
 p2.4/lib
 You need to build Spark before running this program.

 Did the same for 1.2.0 and it worked fine.

 Nick
 ​

>>>
>>>

>


Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Ted Yu
After some googling / trial and error, I got the following working (against
a directory with space in its name):

#!/usr/bin/env bash
OLDIFS="$IFS"  # save it
IFS="" # don't split on any white space
dir="$1/*"
for f in "$dir"; do
  cat $f
done
IFS=$OLDIFS # restore IFS

Cheers

On Wed, Feb 11, 2015 at 2:47 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> The tragic thing here is that I was asked to review the patch that
> introduced this
> , and
> totally missed it... :(
>
> On Wed Feb 11 2015 at 2:46:35 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> lol yeah, I changed the path for the email... turned out to be the issue
>> itself.
>>
>>
>> On Wed Feb 11 2015 at 2:43:09 PM Ted Yu  wrote:
>>
>>> I see.
>>> '/path/to/spark-1.2.1-bin-hadoop2.4' didn't contain space :-)
>>>
>>> On Wed, Feb 11, 2015 at 2:41 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 Found it:

 https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-
 73058f8e51951ec0b4cb3d48ade91a1fR73

 GRRR BASH WORD SPLITTING

 My path has a space in it...

 Nick

 On Wed Feb 11 2015 at 2:37:39 PM Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> This is what get:
>
> spark-1.2.1-bin-hadoop2.4$ ls -1 lib/
> datanucleus-api-jdo-3.2.6.jar
> datanucleus-core-3.2.10.jar
> datanucleus-rdbms-3.2.9.jar
> spark-1.2.1-yarn-shuffle.jar
> spark-assembly-1.2.1-hadoop2.4.0.jar
> spark-examples-1.2.1-hadoop2.4.0.jar
>
> So that looks correct… Hmm.
>
> Nick
> ​
>
> On Wed Feb 11 2015 at 2:34:51 PM Ted Yu  wrote:
>
>> I downloaded 1.2.1 tar ball for hadoop 2.4
>> I got:
>>
>> ls lib/
>> datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
>> spark-assembly-1.2.1-hadoop2.4.0.jar
>> datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
>>  spark-examples-1.2.1-hadoop2.4.0.jar
>>
>> FYI
>>
>> On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran
>>> sbin/start-all.sh
>>> on my OS X.
>>>
>>> Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoo
>>> p2.4/lib
>>> You need to build Spark before running this program.
>>>
>>> Did the same for 1.2.0 and it worked fine.
>>>
>>> Nick
>>> ​
>>>
>>
>>
>>>


Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
The tragic thing here is that I was asked to review the patch that
introduced this, and totally missed it... :(

On Wed Feb 11 2015 at 2:46:35 PM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> lol yeah, I changed the path for the email... turned out to be the issue
> itself.
>
>
> On Wed Feb 11 2015 at 2:43:09 PM Ted Yu  wrote:
>
>> I see.
>> '/path/to/spark-1.2.1-bin-hadoop2.4' didn't contain space :-)
>>
>> On Wed, Feb 11, 2015 at 2:41 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> Found it:
>>>
>>> https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-
>>> 73058f8e51951ec0b4cb3d48ade91a1fR73
>>>
>>> GRRR BASH WORD SPLITTING
>>>
>>> My path has a space in it...
>>>
>>> Nick
>>>
>>> On Wed Feb 11 2015 at 2:37:39 PM Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 This is what get:

 spark-1.2.1-bin-hadoop2.4$ ls -1 lib/
 datanucleus-api-jdo-3.2.6.jar
 datanucleus-core-3.2.10.jar
 datanucleus-rdbms-3.2.9.jar
 spark-1.2.1-yarn-shuffle.jar
 spark-assembly-1.2.1-hadoop2.4.0.jar
 spark-examples-1.2.1-hadoop2.4.0.jar

 So that looks correct… Hmm.

 Nick
 ​

 On Wed Feb 11 2015 at 2:34:51 PM Ted Yu  wrote:

> I downloaded 1.2.1 tar ball for hadoop 2.4
> I got:
>
> ls lib/
> datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
> spark-assembly-1.2.1-hadoop2.4.0.jar
> datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
>  spark-examples-1.2.1-hadoop2.4.0.jar
>
> FYI
>
> On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran
>> sbin/start-all.sh
>> on my OS X.
>>
>> Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoo
>> p2.4/lib
>> You need to build Spark before running this program.
>>
>> Did the same for 1.2.0 and it worked fine.
>>
>> Nick
>> ​
>>
>
>
>>


Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
lol yeah, I changed the path for the email... turned out to be the issue
itself.

On Wed Feb 11 2015 at 2:43:09 PM Ted Yu  wrote:

> I see.
> '/path/to/spark-1.2.1-bin-hadoop2.4' didn't contain space :-)
>
> On Wed, Feb 11, 2015 at 2:41 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> Found it:
>>
>>
>> https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-73058f8e51951ec0b4cb3d48ade91a1fR73
>>
>> GRRR BASH WORD SPLITTING
>>
>> My path has a space in it...
>>
>> Nick
>>
>> On Wed Feb 11 2015 at 2:37:39 PM Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> This is what get:
>>>
>>> spark-1.2.1-bin-hadoop2.4$ ls -1 lib/
>>> datanucleus-api-jdo-3.2.6.jar
>>> datanucleus-core-3.2.10.jar
>>> datanucleus-rdbms-3.2.9.jar
>>> spark-1.2.1-yarn-shuffle.jar
>>> spark-assembly-1.2.1-hadoop2.4.0.jar
>>> spark-examples-1.2.1-hadoop2.4.0.jar
>>>
>>> So that looks correct… Hmm.
>>>
>>> Nick
>>> ​
>>>
>>> On Wed Feb 11 2015 at 2:34:51 PM Ted Yu  wrote:
>>>
 I downloaded 1.2.1 tar ball for hadoop 2.4
 I got:

 ls lib/
 datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
 spark-assembly-1.2.1-hadoop2.4.0.jar
 datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
  spark-examples-1.2.1-hadoop2.4.0.jar

 FYI

 On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas <
 nicholas.cham...@gmail.com> wrote:

> I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran
> sbin/start-all.sh
> on my OS X.
>
> Failed to find Spark assembly in /path/to/spark-1.2.1-bin-
> hadoop2.4/lib
> You need to build Spark before running this program.
>
> Did the same for 1.2.0 and it worked fine.
>
> Nick
> ​
>


>


Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Ted Yu
I see.
'/path/to/spark-1.2.1-bin-hadoop2.4' didn't contain space :-)

On Wed, Feb 11, 2015 at 2:41 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> Found it:
>
>
> https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-73058f8e51951ec0b4cb3d48ade91a1fR73
>
> GRRR BASH WORD SPLITTING
>
> My path has a space in it...
>
> Nick
>
> On Wed Feb 11 2015 at 2:37:39 PM Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> This is what get:
>>
>> spark-1.2.1-bin-hadoop2.4$ ls -1 lib/
>> datanucleus-api-jdo-3.2.6.jar
>> datanucleus-core-3.2.10.jar
>> datanucleus-rdbms-3.2.9.jar
>> spark-1.2.1-yarn-shuffle.jar
>> spark-assembly-1.2.1-hadoop2.4.0.jar
>> spark-examples-1.2.1-hadoop2.4.0.jar
>>
>> So that looks correct… Hmm.
>>
>> Nick
>> ​
>>
>> On Wed Feb 11 2015 at 2:34:51 PM Ted Yu  wrote:
>>
>>> I downloaded 1.2.1 tar ball for hadoop 2.4
>>> I got:
>>>
>>> ls lib/
>>> datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
>>> spark-assembly-1.2.1-hadoop2.4.0.jar
>>> datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
>>>  spark-examples-1.2.1-hadoop2.4.0.jar
>>>
>>> FYI
>>>
>>> On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas <
>>> nicholas.cham...@gmail.com> wrote:
>>>
 I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran
 sbin/start-all.sh
 on my OS X.

 Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoop2.4/lib
 You need to build Spark before running this program.

 Did the same for 1.2.0 and it worked fine.

 Nick
 ​

>>>
>>>


Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
Found it:

https://github.com/apache/spark/compare/v1.2.0...v1.2.1#diff-73058f8e51951ec0b4cb3d48ade91a1fR73

GRRR BASH WORD SPLITTING

My path has a space in it...

Nick

On Wed Feb 11 2015 at 2:37:39 PM Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> This is what get:
>
> spark-1.2.1-bin-hadoop2.4$ ls -1 lib/
> datanucleus-api-jdo-3.2.6.jar
> datanucleus-core-3.2.10.jar
> datanucleus-rdbms-3.2.9.jar
> spark-1.2.1-yarn-shuffle.jar
> spark-assembly-1.2.1-hadoop2.4.0.jar
> spark-examples-1.2.1-hadoop2.4.0.jar
>
> So that looks correct… Hmm.
>
> Nick
> ​
>
> On Wed Feb 11 2015 at 2:34:51 PM Ted Yu  wrote:
>
>> I downloaded 1.2.1 tar ball for hadoop 2.4
>> I got:
>>
>> ls lib/
>> datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
>> spark-assembly-1.2.1-hadoop2.4.0.jar
>> datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
>>  spark-examples-1.2.1-hadoop2.4.0.jar
>>
>> FYI
>>
>> On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas <
>> nicholas.cham...@gmail.com> wrote:
>>
>>> I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran
>>> sbin/start-all.sh
>>> on my OS X.
>>>
>>> Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoop2.4/lib
>>> You need to build Spark before running this program.
>>>
>>> Did the same for 1.2.0 and it worked fine.
>>>
>>> Nick
>>> ​
>>>
>>
>>


Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
This is what I get:

spark-1.2.1-bin-hadoop2.4$ ls -1 lib/
datanucleus-api-jdo-3.2.6.jar
datanucleus-core-3.2.10.jar
datanucleus-rdbms-3.2.9.jar
spark-1.2.1-yarn-shuffle.jar
spark-assembly-1.2.1-hadoop2.4.0.jar
spark-examples-1.2.1-hadoop2.4.0.jar

So that looks correct… Hmm.

Nick
​

On Wed Feb 11 2015 at 2:34:51 PM Ted Yu  wrote:

> I downloaded 1.2.1 tar ball for hadoop 2.4
> I got:
>
> ls lib/
> datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar
> spark-assembly-1.2.1-hadoop2.4.0.jar
> datanucleus-core-3.2.10.jarspark-1.2.1-yarn-shuffle.jar
>  spark-examples-1.2.1-hadoop2.4.0.jar
>
> FYI
>
> On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas <
> nicholas.cham...@gmail.com> wrote:
>
>> I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran
>> sbin/start-all.sh
>> on my OS X.
>>
>> Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoop2.4/lib
>> You need to build Spark before running this program.
>>
>> Did the same for 1.2.0 and it worked fine.
>>
>> Nick
>> ​
>>
>
>


Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Ted Yu
I downloaded the 1.2.1 tarball for Hadoop 2.4 and got:

ls lib/
datanucleus-api-jdo-3.2.6.jar  datanucleus-rdbms-3.2.9.jar   spark-assembly-1.2.1-hadoop2.4.0.jar
datanucleus-core-3.2.10.jar    spark-1.2.1-yarn-shuffle.jar  spark-examples-1.2.1-hadoop2.4.0.jar

FYI

On Wed, Feb 11, 2015 at 2:27 PM, Nicholas Chammas <
nicholas.cham...@gmail.com> wrote:

> I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran sbin/start-all.sh
> on my OS X.
>
> Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoop2.4/lib
> You need to build Spark before running this program.
>
> Did the same for 1.2.0 and it worked fine.
>
> Nick
> ​
>


Re: 1.2.1 start-all.sh broken?

2015-02-11 Thread Sean Owen
Seems to work OK for me on OS X. I ran ./sbin/start-all.sh from the
root. Both processes say they started successfully.

On Wed, Feb 11, 2015 at 10:27 PM, Nicholas Chammas
 wrote:
> I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran sbin/start-all.sh
> on my OS X.
>
> Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoop2.4/lib
> You need to build Spark before running this program.
>
> Did the same for 1.2.0 and it worked fine.
>
> Nick
>




1.2.1 start-all.sh broken?

2015-02-11 Thread Nicholas Chammas
I just downloaded 1.2.1 pre-built for Hadoop 2.4+ and ran sbin/start-all.sh
on my OS X.

Failed to find Spark assembly in /path/to/spark-1.2.1-bin-hadoop2.4/lib
You need to build Spark before running this program.

Did the same for 1.2.0 and it worked fine.

Nick
​


numpy on PyPy - potential benefit to PySpark

2015-02-11 Thread Nicholas Chammas
Random question for the PySpark and Python experts/enthusiasts on here:

How big of a deal would it be for PySpark and PySpark users if you could
run numpy on PyPy?

PySpark already supports running on PyPy, but libraries like MLlib that use
numpy are not supported.

There is an ongoing initiative to support numpy on PyPy, and they are taking
donations to support the effort.

I’m wondering if any companies using PySpark in production would be
interested in pushing this initiative along, or if it’s not that big of a
deal.

Nick
​


[ml] Lost persistence for fold in crossvalidation.

2015-02-11 Thread Peter Rudenko
Hi, I have a problem using Spark 1.2 with Pipeline + GridSearch +
LogisticRegression. I’ve reimplemented the LogisticRegression.fit method and
commented out instances.unpersist():


override def fit(dataset: SchemaRDD, paramMap: ParamMap): LogisticRegressionModel = {
  println(s"Fitting dataset ${dataset.take(1000).toSeq.hashCode()} with ParamMap $paramMap.")
  transformSchema(dataset.schema, paramMap, logging = true)
  import dataset.sqlContext._
  val map = this.paramMap ++ paramMap
  val instances = dataset.select(map(labelCol).attr, map(featuresCol).attr)
    .map {
      case Row(label: Double, features: Vector) =>
        LabeledPoint(label, features)
    }

  if (instances.getStorageLevel == StorageLevel.NONE) {
    println("Instances not persisted")
    instances.persist(StorageLevel.MEMORY_AND_DISK)
  }

  val lr = (new LogisticRegressionWithLBFGS)
    .setValidateData(false)
    .setIntercept(true)
  lr.optimizer
    .setRegParam(map(regParam))
    .setNumIterations(map(maxIter))
  val lrm = new LogisticRegressionModel(this, map, lr.run(instances).weights)
  // instances.unpersist()
  // copy model params
  Params.inheritValues(map, this, lrm)
  lrm
}

CrossValidator feeds the same SchemaRDD for each parameter (same hash code),
but somewhere the cache is being flushed. There is enough memory. Here’s the
output:


|Fitting dataset 2051470010 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.1
}.
Instances not persisted
Fitting dataset 2051470010 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.01
}.
Instances not persisted
Fitting dataset 2051470010 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.001
}.
Instances not persisted
Fitting dataset 802615223 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.1
}.
Instances not persisted
Fitting dataset 802615223 with ParamMap {
DRLogisticRegression-f35ae4d3-regParam: 0.01
}.
Instances not persisted
|

I have 3 parameters in GridSearch and 3 folds for CrossValidation:

val paramGrid = new ParamGridBuilder()
  .addGrid(model.regParam, Array(0.1, 0.01, 0.001))
  .build()

crossval.setEstimatorParamMaps(paramGrid)
crossval.setNumFolds(3)

I assume that the data should be read and cached only 3 times
((1 to numFolds).combinations(2)) and be independent of the number of
parameters, but instead the data is being read and cached 9 times.


Thanks,
Peter Rudenko
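
A possible workaround while this is investigated, assuming the extra reads
come from the unpersisted input rather than from something CrossValidator does
internally, is to persist the data once up front so the repeated fits hit the
block manager. The sketch below is self-contained and only demonstrates the
caching effect with a plain RDD and a read counter, not the ml Pipeline
itself.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PersistOnceSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("persist-once").setMaster("local[2]"))
    val reads = sc.accumulator(0)  // counts how often the "source" records are materialized

    val source = sc.parallelize(1 to 1000, 4).map { i => reads += 1; i.toDouble }
    source.persist(StorageLevel.MEMORY_AND_DISK)  // comment this out to see 3x the reads

    // Simulate three fits over the same data (e.g. one per ParamMap).
    (1 to 3).foreach(_ => source.map(_ * 2).sum())

    println(s"records materialized: ${reads.value}")  // 1000 with persist, 3000 without
    sc.stop()
  }
}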

​


Re: Data source API | sizeInBytes should be moved to *Scan

2015-02-11 Thread Reynold Xin
Unfortunately this is not going to happen for 1.3 (a snapshot release has
already been cut). We need to figure out how we are going to do cardinality
estimation before implementing this. If we need to do this in the future, I
think we can do it in a way that doesn't break existing APIs. Given that I
think this won't bring much benefit right now (the only use for it is
broadcast joins), I think it is ok to push this till later.

The issue I asked still stands. What should the optimizer do w.r.t. filters
that are pushed into the data source? Should it ignore those filters, or
apply statistics again?

This also depends on how we want to do statistics. Hive (and a lot of other
database systems) does a scan to figure out statistics, and put all of
those statistics in a catalog. That is a more unified way to solve the
stats problem.

That said, in the world of federated databases, I can see why we might want
to push cardinality estimation to the data sources, since if the use case
is selecting a very small subset of the data from the sources, then it
might be hard for the statistics to be accurate in the catalog built from
data scan.



On Wed, Feb 11, 2015 at 10:47 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:

> Circling back on this. Did you get a chance to re-look at this?
>
> Thanks,
> Aniket
>
> On Sun, Feb 8, 2015, 2:53 AM Aniket Bhatnagar 
> wrote:
>
>> Thanks for looking into this. If this true, isn't this an issue today?
>> The default implementation of sizeInBytes is 1 + broadcast threshold. So,
>> if catalyst's cardinality estimation estimates even a small filter
>> selectivity, it will result in broadcasting the relation. Therefore,
>> shouldn't the default be much higher than broadcast threshold?
>>
>> Also, since the default implementation of sizeInBytes already exists in
>> BaseRelation, I am not sure why the same/similar default implementation
>> can't be provided with in *Scan specific sizeInBytes functions and have
>> Catalyst always trust the size returned by DataSourceAPI (with default
>> implementation being to never broadcast). Another thing that could be done
>> is have sizeInBytes return Option[Long] so that Catalyst explicitly knows
>> when DataSource was able to optimize the size. The reason why I would push
>> for sizeInBytes in *Scan interfaces is because at times the data source
>> implementation can more accurately predict the size output. For example,
>> DataSource implementations for MongoDB, ElasticSearch, Cassandra, etc can
>> easy use filter push downs to query the underlying storage to predict the
>> size. Such predictions will be more accurate than Catalyst's prediction.
>> Therefore, if its not a fundamental change in Catalyst, I would think this
>> makes sense.
>>
>>
>> Thanks,
>> Aniket
>>
>>
>> On Sat, Feb 7, 2015, 4:50 AM Reynold Xin  wrote:
>>
>>> We thought about this today after seeing this email. I actually built a
>>> patch for this (adding filter/column to data source stat estimation), but
>>> ultimately dropped it due to the potential problems the change the cause.
>>>
>>> The main problem I see is that column pruning/predicate pushdowns are
>>> advisory, i.e. the data source might or might not apply those filters.
>>>
>>> Without significantly complicating the data source API, it is hard for
>>> the optimizer (and future cardinality estimation) to know whether the
>>> filter/column pushdowns are advisory, and whether to incorporate that in
>>> cardinality estimation.
>>>
>>> Imagine this scenario: a data source applies a filter and estimates the
>>> filter's selectivity is 0.1, then the data set is reduced to 10% of the
>>> size. Catalyst's own cardinality estimation estimates the filter
>>> selectivity to 0.1 again, and thus the estimated data size is now 1% of the
>>> original data size, lower than some threshold. Catalyst decides to
>>> broadcast the table. The actual table size is actually 10x the size.
>>>
>>>
>>>
>>>
>>>
>>> On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar <
>>> aniket.bhatna...@gmail.com> wrote:
>>>
 Hi Spark SQL committers

 I have started experimenting with data sources API and I was wondering
 if
 it makes sense to move the method sizeInBytes from BaseRelation to Scan
 interfaces. This is because that a relation may be able to leverage
 filter
 push down to estimate size potentially making a very large relation
 broadcast-able. Thoughts?

 Aniket

>>>
>>>
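
For readers following the thread, a minimal sketch of what the current API
allows, assuming the 1.3-era org.apache.spark.sql.sources interfaces
(BaseRelation, TableScan). The relation below overrides sizeInBytes, but
because that method lives on BaseRelation it sees no pushed filters, so the
estimate has to describe the whole table, which is exactly the limitation
Aniket points at; the row count is a hypothetical stand-in for a real store's
statistics API. A *Scan-level sizeInBytes, or the Option[Long] variant
suggested above, would let such an estimate take the pushed filters into
account.

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.sql.sources.{BaseRelation, TableScan}
import org.apache.spark.sql.types.{LongType, StructField, StructType}

class CountingRelation(@transient val sqlContext: SQLContext) extends BaseRelation with TableScan {
  override def schema: StructType =
    StructType(Seq(StructField("id", LongType, nullable = false)))

  // Hypothetical stand-in for asking the backing store (HBase, Cassandra, ...)
  // how many rows it holds.
  private def estimatedRowCount: Long = 1000000L

  // Filter-independent estimate: without knowing which predicates will be
  // pushed down, this has to cover the entire table.
  override def sizeInBytes: Long = estimatedRowCount * 8L  // 8 bytes per LongType value

  override def buildScan(): RDD[Row] =
    sqlContext.sparkContext.parallelize(0L until estimatedRowCount).map(Row(_))
}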


Re: Data source API | sizeInBytes should be moved to *Scan

2015-02-11 Thread Aniket Bhatnagar
Circling back on this. Did you get a chance to re-look at this?

Thanks,
Aniket

On Sun, Feb 8, 2015, 2:53 AM Aniket Bhatnagar 
wrote:

> Thanks for looking into this. If this true, isn't this an issue today? The
> default implementation of sizeInBytes is 1 + broadcast threshold. So, if
> catalyst's cardinality estimation estimates even a small filter
> selectivity, it will result in broadcasting the relation. Therefore,
> shouldn't the default be much higher than broadcast threshold?
>
> Also, since the default implementation of sizeInBytes already exists in
> BaseRelation, I am not sure why the same/similar default implementation
> can't be provided with in *Scan specific sizeInBytes functions and have
> Catalyst always trust the size returned by DataSourceAPI (with default
> implementation being to never broadcast). Another thing that could be done
> is have sizeInBytes return Option[Long] so that Catalyst explicitly knows
> when DataSource was able to optimize the size. The reason why I would push
> for sizeInBytes in *Scan interfaces is because at times the data source
> implementation can more accurately predict the size output. For example,
> DataSource implementations for MongoDB, ElasticSearch, Cassandra, etc can
> easy use filter push downs to query the underlying storage to predict the
> size. Such predictions will be more accurate than Catalyst's prediction.
> Therefore, if its not a fundamental change in Catalyst, I would think this
> makes sense.
>
>
> Thanks,
> Aniket
>
>
> On Sat, Feb 7, 2015, 4:50 AM Reynold Xin  wrote:
>
>> We thought about this today after seeing this email. I actually built a
>> patch for this (adding filter/column to data source stat estimation), but
>> ultimately dropped it due to the potential problems the change the cause.
>>
>> The main problem I see is that column pruning/predicate pushdowns are
>> advisory, i.e. the data source might or might not apply those filters.
>>
>> Without significantly complicating the data source API, it is hard for
>> the optimizer (and future cardinality estimation) to know whether the
>> filter/column pushdowns are advisory, and whether to incorporate that in
>> cardinality estimation.
>>
>> Imagine this scenario: a data source applies a filter and estimates the
>> filter's selectivity is 0.1, then the data set is reduced to 10% of the
>> size. Catalyst's own cardinality estimation estimates the filter
>> selectivity to 0.1 again, and thus the estimated data size is now 1% of the
>> original data size, lower than some threshold. Catalyst decides to
>> broadcast the table. The actual table size is actually 10x the size.
>>
>>
>>
>>
>>
>> On Fri, Feb 6, 2015 at 3:39 AM, Aniket Bhatnagar <
>> aniket.bhatna...@gmail.com> wrote:
>>
>>> Hi Spark SQL committers
>>>
>>> I have started experimenting with data sources API and I was wondering if
>>> it makes sense to move the method sizeInBytes from BaseRelation to Scan
>>> interfaces. This is because that a relation may be able to leverage
>>> filter
>>> push down to estimate size potentially making a very large relation
>>> broadcast-able. Thoughts?
>>>
>>> Aniket
>>>
>>
>>


Re: CallbackServer in PySpark Streaming

2015-02-11 Thread Davies Liu
The CallbackServer is part of Py4j, it's only used in driver, not used
in slaves or workers.

On Wed, Feb 11, 2015 at 12:32 AM, Todd Gao
 wrote:
> Hi all,
>
> I am reading the code of PySpark and its Streaming module.
>
> In PySpark Streaming, when the `compute` method of the instance of
> PythonTransformedDStream is invoked, a connection to the CallbackServer
> is created internally.
> I wonder where is the CallbackServer for each PythonTransformedDStream
> instance on the slave nodes in distributed environment.
> Is there a CallbackServer running on every slave node?
>
> thanks
> Todd




[GraphX] Estimating Average distance of a big graph using GraphX

2015-02-11 Thread James
Hello,

Recently I have been trying to estimate the average distance of a big graph
using Spark with the help of
[HyperANF](http://dl.acm.org/citation.cfm?id=1963493).

It works like the Connected Components algorithm, except that the attribute of
a vertex is a HyperLogLog counter which, at the k-th iteration, estimates the
number of vertices reachable from it within k hops.

I have successfully run the code on a graph with 20M vertices, but I still
need help:

*I think the code could work more efficiently, especially the "send message"
function, but I am not sure what will happen if a vertex receives no message
in an iteration.*

Here is my code: https://github.com/alcaid1801/Erdos

Any feedback is appreciated.
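
A toy version of the iteration may help frame the send-message question. The
sketch below is not the code in the linked repo: it swaps the HyperLogLog
counters for exact Set[VertexId] attributes (so it will not scale) and uses
GraphX's Pregel operator. As far as I understand the Pregel loop, a vertex
that receives no message in a round is simply skipped: its vertex program is
not invoked and its attribute stays unchanged until a later message arrives.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx._

object ToyNeighbourhoodFunction {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("toy-anf").setMaster("local[2]"))

    // A small path graph 1 -> 2 -> 3 -> 4; each vertex starts knowing only itself.
    val edges = sc.parallelize(Seq(Edge(1L, 2L, ()), Edge(2L, 3L, ()), Edge(3L, 4L, ())))
    val graph = Graph.fromEdges(edges, defaultValue = ()).mapVertices((id, _) => Set(id))

    val result = Pregel(graph, initialMsg = Set.empty[VertexId],
        // Only edges whose destination was updated in the previous round re-send.
        activeDirection = EdgeDirection.In)(
      // Vertex program: after the first superstep this runs only for vertices
      // that actually received a message; the others keep their attribute as-is.
      vprog = (_, reached, msg) => reached ++ msg,
      // An out-neighbour's set flows backwards along the edge, but only if it
      // would teach the source something new; otherwise no message is sent.
      sendMsg = triplet =>
        if ((triplet.dstAttr -- triplet.srcAttr).nonEmpty) Iterator((triplet.srcId, triplet.dstAttr))
        else Iterator.empty,
      mergeMsg = _ ++ _)

    result.vertices.collect().sortBy(_._1).foreach { case (id, reached) =>
      println(s"vertex $id reaches ${reached.size} vertices (including itself)")
    }
    sc.stop()
  }
}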


CallbackServer in PySpark Streaming

2015-02-11 Thread Todd Gao
Hi all,

I am reading the code of PySpark and its Streaming module.

In PySpark Streaming, when the `compute` method of a
PythonTransformedDStream instance is invoked, a connection to the
CallbackServer is created internally.
I wonder where the CallbackServer is for each PythonTransformedDStream
instance on the slave nodes in a distributed environment.
Is there a CallbackServer running on every slave node?

thanks
Todd