Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Sea
Thanks, Yin Huai
I worked it out.
I used JDK 1.7 to build Spark 1.4.0, but my YARN cluster runs on JDK 1.6.
However, java.version in pom.xml is 1.6, so the exception had me confused.




------------------ Original message ------------------
From: "Yin Huai"
Date: Thu, Jun 18, 2015, 11:19
To: "Sea" <261810...@qq.com>
Cc: "dev"
Subject: Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError:
org/apache/spark/deploy/yarn/ExecutorLauncher



Is it the full stack trace?

On Thu, Jun 18, 2015 at 6:39 AM, Sea <261810...@qq.com> wrote:
Hi, all:


I want to run spark sql on yarn(yarn-client), but ... I already set 
"spark.yarn.jar" and  "spark.jars" in conf/spark-defaults.conf.
./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 > game.txt
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/spark/deploy/yarn/ExecutorLauncher
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.ExecutorLauncher
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.deploy.yarn.ExecutorLauncher.  
Program will exit.







Can anyone help?

Re: Increase partition count (repartition) without shuffle

2015-06-18 Thread Mridul Muralidharan
If you can scan the input twice, you can of course do a per-partition count and
build a custom RDD which can repartition without a shuffle.
But nothing off the shelf as Sandy mentioned.

Regards
Mridul
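
A rough, untested sketch of that custom-RDD idea (the names SplitPartitionsRDD and
SlicePartition are made up for illustration; this variant skips the per-partition
count and instead gives each child partition a narrow dependency on one parent
partition, re-reading that parent once per slice, so there is still no shuffle):

import scala.reflect.ClassTag
import org.apache.spark.{NarrowDependency, Partition, TaskContext}
import org.apache.spark.rdd.RDD

// Hypothetical child partition: slice `slice` out of `numSlices` of one parent partition.
private case class SlicePartition(
    index: Int, parent: Partition, slice: Int, numSlices: Int) extends Partition

// Splits every partition of `prev` into `numSlices` child partitions, no shuffle.
class SplitPartitionsRDD[T: ClassTag](prev: RDD[T], numSlices: Int)
  extends RDD[T](prev.context, Seq(new NarrowDependency[T](prev) {
    override def getParents(partitionId: Int): Seq[Int] = Seq(partitionId / numSlices)
  })) {

  override protected def getPartitions: Array[Partition] =
    prev.partitions.flatMap { p =>
      (0 until numSlices).map(s =>
        SlicePartition(p.index * numSlices + s, p, s, numSlices): Partition)
    }

  override def compute(split: Partition, context: TaskContext): Iterator[T] = {
    val sp = split.asInstanceOf[SlicePartition]
    // Re-read the parent partition and keep only this slice's elements
    // (round-robin by element index), so no data ever moves between executors.
    prev.iterator(sp.parent, context).zipWithIndex.collect {
      case (x, i) if i % sp.numSlices == sp.slice => x
    }
  }
}

// Usage: val wider = new SplitPartitionsRDD(rdd, 4)  // 4x the partitions

The trade-off is that each parent partition is scanned numSlices times, so this only
pays off when the downstream work (e.g. the sort discussed below) dominates the
extra reads.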

On Thursday, June 18, 2015, Sandy Ryza  wrote:

> Hi Alexander,
>
> There is currently no way to create an RDD with more partitions than its
> parent RDD without causing a shuffle.
>
> However, if the files are splittable, you can set the Hadoop
> configurations that control split size to something smaller so that the
> HadoopRDD ends up with more partitions.
>
> -Sandy
>
> On Thu, Jun 18, 2015 at 2:26 PM, Ulanov, Alexander <
> alexander.ula...@hp.com
> > wrote:
>
>>  Hi,
>>
>>
>>
>> Is there a way to increase the number of partitions of an RDD without
>> causing a shuffle? I’ve found JIRA issue
>> https://issues.apache.org/jira/browse/SPARK-5997; however, there is no
>> implementation yet.
>>
>> Just in case, I am reading data from ~300 big binary files, which results
>> in 300 partitions. Then I need to sort my RDD, but it crashes with an
>> out-of-memory exception. If I change the number of partitions to 2000, the
>> sort works OK, but the repartition itself takes a lot of time due to the
>> shuffle.
>>
>>
>>
>> Best regards, Alexander
>>
>
>


Re: Increase partition count (repartition) without shuffle

2015-06-18 Thread Sandy Ryza
Hi Alexander,

There is currently no way to create an RDD with more partitions than its
parent RDD without causing a shuffle.

However, if the files are splittable, you can set the Hadoop configurations
that control split size to something smaller so that the HadoopRDD ends up
with more partitions.

-Sandy
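
For splittable input, a minimal sketch of that approach (the 32 MB figure, the path,
and the use of text input are made up for illustration; sc is the existing
SparkContext, and mapreduce.input.fileinputformat.split.maxsize is the new-API
property -- the old API reads mapred.max.split.size instead):

import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat

// Ask the input format for at most 32 MB per split, so each large file is
// read as several partitions instead of one -- no shuffle involved.
sc.hadoopConfiguration.set(
  "mapreduce.input.fileinputformat.split.maxsize", (32L * 1024 * 1024).toString)

val lines = sc
  .newAPIHadoopFile[LongWritable, Text, TextInputFormat]("hdfs:///data/big-files/*")
  .map(_._2.toString)

println(lines.partitions.length)  // now larger than the number of input files

Note that this only helps if the files are in a splittable format; custom binary
files would need an InputFormat whose isSplitable returns true.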

On Thu, Jun 18, 2015 at 2:26 PM, Ulanov, Alexander 
wrote:

>  Hi,
>
>
>
> Is there a way to increase the number of partitions of an RDD without causing
> a shuffle? I’ve found JIRA issue
> https://issues.apache.org/jira/browse/SPARK-5997; however, there is no
> implementation yet.
>
> Just in case, I am reading data from ~300 big binary files, which results
> in 300 partitions. Then I need to sort my RDD, but it crashes with an
> out-of-memory exception. If I change the number of partitions to 2000, the
> sort works OK, but the repartition itself takes a lot of time due to the
> shuffle.
>
>
>
> Best regards, Alexander
>


Increase partition count (repartition) without shuffle

2015-06-18 Thread Ulanov, Alexander
Hi,

Is there a way to increase the number of partitions of an RDD without causing a
shuffle? I've found JIRA issue https://issues.apache.org/jira/browse/SPARK-5997;
however, there is no implementation yet.

Just in case, I am reading data from ~300 big binary files, which results in
300 partitions. Then I need to sort my RDD, but it crashes with an out-of-memory
exception. If I change the number of partitions to 2000, the sort works OK, but
the repartition itself takes a lot of time due to the shuffle.

Best regards, Alexander


Re: Random Forest driver memory

2015-06-18 Thread Joseph Bradley
Hi Isca,

Could you please give more details?  Data size, model parameters, stack
traces / logs, etc. to help get a better picture?

Thanks,
Joseph

On Wed, Jun 17, 2015 at 9:56 AM, Isca Harmatz  wrote:

> hello,
>
> does anyone have any help on the issue?
>
>
> Isca
>
> On Tue, Jun 16, 2015 at 7:45 AM, Isca Harmatz  wrote:
>
>> hello,
>>
>> I have noticed that the random forest implementation crashes when
>> too many trees or too large a maxDepth is used.
>>
>> I'm guessing that this has something to do with the number of nodes that
>> need to be kept in the driver's memory during the run.
>>
>> But when I examined the node structure, it seems rather small.
>>
>> Does anyone know where the memory issue comes from?
>>
>> thanks,
>>   Isca
>>
>
>


Re: [mllib] Refactoring some spark.mllib model classes in Python not inheriting JavaModelWrapper

2015-06-18 Thread Xiangrui Meng
Hi Yu,

Reducing the code complexity on the Python side is certainly what we
want to see:) We didn't call Java directly in Python models because
Java methods don't work inside RDD closures, e.g.,

rdd.map(lambda x: model.predict(x[1]))

But I agree that for model save/load the implementation should be
simplified. Could you submit a PR and see how much code we can save?

Thanks,
Xiangrui

On Wed, Jun 17, 2015 at 8:15 PM, Yu Ishikawa
 wrote:
> Hi all,
>
> I think we should refactor some machine learning model classes in Python to
> improve maintainability.
> By inheriting the JavaModelWrapper class, we can easily and directly call the
> Scala API for a model without going through PythonMLlibAPI.
>
> In some cases, a machine learning model class in Python has complicated
> variables. That is, it is a little hard to implement import/export methods,
> and it is also a little troublesome to implement the same functionality in
> both Scala and Python. I also think standardizing how to create a model class
> in Python is important.
>
> What do you think about that?
>
> Thanks,
> Yu
>
>
>
> -
> -- Yu Ishikawa
> --
> View this message in context: 
> http://apache-spark-developers-list.1001551.n3.nabble.com/mllib-Refactoring-some-spark-mllib-model-classes-in-Python-not-inheriting-JavaModelWrapper-tp12781.html
> Sent from the Apache Spark Developers List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
> For additional commands, e-mail: dev-h...@spark.apache.org
>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org



Latency between the RDD in Streaming

2015-06-18 Thread anshu shukla
Is there any standard way to measure the latency between RDDs in stream
processing systems, in a distributed setup?

-- 
Thanks & Regards,
Anshu Shukla


Re: Sidebar: issues targeted for 1.4.0

2015-06-18 Thread Nicholas Chammas
> Given fixed time, adding more TODOs generally means other stuff has to be
taken
out for the release. If not, then it happens de facto anyway, which is
worse than managing it on purpose.

+1 to this.

I wouldn't mind helping go through open issues on JIRA targeted for the
next release around RC time to make sure that a) nothing major is getting
missed for the release and b) the JIRA backlog gets trimmed of the cruft
which is constantly building up. It's good housekeeping.

Nick

On Thu, Jun 18, 2015 at 3:23 AM Sean Owen  wrote:

> I also like using Target Version meaningfully. It might be a little
> much to require no Target Version = X before starting an RC. I do
> think it's reasonable to not start the RC with Blockers open.
>
> And here we started the RC with almost 100 TODOs for 1.4.0, most of
> which did not get done. Not the end of the world, but, clearly some
> other decisions were made in the past based on the notion that most of
> those would get done. The 'targeting' is too optimistic. Given fixed
> time, adding more TODOs generally means other stuff has to be taken
> out for the release. If not, then it happens de facto anyway, which is
> worse than managing it on purpose.
>
> Anyway, thanks all for the attention to some cleanup. I'll wait a
> short while and then fix up the rest of them as intelligently as I
> can. Maybe I can push on this a little the next time we have a release
> cycle to see how we're doing with use of Target Version.
>
>
>
>
>
> On Wed, Jun 17, 2015 at 10:03 PM, Heller, Chris 
> wrote:
> > I appreciate targets having the strong meaning you suggest, as it's useful
> > to get a sense of what will realistically be included in a release.
> >
> >
> > Would it make sense (speaking as a relative outsider here) that we would
> > not enter into the RC phase of a release until all JIRA targeting that
> > release were complete?
> >
> > If a JIRA targeting a release is blocking entry to the RC phase, and it's
> > determined that the JIRA should not hold up the release, then it should
> > get re-targeted to the next release.
> >
> > -Chris
> >
> > On 6/17/15, 3:55 PM, "Patrick Wendell"  wrote:
> >
> >>Hey Sean,
> >>
> >>Thanks for bringing this up - I went through and fixed about 10 of
> >>them. Unfortunately there isn't a hard and fast way to resolve them. I
> >>found all of the following:
> >>
> >>- Features that missed the release and needed to be retargeted to 1.5.
> >>- Bugs that missed the release and needed to be retargeted to 1.4.1.
> >>- Issues that were not properly targeted (e.g. someone randomly set
> >>the target version) and should probably be untargeted.
> >>
> >>I'd like to encourage others to do this, especially the more active
> >>developers on different components (Streaming, ML, etc).
> >>
> >>One other question is what the semantics of target version are, which
> >>I don't think we've defined clearly. Is it the target of the person
> >>contributing the feature? Or in some sense the target of the
> >>committership? My preference would be that targeting a JIRA has some
>>strong semantics - i.e. it means the committer targeting it has mentally
> >>allocated time to review a patch for that feature in the timeline of
> >>that release. I.e. prefer to have fewer targeted JIRA's for a release,
> >>and also expect to get most of the targeted features merged into a
> >>release. In the past I think targeting has meant different things to
> >>different people.
> >>
> >>- Patrick
> >>
> >>On Tue, Jun 16, 2015 at 8:09 AM, Josh Rosen 
> wrote:
> >>> Whatever you do, DO NOT use the built-in JIRA 'releases' feature to
> >>>migrate
> >>> issues from 1.4.0 to another version: the JIRA feature will have the
> >>> side-effect of automatically changing the target versions for issues
> >>>that
> >>> have been closed, which is going to be really confusing. I've made this
> >>> mistake once myself and it was a bit of a hassle to clean up.
> >>>
> >>> On Tue, Jun 16, 2015 at 5:24 AM, Sean Owen  wrote:
> 
>  Question: what would happen if I cleared Target Version for everything
>  still marked Target Version = 1.4.0? There are 76 right now, and
>  clearly that's not correct.
> 
>  56 were opened by committers, including issues like "Do X for 1.4".
>  I'd like to understand whether these are resolved but just weren't
>  closed, or else why so many issues are being filed as a todo and not
>  resolved? Slipping things here or there is OK, but these weren't even
>  slipped, just forgotten.
> 
>  On Sat, May 30, 2015 at 3:55 PM, Sean Owen 
> wrote:
>  > In an ideal world, Target Version really is what's going to go in as
>  > far as anyone knows and when new stuff comes up, we all have to figure
>  > out what gets dropped to fit by the release date. Boring, standard
>  > software project management practice. I don't know how realistic that
>  > is, but, I'm wondering how people feel about this, who have filed
>  > these JIRAs?

Re: Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Yin Huai
Is it the full stack trace?

On Thu, Jun 18, 2015 at 6:39 AM, Sea <261810...@qq.com> wrote:

> Hi, all:
>
> I want to run spark sql on yarn(yarn-client), but ... I already set
> "spark.yarn.jar" and  "spark.jars" in conf/spark-defaults.conf.
>
> ./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 > 
> game.txt
>
> Exception in thread "main" java.lang.NoClassDefFoundError:
> org/apache/spark/deploy/yarn/ExecutorLauncher
> Caused by: java.lang.ClassNotFoundException:
> org.apache.spark.deploy.yarn.ExecutorLauncher
> at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
> at java.security.AccessController.doPrivileged(Native Method)
> at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
> at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
> Could not find the main class:
> org.apache.spark.deploy.yarn.ExecutorLauncher.  Program will exit.
>
>
> Can anyone help?
>


Spark-sql(yarn-client) java.lang.NoClassDefFoundError: org/apache/spark/deploy/yarn/ExecutorLauncher

2015-06-18 Thread Sea
Hi, all:


I want to run spark sql on yarn(yarn-client), but ... I already set 
"spark.yarn.jar" and  "spark.jars" in conf/spark-defaults.conf.
./bin/spark-sql -f game.sql --executor-memory 2g --num-executors 100 > game.txt
Exception in thread "main" java.lang.NoClassDefFoundError: 
org/apache/spark/deploy/yarn/ExecutorLauncher
Caused by: java.lang.ClassNotFoundException: 
org.apache.spark.deploy.yarn.ExecutorLauncher
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
Could not find the main class: org.apache.spark.deploy.yarn.ExecutorLauncher.  
Program will exit.





Can anyone help?


Re: Approximate rank-based statistics (median, 95-th percentile, etc.) for Spark

2015-06-18 Thread Nick Pentreath
If it's going into the DataFrame API (which it probably should rather than
in RDD itself) - then it could become a UDT (similar to HyperLogLogUDT)
which would mean it doesn't have to implement Serializable, as it appears
that serialization is taken care of in the UDT def (e.g.
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala#L254
)

That is, if I understand the UDT SerDe mechanism correctly.

On Thu, Jun 11, 2015 at 2:47 AM, Ray Ortigas 
wrote:

> Hi Grega and Reynold,
>
> Grega, if you still want to use t-digest, I filed this PR because I
> thought your t-digest suggestion was a good idea.
>
> https://github.com/tdunning/t-digest/pull/56
>
> If it is helpful feel free to do whatever with it.
>
> Regards,
> Ray
>
>
> On Wed, Jun 10, 2015 at 2:54 PM, Reynold Xin  wrote:
>
>> This email list is good. Just one note -- a lot of people are swamped right
>> before Spark Summit, so you might not get prompt responses this week.
>>
>>
>> On Wed, Jun 10, 2015 at 2:53 PM, Grega Kešpret  wrote:
>>
>>> I have some time to work on it now. What's a good way to continue the
>>> discussions before coding it?
>>>
>>> This e-mail list, JIRA or something else?
>>>
>>> On Mon, Apr 6, 2015 at 12:59 AM, Reynold Xin 
>>> wrote:
>>>
 I think those are great to have. I would put them in the DataFrame API
 though, since this is applying to structured data. Many of the advanced
 functions on the PairRDDFunctions should really go into the DataFrame API
 now we have it.

 One thing that would be great to understand is what state-of-the-art
 alternatives are out there. I did a quick google scholar search using the
 keyword "approximate quantile" and found some older papers. Just the
 first few I found:

 http://www.softnet.tuc.gr/~minos/Papers/sigmod05.pdf  by bell labs


 http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.6.6513&rep=rep1&type=pdf
  by Bruce Lindsay, IBM

 http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf





 On Mon, Apr 6, 2015 at 12:50 AM, Grega Kešpret 
 wrote:

> Hi!
>
> I'd like to get community's opinion on implementing a generic quantile
> approximation algorithm for Spark that is O(n) and requires limited 
> memory.
> I would find it useful and I haven't found any existing implementation. 
> The
> plan was basically to wrap t-digest, implement the
> serialization/deserialization boilerplate and provide
>
> def cdf(x: Double): Double
> def quantile(q: Double): Double
>
>
> on RDD[Double] and RDD[(K, Double)].
>
> Let me know what you think. Any other ideas/suggestions also welcome!
>
> Best,
> Grega
> --
> *Grega Kešpret*
> Senior Software Engineer, Analytics
>
> Skype: gregakespret
> celtra.com  | @celtramobile
> 
>
>

>>>
>>
>
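
For reference, a minimal sketch of Grega's quoted proposal for RDD[Double]
(assuming the com.tdunning:t-digest artifact is on the classpath and that the
TDigest class is Java-serializable -- which is what Ray's pull request above is
about; compression = 100.0 is an arbitrary illustrative setting):

import com.tdunning.math.stats.TDigest
import org.apache.spark.rdd.RDD

// Build one t-digest per partition and merge them towards the driver:
// a single O(n) pass over the data with bounded memory per digest.
def tdigestOf(data: RDD[Double], compression: Double = 100.0): TDigest =
  data.treeAggregate(TDigest.createDigest(compression))(
    seqOp = (digest, x) => { digest.add(x); digest },
    combOp = (a, b) => { a.add(b); a }
  )

def quantile(data: RDD[Double], q: Double): Double = tdigestOf(data).quantile(q)
def cdf(data: RDD[Double], x: Double): Double = tdigestOf(data).cdf(x)

A keyed variant for RDD[(K, Double)] would aggregate one digest per key with
aggregateByKey, and per the discussion above the DataFrame-side version would
wrap the digest in a UDT the way HyperLogLogUDT does.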


Re: Sidebar: issues targeted for 1.4.0

2015-06-18 Thread Sean Owen
I also like using Target Version meaningfully. It might be a little
much to require no Target Version = X before starting an RC. I do
think it's reasonable to not start the RC with Blockers open.

And here we started the RC with almost 100 TODOs for 1.4.0, most of
which did not get done. Not the end of the world, but, clearly some
other decisions were made in the past based on the notion that most of
those would get done. The 'targeting' is too optimistic. Given fixed
time, adding more TODOs generally means other stuff has to be taken
out for the release. If not, then it happens de facto anyway, which is
worse than managing it on purpose.

Anyway, thanks all for the attention to some cleanup. I'll wait a
short while and then fix up the rest of them as intelligently as I
can. Maybe I can push on this a little the next time we have a release
cycle to see how we're doing with use of Target Version.





On Wed, Jun 17, 2015 at 10:03 PM, Heller, Chris  wrote:
> I appreciate targets having the strong meaning you suggest, as it's useful
> to get a sense of what will realistically be included in a release.
>
>
> Would it make sense (speaking as a relative outsider here) that we would
> not enter into the RC phase of a release until all JIRA targeting that
> release were complete?
>
> If a JIRA targeting a release is blocking entry to the RC phase, and it's
> determined that the JIRA should not hold up the release, then it should
> get re-targeted to the next release.
>
> -Chris
>
> On 6/17/15, 3:55 PM, "Patrick Wendell"  wrote:
>
>>Hey Sean,
>>
>>Thanks for bringing this up - I went through and fixed about 10 of
>>them. Unfortunately there isn't a hard and fast way to resolve them. I
>>found all of the following:
>>
>>- Features that missed the release and needed to be retargeted to 1.5.
>>- Bugs that missed the release and needed to be retargeted to 1.4.1.
>>- Issues that were not properly targeted (e.g. someone randomly set
>>the target version) and should probably be untargeted.
>>
>>I'd like to encourage others to do this, especially the more active
>>developers on different components (Streaming, ML, etc).
>>
>>One other question is what the semantics of target version are, which
>>I don't think we've defined clearly. Is it the target of the person
>>contributing the feature? Or in some sense the target of the
>>committership? My preference would be that targeting a JIRA has some
>>strong semantics - i.e. it means the committer targeting it has mentally
>>allocated time to review a patch for that feature in the timeline of
>>that release. I.e. prefer to have fewer targeted JIRA's for a release,
>>and also expect to get most of the targeted features merged into a
>>release. In the past I think targeting has meant different things to
>>different people.
>>
>>- Patrick
>>
>>On Tue, Jun 16, 2015 at 8:09 AM, Josh Rosen  wrote:
>>> Whatever you do, DO NOT use the built-in JIRA 'releases' feature to
>>>migrate
>>> issues from 1.4.0 to another version: the JIRA feature will have the
>>> side-effect of automatically changing the target versions for issues
>>>that
>>> have been closed, which is going to be really confusing. I've made this
>>> mistake once myself and it was a bit of a hassle to clean up.
>>>
>>> On Tue, Jun 16, 2015 at 5:24 AM, Sean Owen  wrote:

 Question: what would happen if I cleared Target Version for everything
 still marked Target Version = 1.4.0? There are 76 right now, and
 clearly that's not correct.

 56 were opened by committers, including issues like "Do X for 1.4".
 I'd like to understand whether these are resolved but just weren't
 closed, or else why so many issues are being filed as a todo and not
 resolved? Slipping things here or there is OK, but these weren't even
 slipped, just forgotten.

 On Sat, May 30, 2015 at 3:55 PM, Sean Owen  wrote:
 > In an ideal world,  Target Version really is what's going to go in as
 > far as anyone knows and when new stuff comes up, we all have to figure
 > out what gets dropped to fit by the release date. Boring, standard
 > software project management practice. I don't know how realistic that
 > is, but, I'm wondering how people feel about this, who have filed
 > these JIRAs?
 >
 > Concretely, should non-Critical issues for 1.4.0 be un-Targeted?
 > should they all be un-Targeted after the release?

 -
 To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
 For additional commands, e-mail: dev-h...@spark.apache.org

>>>
>>
>>-
>>To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
>>For additional commands, e-mail: dev-h...@spark.apache.org
>>

-
To unsubscribe, e-mail: dev-unsubscr...@spark.apache.org
For additional commands, e-mail: dev-h...@spark.apache.org