Re: [SNAPSHOT] Snapshot1 of Spark 1.1.0 has been posted

2014-08-06 Thread Patrick Wendell
Minor correction: the encoded URL in the staging repo link was wrong.
The correct repo is:
https://repository.apache.org/content/repositories/orgapachespark-1025/


On Wed, Aug 6, 2014 at 11:23 PM, Patrick Wendell  wrote:
>
> Hi All,
>
> I've packaged and published a snapshot release of Spark 1.1 for testing. This 
> is being distributed to the community for QA and preview purposes. It is not 
> yet an official RC for voting. Going forward, we'll do preview releases like 
> this for testing ahead of official votes.
>
> The tag of this release is v1.1.0-snapshot1 (commit d428d8):
> https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d428d88418d385d1d04e1b0adcb6b068efe9c7b0
>
> The release files, including signatures, digests, etc can be found at:
> http://people.apache.org/~pwendell/spark-1.1.0-snapshot1/
>
> Release artifacts are signed with the following key:
> https://people.apache.org/keys/committer/pwendell.asc
>
> The staging repository for this release can be found at:
> https://repository.apache.org/content/repositories/orgapachespark-1025/
>
> NOTE: Due to SPARK-2899, docs are not yet available for this release. Docs 
> will be posted ASAP.
>
> To learn more about Apache Spark, please see
> http://spark.apache.org/




[SNAPSHOT] Snapshot1 of Spark 1.1.0 has been posted

2014-08-06 Thread Patrick Wendell
Hi All,

I've packaged and published a snapshot release of Spark 1.1 for testing.
This is being distributed to the community for QA and preview purposes. It
is not yet an official RC for voting. Going forward, we'll do preview
releases like this for testing ahead of official votes.

The tag of this release is v1.1.0-snapshot1 (commit d428d8):
https://git-wip-us.apache.org/repos/asf?p=spark.git;a=commit;h=d428d88418d385d1d04e1b0adcb6b068efe9c7b0

The release files, including signatures, digests, etc can be found at:
http://people.apache.org/~pwendell/spark-1.1.0-snapshot1/

Release artifacts are signed with the following key:
https://people.apache.org/keys/committer/pwendell.asc

The staging repository for this release can be found at:
https://repository.apache.org/content/repositories/orgapachespark-1025/


NOTE: Due to SPARK-2899, docs are not yet available for this release. Docs
will be posted ASAP.

To learn more about Apache Spark, please see
http://spark.apache.org/


Re: Documentation confusing or incorrect for decision trees?

2014-08-06 Thread Sean Owen
It's definitely just a typo. The ordered categories are A, C, B so the
other split can't be A | B, C. Just open a PR.

On Thu, Aug 7, 2014 at 2:11 AM, Matt Forbes  wrote:
> I found the section on ordering categorical features really interesting,
> but the A, B, C example seemed inconsistent. Am I interpreting this passage
> wrong, or are there typos? Aren't the split candidates A | C, B and A, C |
> B ?
>
> For example, for a binary classification problem with one categorical
> feature with three categories A, B and C with corresponding proportion of
> label 1 as 0.2, 0.6 and 0.4, the categorical features are ordered as A
> followed by C followed B or A, B, C. The two split candidates are A | C, B
> and A , B | C where | denotes the split.




Re: Building Spark in Eclipse Kepler

2014-08-06 Thread Sean Owen
(Don't use gen-idea, just open it directly as a Maven project in IntelliJ.)

On Thu, Aug 7, 2014 at 4:53 AM, Ron Gonzalez
 wrote:
> So I downloaded community edition of IntelliJ, and ran sbt/sbt gen-idea.
> I then imported the pom.xml file.
> I'm still getting all sorts of errors from IntelliJ about unresolved 
> dependencies.
> Any suggestions?
>
> Thanks,
> Ron
>
>
> On Wednesday, August 6, 2014 12:29 PM, Ron Gonzalez 
>  wrote:
>
>
>
> Hi,
>   I'm trying to get the apache spark trunk compiling in my Eclipse, but I 
> can't seem to get it going. In particular, I've tried sbt/sbt eclipse, but it 
> doesn't seem to create the eclipse pieces for yarn and other projects. Doing 
> mvn eclipse:eclipse on yarn seems to fail as well as sbt/sbt eclipse just for 
> yarn fails. Is there some documentation available for eclipse? I've gone 
> through the ones on the site, but to no avail.
>   Any tips?
>
> Thanks,
> Ron




Re: Building Spark in Eclipse Kepler

2014-08-06 Thread Ron Gonzalez
Thanks, will give that a try.

Sent from my iPad

> On Aug 6, 2014, at 9:26 PM, DB Tsai  wrote:
> 
> After sbt gen-idea, you can open the IntelliJ project directly without going 
> through pom.xml.
> 
> If you want to compile inside IntelliJ, you have to remove one of the Mesos 
> jars. This is an open issue, and you can find the details in JIRA.
> 
> Sent from my Google Nexus 5
> 
>> On Aug 6, 2014 8:54 PM, "Ron Gonzalez"  wrote:
>> So I downloaded community edition of IntelliJ, and ran sbt/sbt gen-idea.
>> I then imported the pom.xml file.
>> I'm still getting all sorts of errors from IntelliJ about unresolved 
>> dependencies.
>> Any suggestions?
>> 
>> Thanks,
>> Ron
>> 
>> 
>> On Wednesday, August 6, 2014 12:29 PM, Ron Gonzalez 
>>  wrote:
>> 
>> 
>> 
>> Hi,
>>   I'm trying to get the apache spark trunk compiling in my Eclipse, but I 
>> can't seem to get it going. In particular, I've tried sbt/sbt eclipse, but 
>> it doesn't seem to create the eclipse pieces for yarn and other projects. 
>> Doing mvn eclipse:eclipse on yarn seems to fail as well as sbt/sbt eclipse 
>> just for yarn fails. Is there some documentation available for eclipse? I've 
>> gone through the ones on the site, but to no avail.
>>   Any tips?
>> 
>> Thanks,
>> Ron


Re: Building Spark in Eclipse Kepler

2014-08-06 Thread DB Tsai
After sbt gen-idea, you can open the IntelliJ project directly without
going through pom.xml.

If you want to compile inside IntelliJ, you have to remove one of the Mesos
jars. This is an open issue, and you can find the details in JIRA.

Sent from my Google Nexus 5
On Aug 6, 2014 8:54 PM, "Ron Gonzalez"  wrote:

> So I downloaded community edition of IntelliJ, and ran sbt/sbt gen-idea.
> I then imported the pom.xml file.
> I'm still getting all sorts of errors from IntelliJ about unresolved
> dependencies.
> Any suggestions?
>
> Thanks,
> Ron
>
>
> On Wednesday, August 6, 2014 12:29 PM, Ron Gonzalez
>  wrote:
>
>
>
> Hi,
>   I'm trying to get the apache spark trunk compiling in my Eclipse, but I
> can't seem to get it going. In particular, I've tried sbt/sbt eclipse, but
> it doesn't seem to create the eclipse pieces for yarn and other projects.
> Doing mvn eclipse:eclipse on yarn seems to fail as well as sbt/sbt eclipse
> just for yarn fails. Is there some documentation available for eclipse?
> I've gone through the ones on the site, but to no avail.
>   Any tips?
>
> Thanks,
> Ron


Re: Building Spark in Eclipse Kepler

2014-08-06 Thread Ron Gonzalez
So I downloaded community edition of IntelliJ, and ran sbt/sbt gen-idea.
I then imported the pom.xml file.
I'm still getting all sorts of errors from IntelliJ about unresolved 
dependencies.
Any suggestions?

Thanks,
Ron


On Wednesday, August 6, 2014 12:29 PM, Ron Gonzalez 
 wrote:
 


Hi,
  I'm trying to get the apache spark trunk compiling in my Eclipse, but I can't 
seem to get it going. In particular, I've tried sbt/sbt eclipse, but it doesn't 
seem to create the eclipse pieces for yarn and other projects. Doing mvn 
eclipse:eclipse on yarn seems to fail as well as sbt/sbt eclipse just for yarn 
fails. Is there some documentation available for eclipse? I've gone through the 
ones on the site, but to no avail.
  Any tips?

Thanks,
Ron

Re: Building Spark in Eclipse Kepler

2014-08-06 Thread Ron Gonzalez
Ok I'll give it a little more time, and if I can't get it going, I'll switch. I 
am indeed a little disappointed in the Scala IDE plugin for Eclipse so I think 
switching to IntelliJ might be my best bet.

Thanks,
Ron

Sent from my iPad

> On Aug 6, 2014, at 1:43 PM, Sean Owen  wrote:
> 
> I think your best bet by far is to consume the Maven build as-is from
> within Eclipse. I wouldn't try to export a project config from the
> build as there is plenty to get lost in translation.
> 
> Certainly this works well with IntelliJ, and by the by, if you have a
> choice, I would strongly recommend IntelliJ over Eclipse for working
> with Maven and Scala.
> 
> On Wed, Aug 6, 2014 at 8:29 PM, Ron Gonzalez
>  wrote:
>> Hi,
>>  I'm trying to get the apache spark trunk compiling in my Eclipse, but I 
>> can't seem to get it going. In particular, I've tried sbt/sbt eclipse, but 
>> it doesn't seem to create the eclipse pieces for yarn and other projects. 
>> Doing mvn eclipse:eclipse on yarn seems to fail as well as sbt/sbt eclipse 
>> just for yarn fails. Is there some documentation available for eclipse? I've 
>> gone through the ones on the site, but to no avail.
>>  Any tips?
>> 
>> Thanks,
>> Ron




Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-06 Thread Graham Dennis
See my comment on https://issues.apache.org/jira/browse/SPARK-2878 for the
full stacktrace, but it's in the BlockManager/BlockManagerWorker where it's
trying to fulfil a "getBlock" request for another node.  The objects that
would be in the block haven't yet been serialised, and that then causes the
deserialisation to happen on that thread.  See MemoryStore.scala:102.


On 7 August 2014 11:53, Reynold Xin  wrote:

> I don't think it was a conscious design decision to not include the
> application classes in the connection manager serializer. We should fix
> that. Where is it deserializing data in that thread?
>
> 4 might make sense in the long run, but it adds a lot of complexity to the
> code base (whole separate code base, task queue, blocking/non-blocking
> logic within task threads) that can be error prone, so I think it is best
> to stay away from that right now.
>
>
>
>
>
> On Wed, Aug 6, 2014 at 6:47 PM, Graham Dennis 
> wrote:
>
>> Hi Spark devs,
>>
>> I’ve posted an issue on JIRA (
>> https://issues.apache.org/jira/browse/SPARK-2878) which occurs when using
>> Kryo serialisation with a custom Kryo registrator to register custom
>> classes with Kryo.  This is an insidious issue that non-deterministically
>> causes Kryo to have different ID number => class name maps on different
>> nodes, which then causes weird exceptions (ClassCastException,
>> ClassNotFoundException, ArrayIndexOutOfBoundsException) at deserialisation
>> time.  I’ve created a reliable reproduction for the issue here:
>> https://github.com/GrahamDennis/spark-kryo-serialisation
>>
>> I’m happy to try and put a pull request together to try and address this,
>> but it’s not obvious to me the right way to solve this and I’d like to get
>> feedback / ideas on how to address this.
>>
>> The root cause of the problem is a "Failed to run spark.kryo.registrator”
>> error which non-deterministically occurs in some executor processes during
>> operation.  My custom Kryo registrator is in the application jar, and it
>> is
>> accessible on the worker nodes.  This is demonstrated by the fact that
>> most
>> of the time the custom kryo registrator is successfully run.
>>
>> What’s happening is that Kryo serialisation/deserialisation is happening
>> most of the time on an “Executor task launch worker” thread, which has the
>> thread's class loader set to contain the application jar.  This happens in
>> `org.apache.spark.executor.Executor.TaskRunner.run`, and from what I can
>> tell, it is only these threads that have access to the application jar
>> (that contains the custom Kryo registrator).  However, the
>> ConnectionManager threads sometimes need to serialise/deserialise objects
>> to satisfy “getBlock” requests when the objects haven’t previously been
>> serialised.  As the ConnectionManager threads don’t have the application
>> jar available from their class loader, when it tries to look up the custom
>> Kryo registrator, this fails.  Spark then swallows this exception, which
>> results in a different ID number —> class mapping for this kryo instance,
>> and this then causes deserialisation errors later on a different node.
>>
>> A related issue to the issue reported in SPARK-2878 is that Spark probably
>> shouldn’t swallow the ClassNotFound exception for custom Kryo
>> registrators.
>>  The user has explicitly specified this class, and if it deterministically
>> can’t be found, then it may cause problems at serialisation /
>> deserialisation time.  If only sometimes it can’t be found (as in this
>> case), then it leads to a data corruption issue later on.  Either way,
>> we’re better off dying due to the ClassNotFound exception earlier, than
>> the
>> weirder errors later on.
>>
>> I have some ideas on potential solutions to this issue, but I’m keen for
>> experienced eyes to critique these approaches:
>>
>> 1. The simplest approach to fixing this would be to just make the
>> application jar available to the connection manager threads, but I’m
>> guessing it’s a design decision to isolate the application jar to just the
>> executor task runner threads.  Also, I don’t know if there are any other
>> threads that might be interacting with kryo serialisation /
>> deserialisation.
>> 2. Before looking up the custom Kryo registrator, change the thread’s
>> class
>> loader to include the application jar, then restore the class loader after
>> the kryo registrator has been run.  I don’t know if this would have any
>> other side-effects.
>> 3. Always serialise / deserialise on the existing TaskRunner threads,
>> rather than delaying serialisation until later, when it can be done only
>> if
>> needed.  This approach would probably have negative performance
>> consequences.
>> 4. Create a new dedicated thread pool for lazy serialisation /
>> deserialisation that has the application jar on the class path.
>>  Serialisation / deserialisation would be the only thing these threads do,
>> and this would minimise conflicts / interactions between the applica

Re: [SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-06 Thread Reynold Xin
I don't think it was a conscious design decision to not include the
application classes in the connection manager serializer. We should fix
that. Where is it deserializing data in that thread?

4 might make sense in the long run, but it adds a lot of complexity to the
code base (whole separate code base, task queue, blocking/non-blocking
logic within task threads) that can be error prone, so I think it is best
to stay away from that right now.





On Wed, Aug 6, 2014 at 6:47 PM, Graham Dennis 
wrote:

> Hi Spark devs,
>
> I’ve posted an issue on JIRA (
> https://issues.apache.org/jira/browse/SPARK-2878) which occurs when using
> Kryo serialisation with a custom Kryo registrator to register custom
> classes with Kryo.  This is an insidious issue that non-deterministically
> causes Kryo to have different ID number => class name maps on different
> nodes, which then causes weird exceptions (ClassCastException,
> ClassNotFoundException, ArrayIndexOutOfBoundsException) at deserialisation
> time.  I’ve created a reliable reproduction for the issue here:
> https://github.com/GrahamDennis/spark-kryo-serialisation
>
> I’m happy to try and put a pull request together to try and address this,
> but it’s not obvious to me the right way to solve this and I’d like to get
> feedback / ideas on how to address this.
>
> The root cause of the problem is a "Failed to run spark.kryo.registrator”
> error which non-deterministically occurs in some executor processes during
> operation.  My custom Kryo registrator is in the application jar, and it is
> accessible on the worker nodes.  This is demonstrated by the fact that most
> of the time the custom kryo registrator is successfully run.
>
> What’s happening is that Kryo serialisation/deserialisation is happening
> most of the time on an “Executor task launch worker” thread, which has the
> thread's class loader set to contain the application jar.  This happens in
> `org.apache.spark.executor.Executor.TaskRunner.run`, and from what I can
> tell, it is only these threads that have access to the application jar
> (that contains the custom Kryo registrator).  However, the
> ConnectionManager threads sometimes need to serialise/deserialise objects
> to satisfy “getBlock” requests when the objects haven’t previously been
> serialised.  As the ConnectionManager threads don’t have the application
> jar available from their class loader, when it tries to look up the custom
> Kryo registrator, this fails.  Spark then swallows this exception, which
> results in a different ID number —> class mapping for this kryo instance,
> and this then causes deserialisation errors later on a different node.
>
> A related issue to the issue reported in SPARK-2878 is that Spark probably
> shouldn’t swallow the ClassNotFound exception for custom Kryo registrators.
>  The user has explicitly specified this class, and if it deterministically
> can’t be found, then it may cause problems at serialisation /
> deserialisation time.  If only sometimes it can’t be found (as in this
> case), then it leads to a data corruption issue later on.  Either way,
> we’re better off dying due to the ClassNotFound exception earlier, than the
> weirder errors later on.
>
> I have some ideas on potential solutions to this issue, but I’m keen for
> experienced eyes to critique these approaches:
>
> 1. The simplest approach to fixing this would be to just make the
> application jar available to the connection manager threads, but I’m
> guessing it’s a design decision to isolate the application jar to just the
> executor task runner threads.  Also, I don’t know if there are any other
> threads that might be interacting with kryo serialisation /
> deserialisation.
> 2. Before looking up the custom Kryo registrator, change the thread’s class
> loader to include the application jar, then restore the class loader after
> the kryo registrator has been run.  I don’t know if this would have any
> other side-effects.
> 3. Always serialise / deserialise on the existing TaskRunner threads,
> rather than delaying serialisation until later, when it can be done only if
> needed.  This approach would probably have negative performance
> consequences.
> 4. Create a new dedicated thread pool for lazy serialisation /
> deserialisation that has the application jar on the class path.
>  Serialisation / deserialisation would be the only thing these threads do,
> and this would minimise conflicts / interactions between the application
> jar and other jars.
>
> #4 sounds like the best approach to me, but I think would require
> considerable knowledge of Spark internals, which is beyond me at present.
>  Does anyone have any better (and ideally simpler) ideas?
>
> Cheers,
>
> Graham
>


[SPARK-2878] Kryo serialisation with custom Kryo registrator failing

2014-08-06 Thread Graham Dennis
Hi Spark devs,

I’ve posted an issue on JIRA (
https://issues.apache.org/jira/browse/SPARK-2878) which occurs when using
Kryo serialisation with a custom Kryo registrator to register custom
classes with Kryo.  This is an insidious issue that non-deterministically
causes Kryo to have different ID number => class name maps on different
nodes, which then causes weird exceptions (ClassCastException,
ClassNotFoundException, ArrayIndexOutOfBoundsException) at deserialisation
time.  I’ve created a reliable reproduction for the issue here:
https://github.com/GrahamDennis/spark-kryo-serialisation

I’m happy to try and put a pull request together to try and address this,
but it’s not obvious to me the right way to solve this and I’d like to get
feedback / ideas on how to address this.

The root cause of the problem is a "Failed to run spark.kryo.registrator”
error which non-deterministically occurs in some executor processes during
operation.  My custom Kryo registrator is in the application jar, and it is
accessible on the worker nodes.  This is demonstrated by the fact that most
of the time the custom kryo registrator is successfully run.

What’s happening is that Kryo serialisation/deserialisation is happening
most of the time on an “Executor task launch worker” thread, which has the
thread's class loader set to contain the application jar.  This happens in
`org.apache.spark.executor.Executor.TaskRunner.run`, and from what I can
tell, it is only these threads that have access to the application jar
(that contains the custom Kryo registrator).  However, the
ConnectionManager threads sometimes need to serialise/deserialise objects
to satisfy “getBlock” requests when the objects haven’t previously been
serialised.  As the ConnectionManager threads don’t have the application
jar available from their class loader, when it tries to look up the custom
Kryo registrator, this fails.  Spark then swallows this exception, which
results in a different ID number —> class mapping for this kryo instance,
and this then causes deserialisation errors later on a different node.

A related issue to the issue reported in SPARK-2878 is that Spark probably
shouldn’t swallow the ClassNotFound exception for custom Kryo registrators.
 The user has explicitly specified this class, and if it deterministically
can’t be found, then it may cause problems at serialisation /
deserialisation time.  If only sometimes it can’t be found (as in this
case), then it leads to a data corruption issue later on.  Either way,
we’re better off dying due to the ClassNotFound exception earlier, than the
weirder errors later on.

I have some ideas on potential solutions to this issue, but I’m keen for
experienced eyes to critique these approaches:

1. The simplest approach to fixing this would be to just make the
application jar available to the connection manager threads, but I’m
guessing it’s a design decision to isolate the application jar to just the
executor task runner threads.  Also, I don’t know if there are any other
threads that might be interacting with kryo serialisation / deserialisation.
2. Before looking up the custom Kryo registrator, change the thread’s class
loader to include the application jar, then restore the class loader after
the kryo registrator has been run.  I don’t know if this would have any
other side-effects.
3. Always serialise / deserialise on the existing TaskRunner threads,
rather than delaying serialisation until later, when it can be done only if
needed.  This approach would probably have negative performance
consequences.
4. Create a new dedicated thread pool for lazy serialisation /
deserialisation that has the application jar on the class path.
 Serialisation / deserialisation would be the only thing these threads do,
and this would minimise conflicts / interactions between the application
jar and other jars.

#4 sounds like the best approach to me, but I think would require
considerable knowledge of Spark internals, which is beyond me at present.
 Does anyone have any better (and ideally simpler) ideas?

Cheers,

Graham
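
[Editor's note: a minimal Scala sketch of option 2 from the list above, i.e. temporarily
pointing the current thread's context class loader at a loader that can see the
application jar while the registrator is instantiated, then restoring the original
loader. This is illustrative only, not Spark's actual code path; the names
applicationJarClassLoader and registratorClassName are placeholders.]

import com.esotericsoftware.kryo.Kryo
import org.apache.spark.serializer.KryoRegistrator

// Sketch only: swap the context class loader around the registrator lookup so that
// Class.forName can see classes shipped in the application jar, then restore it.
def runRegistratorWithAppClassLoader(
    kryo: Kryo,
    registratorClassName: String,
    applicationJarClassLoader: ClassLoader): Unit = {
  val previous = Thread.currentThread().getContextClassLoader
  try {
    Thread.currentThread().setContextClassLoader(applicationJarClassLoader)
    val registrator = Class
      .forName(registratorClassName, true, applicationJarClassLoader)
      .newInstance()
      .asInstanceOf[KryoRegistrator]
    // Fail loudly here rather than swallowing ClassNotFoundException, so the
    // ID number => class maps never silently diverge between nodes.
    registrator.registerClasses(kryo)
  } finally {
    Thread.currentThread().setContextClassLoader(previous)
  }
}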


Documentation confusing or incorrect for decision trees?

2014-08-06 Thread Matt Forbes
I found the section on ordering categorical features really interesting,
but the A, B, C example seemed inconsistent. Am I interpreting this passage
wrong, or are there typos? Aren't the split candidates A | C, B and A, C |
B ?

For example, for a binary classification problem with one categorical
feature with three categories A, B and C with corresponding proportion of
label 1 as 0.2, 0.6 and 0.4, the categorical features are ordered as A
followed by C followed B or A, B, C. The two split candidates are A | C, B
and A , B | C where | denotes the split.
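
[Editor's note: a tiny self-contained Scala sketch (not MLlib code) of the ordering the
quoted passage describes: sort the categories by their proportion of label 1, then the
candidate splits are the contiguous prefixes of that ordering.]

val proportionOfLabelOne = Map("A" -> 0.2, "B" -> 0.6, "C" -> 0.4)

// Order categories by their proportion of label 1: A (0.2), C (0.4), B (0.6)
val ordered = proportionOfLabelOne.toSeq.sortBy(_._2).map(_._1)
// ordered: A, C, B

// With K categories there are K - 1 contiguous split candidates
val splits = (1 until ordered.size).map(i => (ordered.take(i), ordered.drop(i)))
// splits: (A | C, B) and (A, C | B), matching Matt's reading rather than "A, B | C"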


Re: Low Level Kafka Consumer for Spark

2014-08-06 Thread Tathagata Das
Hi Dibyendu,

This is really awesome. I have yet to go through the code to understand
the details, but I want to do it really soon. In particular, I want to
understand the improvements over the existing Kafka receiver.

And it's fantastic to see such contributions from the community. :)

TD

On Tue, Aug 5, 2014 at 8:38 AM, Dibyendu Bhattacharya <
dibyendu.bhattach...@gmail.com> wrote:

> Hi
>
> This fault-tolerance aspect is already taken care of in the Kafka-Spark Consumer
> code, e.g. when the leader of a partition changes, in ZkCoordinator.java.
> Basically it refreshes the PartitionManagers every X seconds to make
> sure the partition details are correct and the consumer doesn't fail.
>
> Dib
>
>
> On Tue, Aug 5, 2014 at 8:01 PM, Shao, Saisai  wrote:
>
> > Hi,
> >
> > I think this is an awesome feature for the Spark Streaming Kafka interface:
> > offering users control over partition offsets means users can build more
> > applications on top of it.
> >
> > What concerns me is that if we want to do offset management, fault-tolerance
> > related control and so on, we have to take on the role that the current
> > ZookeeperConsumerConnector plays, and that would be a big area for us to take
> > care of; for example, when a node fails, how to hand the current partition over
> > to another consumer, among other things. I'm not sure; what are your thoughts?
> >
> > Thanks
> > Jerry
> >
> > From: Dibyendu Bhattacharya [mailto:dibyendu.bhattach...@gmail.com]
> > Sent: Tuesday, August 05, 2014 5:15 PM
> > To: Jonathan Hodges; dev@spark.apache.org
> > Cc: user
> > Subject: Re: Low Level Kafka Consumer for Spark
> >
> > Thanks Jonathan,
> >
> > Yes, until non-ZK-based offset management is available in Kafka, I need to
> > maintain the offsets in ZK. And yes, in both cases an explicit commit is
> > necessary. I modified the Low Level Kafka Spark Consumer a little bit to have
> > the Receiver spawn threads for every partition of the topic and perform the
> > 'store' operation in multiple threads. It would be good if the
> > receiver.store methods were made thread safe, which is not the case presently.
> >
> > Waiting for TD's comment on this Kafka Spark Low Level consumer.
> >
> >
> > Regards,
> > Dibyendu
> >
> >
> > On Tue, Aug 5, 2014 at 5:32 AM, Jonathan Hodges <hodg...@gmail.com> wrote:
> > Hi Yan,
> >
> > That is a good suggestion.  I believe non-Zookeeper offset management will
> > be a feature in the upcoming Kafka 0.8.2 release, tentatively scheduled for
> > September.
> >
> > https://cwiki.apache.org/confluence/display/KAFKA/Inbuilt+Consumer+Offset+Management
> >
> > That should make this fairly easy to implement, but it will still require
> > explicit offset commits to avoid data loss, which is different from the
> > current KafkaUtils implementation.
> >
> > Jonathan
> >
> >
> >
> >
> > On Mon, Aug 4, 2014 at 4:51 PM, Yan Fang <yanfang...@gmail.com> wrote:
> > Another suggestion that may help: you can consider using Kafka to
> > store the latest offset instead of Zookeeper. There are at least two
> > benefits: 1) it lowers the workload on ZK, and 2) it supports replay from a
> > certain offset. This is how Samza deals with the Kafka offset; the doc is here:
> >
> > http://samza.incubator.apache.org/learn/documentation/0.7.0/container/checkpointing.html
> >
> > Thank you.
> >
> > Cheers,
> >
> > Fang, Yan
> > yanfang...@gmail.com
> > +1 (206) 849-4108
> >
> > On Sun, Aug 3, 2014 at 8:59 PM, Patrick Wendell  wrote:
> > I'll let TD chime in on this one, but I'm guessing this would be a welcome
> > addition. It's great to see community effort on adding new
> > streams/receivers; adding a Java API for receivers was something we did
> > specifically to allow this :)
> >
> > - Patrick
> >
> > On Sat, Aug 2, 2014 at 10:09 AM, Dibyendu Bhattacharya <
> > dibyendu.bhattach...@gmail.com>
> > wrote:
> > Hi,
> >
> > I have implemented a Low Level Kafka Consumer for Spark Streaming using the
> > Kafka Simple Consumer API. This API gives better control over Kafka offset
> > management and recovery from failures. As the present Spark KafkaUtils uses
> > the HighLevel Kafka Consumer API, I wanted better control over the offset
> > management, which is not possible with the Kafka HighLevel consumer.
> >
> > This project is available in the repo below:
> >
> > https://github.com/dibbhatt/kafka-spark-consumer
> >
> > I have implemented a Custom Receiver, consumer.kafka.client.KafkaReceiver.
> > The KafkaReceiver uses the low level Kafka Consumer API (implemented in the
> > consumer.kafka packages) to fetch messages from Kafka and 'store' them in
> > Spark.
> >
> > The logic will detect the number of partitions for a topic and spawn that many
> > threads (individual instances of Consumers). The Kafka Consumer uses Zookeeper
> > for storing the latest offset for individual partitions, which will help to
> > recover in case of failure. T
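
[Editor's note: a rough Scala sketch of the receiver shape Dibyendu describes above,
with one fetch thread per partition, each handing data to Spark via store(). This is
not the code in the linked repo; fetchNextBatch() is a placeholder for the
SimpleConsumer fetch and ZK offset-commit logic, and the thread-safety concern about
receiver.store applies exactly where the comment marks it.]

import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

// Hypothetical sketch of a per-partition receiver; not the implementation in
// https://github.com/dibbhatt/kafka-spark-consumer
class PartitionedKafkaReceiver(topic: String, numPartitions: Int)
  extends Receiver[String](StorageLevel.MEMORY_AND_DISK_SER) {

  def onStart(): Unit = {
    (0 until numPartitions).foreach { partitionId =>
      new Thread(s"kafka-fetcher-$topic-$partitionId") {
        override def run(): Unit = {
          while (!isStopped()) {
            // Placeholder for the SimpleConsumer fetch + offset commit for this
            // partition; store() is called concurrently from every fetcher thread,
            // which is why thread safety of receiver.store matters here.
            fetchNextBatch(partitionId).foreach(record => store(record))
          }
        }
      }.start()
    }
  }

  def onStop(): Unit = {}  // fetcher threads exit once isStopped() returns true

  private def fetchNextBatch(partitionId: Int): Seq[String] = Seq.empty  // stub
}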

Re: compilation error in Catalyst module

2014-08-06 Thread Ted Yu
Forgot to do that step.

Now compilation passes.


On Wed, Aug 6, 2014 at 1:36 PM, Zongheng Yang  wrote:

> Hi Ted,
>
> By refreshing do you mean you have done 'mvn clean'?
>
> On Wed, Aug 6, 2014 at 1:17 PM, Ted Yu  wrote:
> > I refreshed my workspace.
> > I got the following error with this command:
> >
> > mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests install
> >
> > [ERROR] bad symbolic reference. A signature in package.class refers to
> term
> > scalalogging
> > in package com.typesafe which is not available.
> > It may be completely missing from the current classpath, or the version
> on
> > the classpath might be incompatible with the version used when compiling
> > package.class.
> > [ERROR]
> >
> /homes/hortonzy/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/package.scala:36:
> > bad symbolic reference. A signature in package.class refers to term slf4j
> > in value com.typesafe.scalalogging which is not available.
> > It may be completely missing from the current classpath, or the version
> on
> > the classpath might be incompatible with the version used when compiling
> > package.class.
> > [ERROR] package object trees extends Logging {
> > [ERROR]  ^
> > [ERROR] two errors found
> >
> > Has anyone else seen the above ?
> >
> > Thanks
>


Re: Building Spark in Eclipse Kepler

2014-08-06 Thread Sean Owen
I think your best bet by far is to consume the Maven build as-is from
within Eclipse. I wouldn't try to export a project config from the
build as there is plenty to get lost in translation.

Certainly this works well with IntelliJ, and by the by, if you have a
choice, I would strongly recommend IntelliJ over Eclipse for working
with Maven and Scala.

On Wed, Aug 6, 2014 at 8:29 PM, Ron Gonzalez
 wrote:
> Hi,
>   I'm trying to get the apache spark trunk compiling in my Eclipse, but I 
> can't seem to get it going. In particular, I've tried sbt/sbt eclipse, but it 
> doesn't seem to create the eclipse pieces for yarn and other projects. Doing 
> mvn eclipse:eclipse on yarn seems to fail as well as sbt/sbt eclipse just for 
> yarn fails. Is there some documentation available for eclipse? I've gone 
> through the ones on the site, but to no avail.
>   Any tips?
>
> Thanks,
> Ron




Re: compilation error in Catalyst module

2014-08-06 Thread Zongheng Yang
Hi Ted,

By refreshing do you mean you have done 'mvn clean'?

On Wed, Aug 6, 2014 at 1:17 PM, Ted Yu  wrote:
> I refreshed my workspace.
> I got the following error with this command:
>
> mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests install
>
> [ERROR] bad symbolic reference. A signature in package.class refers to term
> scalalogging
> in package com.typesafe which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling
> package.class.
> [ERROR]
> /homes/hortonzy/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/package.scala:36:
> bad symbolic reference. A signature in package.class refers to term slf4j
> in value com.typesafe.scalalogging which is not available.
> It may be completely missing from the current classpath, or the version on
> the classpath might be incompatible with the version used when compiling
> package.class.
> [ERROR] package object trees extends Logging {
> [ERROR]  ^
> [ERROR] two errors found
>
> Has anyone else seen the above ?
>
> Thanks




compilation error in Catalyst module

2014-08-06 Thread Ted Yu
I refreshed my workspace.
I got the following error with this command:

mvn -Pyarn -Phive -Phadoop-2.4 -DskipTests install

[ERROR] bad symbolic reference. A signature in package.class refers to term
scalalogging
in package com.typesafe which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
package.class.
[ERROR]
/homes/hortonzy/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/trees/package.scala:36:
bad symbolic reference. A signature in package.class refers to term slf4j
in value com.typesafe.scalalogging which is not available.
It may be completely missing from the current classpath, or the version on
the classpath might be incompatible with the version used when compiling
package.class.
[ERROR] package object trees extends Logging {
[ERROR]  ^
[ERROR] two errors found

Has anyone else seen the above ?

Thanks


Building Spark in Eclipse Kepler

2014-08-06 Thread Ron Gonzalez
Hi,
  I'm trying to get the apache spark trunk compiling in my Eclipse, but I can't 
seem to get it going. In particular, I've tried sbt/sbt eclipse, but it doesn't 
seem to create the eclipse pieces for yarn and other projects. Doing mvn 
eclipse:eclipse on yarn seems to fail as well as sbt/sbt eclipse just for yarn 
fails. Is there some documentation available for eclipse? I've gone through the 
ones on the site, but to no avail.
  Any tips?

Thanks,
Ron

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Debasish Das
I did not play with Hadoop settings...everything is compiled with
2.3.0-cdh5.0.2 for me...

I did try to bump the HBase version number from 0.94 to 0.96 or 0.98, but
there was no CDH profile in the pom...but that's unrelated to this!


On Wed, Aug 6, 2014 at 9:45 AM, DB Tsai  wrote:

> One related question: is the mllib jar independent of the hadoop version (i.e., it
> doesn't use the hadoop api directly)? Can I use an mllib jar compiled for one
> version of hadoop with another version of hadoop?
>
> Sent from my Google Nexus 5
> On Aug 6, 2014 8:29 AM, "Debasish Das"  wrote:
>
>> Hi Xiangrui,
>>
>> Maintaining another file will be a pain later so I deployed spark 1.0.1
>> without mllib and then my application jar bundles mllib 1.1.0-SNAPSHOT
>> along with the code changes for quadratic optimization...
>>
>> Later the plan is to patch the snapshot mllib with the deployed stable
>> mllib...
>>
>> There are 5 variants that I am experimenting with around 400M ratings
>> (daily data, monthly data I will update in few days)...
>>
>> 1. LS
>> 2. NNLS
>> 3. Quadratic with bounds
>> 4. Quadratic with L1
>> 5. Quadratic with equality and positivity
>>
>> Now the ALS 1.1.0 snapshot runs fine but after completion on this step
>> ALS.scala:311
>>
>> // Materialize usersOut and productsOut.
>> usersOut.count()
>>
>> I am getting from one of the executors: java.lang.ClassCastException:
>> scala.Tuple1 cannot be cast to scala.Product2
>>
>> I am debugging it further but I was wondering if this is due to RDD
>> compatibility within 1.0.1 and 1.1.0-SNAPSHOT ?
>>
>> I have built the jars on my Mac which has Java 1.7.0_55 but the deployed
>> cluster has Java 1.7.0_45.
>>
>> The flow runs fine on my localhost spark 1.0.1 with 1 worker. Can that
>> Java
>> version mismatch cause this ?
>>
>> Stack traces are below
>>
>> Thanks.
>> Deb
>>
>>
>> Executor stacktrace:
>>
>>
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)
>>
>>
>>
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
>>
>>
>>
>> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>>
>> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>>
>> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
>>
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>>
>> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>>
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>>
>>
>> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>>
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>>
>> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>>
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>>
>>
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
>>
>>
>>
>> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:123)
>>
>>
>>
>> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>>
>>
>>
>> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>>
>>
>> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>>
>>
>>
>> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>>
>> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:123)
>>
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>>
>> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>>
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>>
>>
>> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>>
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>>
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>>
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>>
>>
>>
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>>
>>
>> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>>
>> org.apache.spark.scheduler.Task.run(Task.scala:51)
>>
>>
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>>
>>
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>>
>>
>> jav

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread DB Tsai
One related question: is the mllib jar independent of the hadoop version (i.e., it
doesn't use the hadoop api directly)? Can I use an mllib jar compiled for one
version of hadoop with another version of hadoop?

Sent from my Google Nexus 5
On Aug 6, 2014 8:29 AM, "Debasish Das"  wrote:

> Hi Xiangrui,
>
> Maintaining another file will be a pain later so I deployed spark 1.0.1
> without mllib and then my application jar bundles mllib 1.1.0-SNAPSHOT
> along with the code changes for quadratic optimization...
>
> Later the plan is to patch the snapshot mllib with the deployed stable
> mllib...
>
> There are 5 variants that I am experimenting with around 400M ratings
> (daily data, monthly data I will update in few days)...
>
> 1. LS
> 2. NNLS
> 3. Quadratic with bounds
> 4. Quadratic with L1
> 5. Quadratic with equality and positivity
>
> Now the ALS 1.1.0 snapshot runs fine but after completion on this step
> ALS.scala:311
>
> // Materialize usersOut and productsOut.
> usersOut.count()
>
> I am getting from one of the executors: java.lang.ClassCastException:
> scala.Tuple1 cannot be cast to scala.Product2
>
> I am debugging it further but I was wondering if this is due to RDD
> compatibility within 1.0.1 and 1.1.0-SNAPSHOT ?
>
> I have built the jars on my Mac which has Java 1.7.0_55 but the deployed
> cluster has Java 1.7.0_45.
>
> The flow runs fine on my localhost spark 1.0.1 with 1 worker. Can that Java
> version mismatch cause this ?
>
> Stack traces are below
>
> Thanks.
> Deb
>
>
> Executor stacktrace:
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)
>
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
>
>
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
>
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:123)
>
>
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
>
>
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
>
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:123)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>
>
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>
> org.apache.spark.scheduler.Task.run(Task.scala:51)
>
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> java.lang.Thread.run(Thread.java:744)
>
> Driver stacktrace:
>
> at org.apache.spark.scheduler.DAGScheduler.org
>
> $apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>
> at
>
> org.apache.spark.scheduler.DAGScheduler$$anon

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Debasish Das
Ok...let me look into it a bit more, and most likely I will deploy Spark
v1.1 and then use the mllib 1.1 SNAPSHOT jar with it, so that we follow your
guideline of not running a newer spark component on an older version of spark
core...

That should solve this issue unless it is related to the Java versions.

I am also keen to see how the final recommendations differ between the L1 and
positivity variants...I will compute the metrics

Our plan is to use scalable matrix factorization as an engine to do
clustering, feature extraction, topic modeling and auto encoders (single
layer to start with). So these algorithms are not really constrained to
recommendation use-cases...



On Wed, Aug 6, 2014 at 9:09 AM, Xiangrui Meng  wrote:

> One thing I'd like to clarify is that we do not support running a newer
> version of a Spark component on top of an older version of Spark core.
> I don't remember any code change in MLlib that requires Spark v1.1, but
> I might have missed some PRs. There were changes to CoGroup, which may be
> relevant:
>
> https://github.com/apache/spark/commits/master/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala
>
> Btw, for the constrained optimization, I'm really interested in how
> they differ in the final recommendation. It would be great if you could
> test prec@k or ndcg@k metrics.
>
> Best,
> Xiangrui
>
> On Wed, Aug 6, 2014 at 8:28 AM, Debasish Das 
> wrote:
> > Hi Xiangrui,
> >
> > Maintaining another file will be a pain later so I deployed spark 1.0.1
> > without mllib and then my application jar bundles mllib 1.1.0-SNAPSHOT
> along
> > with the code changes for quadratic optimization...
> >
> > Later the plan is to patch the snapshot mllib with the deployed stable
> > mllib...
> >
> > There are 5 variants that I am experimenting with around 400M ratings
> (daily
> > data, monthly data I will update in few days)...
> >
> > 1. LS
> > 2. NNLS
> > 3. Quadratic with bounds
> > 4. Quadratic with L1
> > 5. Quadratic with equality and positivity
> >
> > Now the ALS 1.1.0 snapshot runs fine but after completion on this step
> > ALS.scala:311
> >
> > // Materialize usersOut and productsOut.
> > usersOut.count()
> >
> > I am getting from one of the executors: java.lang.ClassCastException:
> > scala.Tuple1 cannot be cast to scala.Product2
> >
> > I am debugging it further but I was wondering if this is due to RDD
> > compatibility within 1.0.1 and 1.1.0-SNAPSHOT ?
> >
> > I have built the jars on my Mac which has Java 1.7.0_55 but the deployed
> > cluster has Java 1.7.0_45.
> >
> > The flow runs fine on my localhost spark 1.0.1 with 1 worker. Can that
> Java
> > version mismatch cause this ?
> >
> > Stack traces are below
> >
> > Thanks.
> > Deb
> >
> >
> > Executor stacktrace:
> >
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)
> >
> >
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
> >
> >
> >
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >
> >
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >
> > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
> >
> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >
> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >
> >
> > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> >
> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >
> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >
> >
> >
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
> >
> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >
> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >
> >
> > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> >
> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >
> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >
> >
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
> >
> >
> >
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:123)
> >
> >
> >
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
> >
> >
> >
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> >
> >
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> >
> >
> >
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
> >
> > org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:123)
> >
> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >
> > org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> >
> >
> > org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
> >
> > org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> >
> > org.apache.spark.rdd.RDD.iterator

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Xiangrui Meng
One thing I'd like to clarify is that we do not support running a newer
version of a Spark component on top of an older version of Spark core.
I don't remember any code change in MLlib that requires Spark v1.1, but
I might have missed some PRs. There were changes to CoGroup, which may be
relevant:

https://github.com/apache/spark/commits/master/core/src/main/scala/org/apache/spark/rdd/CoGroupedRDD.scala

Btw, for the constrained optimization, I'm really interested in how
they differ in the final recommendation. It would be great if you could
test prec@k or ndcg@k metrics.

Best,
Xiangrui

On Wed, Aug 6, 2014 at 8:28 AM, Debasish Das  wrote:
> Hi Xiangrui,
>
> Maintaining another file will be a pain later so I deployed spark 1.0.1
> without mllib and then my application jar bundles mllib 1.1.0-SNAPSHOT along
> with the code changes for quadratic optimization...
>
> Later the plan is to patch the snapshot mllib with the deployed stable
> mllib...
>
> There are 5 variants that I am experimenting with around 400M ratings (daily
> data, monthly data I will update in few days)...
>
> 1. LS
> 2. NNLS
> 3. Quadratic with bounds
> 4. Quadratic with L1
> 5. Quadratic with equality and positivity
>
> Now the ALS 1.1.0 snapshot runs fine but after completion on this step
> ALS.scala:311
>
> // Materialize usersOut and productsOut.
> usersOut.count()
>
> I am getting from one of the executors: java.lang.ClassCastException:
> scala.Tuple1 cannot be cast to scala.Product2
>
> I am debugging it further but I was wondering if this is due to RDD
> compatibility within 1.0.1 and 1.1.0-SNAPSHOT ?
>
> I have built the jars on my Mac which has Java 1.7.0_55 but the deployed
> cluster has Java 1.7.0_45.
>
> The flow runs fine on my localhost spark 1.0.1 with 1 worker. Can that Java
> version mismatch cause this ?
>
> Stack traces are below
>
> Thanks.
> Deb
>
>
> Executor stacktrace:
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)
>
>
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>
> scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
>
>
> org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:123)
>
>
> scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
>
>
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>
> scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
>
>
> scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
>
> org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:123)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
>
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)
>
>
> org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)
>
> org.apache.spark.scheduler.Task.run(Task.scala:51)
>
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)
>
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>
> java.lang.Thread.run(Thread.
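
[Editor's note: as a side note on the metrics Xiangrui mentions, a tiny illustrative
Scala sketch of prec@k for a single user. This is not an MLlib API; dividing by k is
one common convention.]

// Fraction of the top-k recommended items that are in the relevant set.
def precisionAtK(recommended: Seq[Int], relevant: Set[Int], k: Int): Double =
  recommended.take(k).count(relevant.contains).toDouble / k

// Example: 2 of the top 3 recommendations are relevant, so prec@3 = 2.0 / 3
precisionAtK(recommended = Seq(10, 42, 7, 3), relevant = Set(42, 7, 99), k = 3)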

Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-06 Thread Debasish Das
Hi Xiangrui,

Maintaining another file will be a pain later so I deployed spark 1.0.1
without mllib and then my application jar bundles mllib 1.1.0-SNAPSHOT
along with the code changes for quadratic optimization...

Later the plan is to patch the snapshot mllib with the deployed stable
mllib...

There are 5 variants that I am experimenting with around 400M ratings
(daily data, monthly data I will update in few days)...

1. LS
2. NNLS
3. Quadratic with bounds
4. Quadratic with L1
5. Quadratic with equality and positivity

Now the ALS 1.1.0 snapshot runs fine but after completion on this step
ALS.scala:311

// Materialize usersOut and productsOut.
usersOut.count()

I am getting from one of the executors: java.lang.ClassCastException:
scala.Tuple1 cannot be cast to scala.Product2

I am debugging it further but I was wondering if this is due to RDD
compatibility within 1.0.1 and 1.1.0-SNAPSHOT ?

I have built the jars on my Mac which has Java 1.7.0_55 but the deployed
cluster has Java 1.7.0_45.

The flow runs fine on my localhost spark 1.0.1 with 1 worker. Can that Java
version mismatch cause this ?

Stack traces are below

Thanks.
Deb


Executor stacktrace:

org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:156)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:154)


scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:154)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)


org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:123)


scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)


scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)

scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)


scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)

org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:123)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)

org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

org.apache.spark.rdd.RDD.iterator(RDD.scala:229)


org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:158)


org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:99)

org.apache.spark.scheduler.Task.run(Task.scala:51)


org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:183)


java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)


java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

java.lang.Thread.run(Thread.java:744)

Driver stacktrace:

at org.apache.spark.scheduler.DAGScheduler.org
$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)

at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)

at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)

at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)

at scala.Option.foreach(Option.scala:236
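
[Editor's note: a minimal build.sbt sketch of the deployment Deb describes earlier in
the thread: Spark core is provided by the cluster, while the MLlib 1.1.0-SNAPSHOT jar
carrying the quadratic-optimization changes is bundled into the application jar. The
project name, Scala version, and the assumption that sbt-assembly builds the
application jar are illustrative, not taken from the thread.]

name := "als-quadratic-experiments"

scalaVersion := "2.10.4"

libraryDependencies ++= Seq(
  // the cluster already ships spark-core 1.0.1, so keep it out of the app jar
  "org.apache.spark" %% "spark-core" % "1.0.1" % "provided",
  // locally built MLlib snapshot; exclude its transitive spark-core so the
  // deployed 1.0.1 core is the only one on the classpath
  ("org.apache.spark" %% "spark-mllib" % "1.1.0-SNAPSHOT")
    .exclude("org.apache.spark", "spark-core_2.10")
)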