Until we get this sorted out, I suggest staying on Spark 1.0.1.

There are multiple problems when trying to use anything newer. So far I suspect 
that Spark and Mahout must be built in a very particular manner, and I haven't 
discovered quite what that is yet.

The error below is often caused by running a version of Spark that Mahout was 
not built against, causing the serialization class UIDs to not match. We've 
heard several reports of problems running the shell examples and the CLI on 
1.0.2 and 1.1.0.
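
A quick way to confirm this kind of mismatch is to compare the serialVersionUID
that the local classpath computes for org.apache.spark.rdd.RDD with the "stream
classdesc" UID printed in the error. A minimal sketch, assuming a Scala shell
bound to Spark where the usual sc SparkContext handle is available (any
SparkContext reference will do; nothing here is a prescribed Mahout API):

    import java.io.ObjectStreamClass

    // Which Spark version is this shell actually running against?
    println(s"Spark version: ${sc.version}")

    // serialVersionUID of the RDD class on the local (driver) classpath;
    // if this differs from the UID the executors report in the stack trace,
    // the driver and the workers are running different Spark builds.
    val uid = ObjectStreamClass.lookup(classOf[org.apache.spark.rdd.RDD[_]])
      .getSerialVersionUID
    println(s"Local RDD serialVersionUID: $uid")

If the two UIDs differ, the jars shipped by the Mahout shell and the Spark
installed on the workers were built from different Spark versions.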

I'll try to put together bulletproof build steps IF I can get it working.

In the meantime, thanks for any stack traces and build-process descriptions. If 
someone wants to create a JIRA covering all of these under one ticket, that 
would be fine.


On Oct 21, 2014, at 7:15 AM, Mahesh Balija <[email protected]> wrote:

Also, if I use any other version of Spark, there are incompatible method
signatures, due to which the Mahout Spark shell itself does NOT start.

On Tue, Oct 21, 2014 at 7:42 PM, Mahesh Balija <[email protected]>
wrote:

> Hi All,
> 
> Here are the errors I get when I run in pseudo-distributed mode,
> 
> Spark 1.0.2 and the latest Mahout code (clone)
> 
> When I run the command on the page
> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
> 
> val drmX = drmData(::, 0 until 4)
> 
> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
>       at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>       at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>       at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>       at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>       at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>       at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>       at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>       at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>       at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>       at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>       at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>       at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:701)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
>        java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>        java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>        java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>        java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>        java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>        org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>        org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>        java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>        java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>        java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>        java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>        org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>        org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>        org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>        java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>        java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>        java.lang.Thread.run(Thread.java:701)
> Driver stacktrace:
>       at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>       at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>       at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>       at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>       at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>       at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>       at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>       at scala.Option.foreach(Option.scala:236)
>       at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>       at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>       at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> 
> Best,
> Mahesh Balija.
> 
> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
> 
>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <[email protected]>
>> wrote:
>> 
>>> Is anyone else nervous about ignoring this issue, or about relying on
>>> non-build (hand-run), test-driven transitive-dependency checking? I hope
>>> someone else will chime in.
>>> 
>>> As to running unit tests on a TEST_MASTER, I'll look into it. Can we set
>>> up the build machine to do this? I'd feel better about eyeballing deps if
>>> we could have a TEST_MASTER automatically run during builds at Apache.
>>> Maybe the regular unit tests are OK for building locally ourselves.
>>> 
>>>> 
>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <[email protected]>
>>>> wrote:
>>>> 
>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <[email protected]>
>>>> wrote:
>>>> 
>>>>> Maybe a more fundamental issue is that we don't know for sure whether we
>>>>> have missing classes or not. The job.jar at least used the pom
>>>>> dependencies to guarantee every needed class was present. So the job.jar
>>>>> seems to solve the problem but may ship some unnecessary duplicate code,
>>>>> right?
>>>>> 
>>>> 
>>>> No, as I wrote, Spark doesn't work with the job-jar format. Neither, as
>>>> it turns out, does more recent Hadoop MR, btw.
>>> 
>>> Not speaking literally of the format. Spark understands jars, and Maven
>>> can build one from the transitive dependencies.
>>> 
>>>> 
>>>> Yes, this is A LOT of duplicate code (it will normally take MINUTES to
>>>> start up tasks, just on the time to copy all of it). This is absolutely
>>>> not the way to go with this.
>>>> 
>>> 
>>> A lack of any guarantee to load seems like a bigger problem than startup
>>> time. Clearly we can't just ignore this.
>>> 
>> 
>> Nope. Given the highly iterative nature and dynamic task allocation in this
>> environment, one is looking at effects similar to MapReduce. This is not the
>> only reason why I never go to MR anymore, but it's one of the main ones.
>> 
>> How about an experiment: why don't you create an assembly that copies ALL
>> transitive dependencies into one folder, and then try to broadcast it from a
>> single point (the front end) to, well... let's start with 20 machines. (Of
>> course we ideally want to get into the 10^3..10^4 range -- but why bother if
>> we can't do it for 20.)
>> 
>> Or, heck, let's simply try to parallel-copy it, 20 times, between two
>> machines that are not collocated on the same subnet.
>> 
>> 
>>>> 
>>>>> There may be any number of bugs waiting for the time we try running on a
>>>>> node machine that doesn't have some class in its classpath.
>>>> 
>>>> 
>>>> No. Assuming any given method is tested on all its execution paths, there
>>>> will be no bugs. Bugs of that sort will only appear if the user is using
>>>> algebra directly and, from the closure, calls something that is not on the
>>>> path. In which case our answer to this is the same as for the solver
>>>> methodology developers -- use a customized SparkConf while creating the
>>>> context to include the stuff you really want.
>>>> 
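
To make the "customized SparkConf" suggestion above concrete, here is a minimal
sketch of what that might look like in user code, written against the plain
Spark 1.x API only (the app name, master URL, and jar path are purely
illustrative, and this is not a claim about Mahout's own context helpers):

    import org.apache.spark.{SparkConf, SparkContext}

    // Ship an extra dependency jar to the executors so that classes called
    // from closures are present on the worker classpath.
    val conf = new SparkConf()
      .setAppName("algebra-job")                       // illustrative app name
      .setMaster("spark://master:7077")                // illustrative master URL
      .setJars(Seq("/path/to/extra-dependency.jar"))   // illustrative jar path

    val sc = new SparkContext(conf)

The same extra-jars idea applies when a distributed context is built from an
existing SparkConf; the point is only that the user, not the framework, decides
which additional jars get shipped.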
>>>> Also, another right answer to this is that we should probably provide a
>>>> reasonable toolset here. For example, all the stats stuff found in R base
>>>> and the R stat packages, so the user is not compelled to go non-native.
>>>> 
>>>> 
>>> 
>>> Huh? This is not true. The one I ran into was found by calling something
>>> in math from something in math-scala. It led outside, and you can encounter
>>> such things even in algebra. In fact, you have no idea whether these
>>> problems exist, except for the fact that you have used it a lot personally.
>>> 
>> 
>> 
>> You ran it with your own code that never existed before.
>> 
>> But there's a difference between released Mahout code (which is what you are
>> working on) and the user code. Released code must run through remote tests as
>> you suggested, and thus guarantee there are no such problems with
>> post-release code.
>> 
>> For users, we can only provide a way for them to load the stuff they decide
>> to use. We don't have a priori knowledge of what they will use. It is the
>> same thing that Spark does, and the same thing that MR does, isn't it?
>> 
>> Of course Mahout should rigorously drop the stuff it doesn't load from the
>> Scala scope. No argument about that. In fact, that's what I suggested as the
>> #1 solution. But there's not much to do here except to go dependency
>> cleansing for the math and spark code. Part of the reason there's so much is
>> that newer modules still bring in everything from mrLegacy.
>> 
>> You are right in saying it is hard to guess which other dependencies in the
>> util/legacy code are actually used. But that's not a justification for the
>> brute-force "copy them all" approach, which virtually guarantees
>> reintroducing one of the foremost legacy issues this work was intended to
>> address.
>> 
> 
> 
