Re: Upgrade to Spark 1.1.0?

Mahesh Balija Tue, 21 Oct 2014 07:16:36 -0700

Also, if I use any other version of Spark, the method signatures are
incompatible and the Mahout Spark shell itself does NOT start.
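
For anyone reading the trace quoted below: java.io.InvalidClassException
with "local class incompatible" is what Java serialization reports when the
serialVersionUID written by the sender differs from the one the receiving JVM
has for the same class. With Spark this usually means the driver and the
executors loaded different builds of org.apache.spark.rdd.RDD. A minimal
sketch for printing the UID a given JVM sees is below; the object and method
names (SerialUidCheck, serialUid) are made up for illustration, and
java.lang.String stands in for RDD so the snippet runs without Spark on the
classpath.

```scala
import java.io.ObjectStreamClass

object SerialUidCheck {
  // Returns the serialVersionUID this JVM associates with the class,
  // or None when the class is not Serializable (lookup returns null).
  def serialUid(cls: Class[_]): Option[Long] =
    Option(ObjectStreamClass.lookup(cls)).map(_.getSerialVersionUID)

  def main(args: Array[String]): Unit = {
    // Stand-in class so the sketch is self-contained. In the Mahout shell
    // one would pass classOf[org.apache.spark.rdd.RDD[_]] instead (assumption:
    // Spark is on the classpath there).
    println(serialUid(classOf[String]))
    // java.lang.Object is not Serializable, so this prints None.
    println(serialUid(classOf[Object]))
  }
}
```

Running the same check on the driver side and on a worker and comparing the
two numbers would show whether both sides really use the same Spark build.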


On Tue, Oct 21, 2014 at 7:42 PM, Mahesh Balija <[email protected]>
wrote:

> Hi All,
>
> Here are the errors I get when I run in pseudo-distributed mode,
>
> Spark 1.0.2 and Mahout latest code (Clone)
>
> When I run the command from the page,
> https://mahout.apache.org/users/sparkbindings/play-with-shell.html
>
> val drmX = drmData(::, 0 until 4)
>
> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class 
> incompatible: stream classdesc serialVersionUID = 385418487991259089, local 
> class serialVersionUID = -6766554341038829528
>       at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>       at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>       at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>       at 
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>       at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>       at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>       at 
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>       at 
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>       at 
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>       at 
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>       at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>       at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>       at 
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>       at 
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>       at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>       at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>       at java.lang.Thread.run(Thread.java:701)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 
> failed 4 times, most recent failure: Exception failure in TID 6 on host 
> mahesh-VirtualBox.local: java.io.InvalidClassException: 
> org.apache.spark.rdd.RDD; local class incompatible: stream classdesc 
> serialVersionUID = 385418487991259089, local class serialVersionUID = 
> -6766554341038829528
>         java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
>         
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>         java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>         
> java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
>         java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
>         
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>         java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>         
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>         
> org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
>         
> org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
>         
> java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
>         
> java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
>         java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
>         java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
>         
> org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
>         
> org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
>         
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
>         
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:701)
> Driver stacktrace:
>       at 
> org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
>       at 
> scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>       at 
> org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>       at 
> org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
>       at scala.Option.foreach(Option.scala:236)
>       at 
> org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
>       at 
> org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
>       at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
>       at akka.actor.ActorCell.invoke(ActorCell.scala:456)
>       at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
>       at akka.dispatch.Mailbox.run(Mailbox.scala:219)
>       at 
> akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
>       at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
>       at 
> scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
>       at 
> scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
>
> Best,
> Mahesh Balija.
>
> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <[email protected]>
>> wrote:
>>
>> > Is anyone else nervous about ignoring this issue or relying on non-build
>> > (hand-run) test-driven transitive dependency checking? I hope someone
>> > else will chime in.
>> >
>> > As to running unit tests on a TEST_MASTER, I’ll look into it. Can we set
>> > up the build machine to do this? I’d feel better about eyeballing deps if
>> > we could have a TEST_MASTER automatically run during builds at Apache.
>> > Maybe the regular unit tests are OK for building locally ourselves.
>> >
>> > >
>> > > On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <[email protected]>
>> > > wrote:
>> > >
>> > > On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <[email protected]>
>> > > wrote:
>> > >
>> > >> Maybe a more fundamental issue is that we don’t know for sure whether
>> > >> we have missing classes or not. The job.jar at least used the pom
>> > >> dependencies to guarantee every needed class was present. So the
>> > >> job.jar seems to solve the problem but may ship some unnecessary
>> > >> duplicate code, right?
>> > >>
>> > >
>> > > No, as I wrote, Spark doesn't work with the job jar format. Neither, as
>> > > it turns out, does more recent Hadoop MR, btw.
>> >
>> > Not speaking literally of the format. Spark understands jars, and Maven
>> > can build one from transitive dependencies.
>> >
>> > >
>> > > Yes, this is A LOT of duplicate code (it will normally take MINUTES to
>> > > start up tasks with all of it, just on copy time). This is absolutely
>> > > not the way to go with this.
>> > >
>> >
>> > Lack of guarantee to load seems like a bigger problem than startup time.
>> > Clearly we can’t just ignore this.
>> >
>>
>> Nope. Given the highly iterative nature and dynamic task allocation in this
>> environment, one is looking at effects similar to MapReduce. This is not
>> the only reason why I never go to MR anymore, but it's one of the main ones.
>>
>> How about an experiment: why don't you create an assembly that copies ALL
>> transitive dependencies into one folder, and then try to broadcast it from
>> a single point (front end) to, well... let's start with 20 machines. (Of
>> course we ideally want to get into the 10^3..10^4 range, but why bother if
>> we can't do it for 20.)
>>
>> Or, heck, let's try to simply parallel-copy it 20 times between two
>> machines that are not collocated on the same subnet.
>>
>>
>> > >
>> > >> There may be any number of bugs waiting for the time we try running
>> > >> on a node machine that doesn’t have some class in its classpath.
>> > >
>> > >
>> > > No. Assuming any given method is tested on all its execution paths,
>> > > there will be no bugs. Bugs of that sort will only appear if the user
>> > > is using algebra directly and calls, from the closure, something that
>> > > is not on the path. In that case our answer is the same as for the
>> > > solver methodology developers: use a customized SparkConf while
>> > > creating the context to include the stuff you really want.
>> > >
>> > > Another right answer to this is that we probably should provide a
>> > > reasonable toolset here. For example, all the stats stuff found in the
>> > > R base and R stat packages, so the user is not compelled to go
>> > > non-native.
>> > >
>> > >
>> >
>> > Huh? This is not true. The one I ran into was found by calling something
>> > in math from something in math-scala. It led outside, and you can
>> > encounter such things even in algebra. In fact you have no idea whether
>> > these problems exist except for the fact that you have used it a lot
>> > personally.
>> >
>>
>>
>> You ran it with your own code that never existed before.
>>
>> But there's a difference between released Mahout code (which is what you are
>> working on) and user code. Released code must run through remote tests, as
>> you suggested, and thus guarantee there are no such problems with
>> post-release code.
>>
>> For users, we can only provide a way for them to load the stuff that they
>> decide to use. We don't have a priori knowledge of what they will use. It is
>> the same thing that Spark does, and the same thing that MR does, isn't
>> it?
>>
>> Of course Mahout should rigorously drop the stuff it doesn't load from the
>> Scala scope. No argument about that. In fact that's what I suggested as the
>> #1 solution. But there's not much to do here other than dependency
>> cleansing for the math and spark code. Part of the reason there's so much is
>> that newer modules still bring in everything from mrLegacy.
>>
>> You are right in saying it is hard to guess what other dependencies in
>> the util/legacy code are actually used, but that's not a justification
>> for a brute-force "copy them all" approach that virtually guarantees
>> reintroducing one of the foremost legacy issues this work intended to
>> address.
>>
>
>
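
As a rough way to reason about the copy-time concern in the quoted
discussion above, here is a back-of-the-envelope sketch. It assumes the
single source's uplink is the only bottleneck and that every worker needs its
own full copy. The artifact size, worker count, and bandwidth below are
made-up illustrative numbers, not measurements from Mahout or Spark, and
CopyTimeEstimate / secondsToDistribute are hypothetical names.

```scala
object CopyTimeEstimate {
  // Seconds to push one artifact from a single source to every worker,
  // assuming the source uplink is the only limit (no peer-to-peer sharing).
  def secondsToDistribute(artifactMB: Double, workers: Int, uplinkMBperSec: Double): Double =
    artifactMB * workers / uplinkMBperSec

  def main(args: Array[String]): Unit = {
    // Hypothetical inputs: a 200 MB all-dependencies assembly and a
    // 100 MB/s uplink, for 20 workers and then for 1000 workers.
    println(secondsToDistribute(200.0, 20, 100.0))
    println(secondsToDistribute(200.0, 1000, 100.0))
  }
}
```

Under this simple model the cost grows linearly with the number of workers,
so an assembly that is tolerable for 20 machines need not be for 10^3.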
