Re: Hadoop's Configuration object isn't threadsafe

2014-07-16 Thread Andrew Ash
Sounds good -- I added comments to the ticket. Since SPARK-2521 is scheduled for the 1.1.0 release and we can work around the issue with spark.speculation, I don't personally see a need for a 1.0.2 backport. Thanks for looking through this issue!

Re: Hadoop's Configuration object isn't threadsafe

2014-07-16 Thread Patrick Wendell
Hey Andrew, I think you are correct, and a follow-up to SPARK-2521 will end up fixing this. The design of SPARK-2521 automatically broadcasts RDD data in tasks, and the approach creates a new copy of the RDD and associated data for each task. A natural follow-up to that patch is to stop handling the

Re: Hadoop's Configuration object isn't threadsafe

2014-07-16 Thread Andrew Ash
Hi Patrick, thanks for taking a look. I filed it as https://issues.apache.org/jira/browse/SPARK-2546 Would you recommend I pursue the cloned Configuration object approach now and send in a PR? Reynold's recent announcement of the broadcast RDD object patch may also have implications for the right pa
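For reference, the cloned-Configuration approach could look roughly like this (a hypothetical sketch, not the eventual patch; `confForTask` is my own name). Hadoop's `Configuration` has a copy constructor, so each task can be handed its own copy instead of sharing one mutable instance:

```scala
import org.apache.hadoop.conf.Configuration

object ConfCloneSketch {
  // One shared Configuration is unsafe to read and mutate from many task threads.
  val sharedConf = new Configuration()

  // Give each task its own copy via the copy constructor; mutations on the
  // clone never touch the shared instance, so the thread-safety race goes away.
  def confForTask(): Configuration = new Configuration(sharedConf)
}
```

The cost is an extra copy of the configuration per task, which is presumably what the broadcast work is meant to amortize.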

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Stephen Haberman
Wow. Great writeup. I keep tabs on several open source projects that we use heavily, and I'd be ecstatic if more major changes were this well/succinctly explained instead of the usual "just read the commit message/diff". - Stephen

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Reynold Xin
Yup - that is correct. Thanks for clarifying.

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Matei Zaharia
Hey Reynold, just to clarify, users will still have to manually broadcast objects that they want to use *across* operations (e.g. in multiple iterations of an algorithm, or multiple map functions, or stuff like that). But they won't have to broadcast something they only use once. Matei
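To make Matei's distinction concrete, here is a minimal sketch (my own example against the public Spark 1.x broadcast API; all names are mine): an object reused across several operations is still broadcast explicitly, while a value used in a single operation can just be captured by the closure:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object BroadcastAcrossOps {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setMaster("local[2]").setAppName("broadcast-sketch"))
    val lookup = Map(1 -> "a", 2 -> "b")

    // Reused across two separate operations: broadcast once so both
    // operations' tasks share the same per-executor copy.
    val bc = sc.broadcast(lookup)
    val r1 = sc.parallelize(Seq(1, 2)).map(k => bc.value(k)).collect()
    val r2 = sc.parallelize(Seq(2, 1)).map(k => bc.value(k)).collect()

    // Used in only one operation: with the task-broadcast change,
    // plain closure capture is fine.
    val once = sc.parallelize(Seq(1)).map(k => lookup(k)).collect()

    println((r1.toSeq, r2.toSeq, once.toSeq))
    sc.stop()
  }
}
```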

Re: small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Reynold Xin
Oops - the pull request should be https://github.com/apache/spark/pull/1452

small (yet major) change going in: broadcasting RDD to reduce task size

2014-07-16 Thread Reynold Xin
Hi Spark devs, I want to give you a heads up that I'm working on a small (but major) change to how task dispatching works. Currently (as of Spark 1.0.1), Spark sends the RDD object and closures, along with the task itself, to the executors using Akka. This is, however, inefficient because
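To see why shipping the closure with every task is costly, here is a self-contained illustration (my own sketch, not Spark code): a task whose closure captures a large object serializes to far more bytes than one carrying only a small broadcast handle, and that cost is paid once per task rather than once per executor:

```scala
import java.io.{ByteArrayOutputStream, ObjectOutputStream}

// Stand-in for a task whose closure captures a big object directly.
case class FatTask(table: Array[Int])
// Stand-in for a task that carries only a broadcast handle.
case class SlimTask(broadcastId: Long)

object TaskSizeSketch {
  // Java-serialize an object and report its size in bytes.
  def serializedSize(obj: AnyRef): Int = {
    val bytes = new ByteArrayOutputStream()
    val out = new ObjectOutputStream(bytes)
    out.writeObject(obj)
    out.close()
    bytes.size
  }

  def main(args: Array[String]): Unit = {
    val fat  = serializedSize(FatTask(Array.fill(100000)(42)))
    val slim = serializedSize(SlimTask(7L))
    // With N tasks in a stage, the fat version ships N copies of the table.
    println(s"fat task: $fat bytes, slim task: $slim bytes")
  }
}
```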

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Chester Chen
Looking further, the yarn and yarn-stable modules are both for the stable version of Yarn; that explains the compilation errors when using the 2.0.5-alpha version of Hadoop. The yarn-alpha module (although it is still in SparkBuild.scala) is no longer there in the sbt console: > projects [info] In file:/Users/c

Re: Does RDD checkpointing store the entire state in HDFS?

2014-07-16 Thread Tathagata Das
After every checkpointing interval, the latest state RDD is stored to HDFS in its entirety. Along with that, the series of DStream transformations that was set up with the streaming context is also stored into HDFS (the whole DAG of DStream objects is serialized and saved). TD
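TD's point -- that the full state, not a delta, is written at each checkpoint -- can be simulated with the updateStateByKey contract in plain Scala (my own sketch; no Spark APIs involved, and `nextState` is an invented helper standing in for one streaming batch):

```scala
object StateCheckpointSketch {
  // The updateStateByKey contract: new values for a key plus the previous
  // state produce the new state (here, a running sum per key).
  def updateFunc(newValues: Seq[Int], state: Option[Int]): Option[Int] =
    Some(newValues.sum + state.getOrElse(0))

  // Fold one batch of (key, value) pairs into the running state map.
  def nextState(state: Map[String, Int],
                batch: Seq[(String, Int)]): Map[String, Int] = {
    val grouped = batch.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }
    val touched = for ((k, vs) <- grouped; s <- updateFunc(vs, state.get(k)))
      yield k -> s
    state ++ touched
  }

  def main(args: Array[String]): Unit = {
    val s1 = nextState(Map.empty, Seq("a" -> 1, "b" -> 2, "a" -> 3))
    val s2 = nextState(s1, Seq("b" -> 1))
    // At a checkpoint after batch 2, ALL of s2 is written to HDFS --
    // including key "a", which did not change in that batch.
    println(s2)
  }
}
```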

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Chester Chen
Hmm, looks like a build script issue: I ran the command sbt/sbt clean yarn/test:compile, but the errors came from yarn-stable: [error] 40 errors found [error] (yarn-stable/compile:compile) Compilation failed Chester

Does RDD checkpointing store the entire state in HDFS?

2014-07-16 Thread Yan Fang
Hi guys, I am wondering how RDD checkpointing works in Spark Streaming. When I use updateStateByKey, does Spark store the entire state (at one point in time) into HDFS, or only put the transformation in

Re: Possible bug in ClientBase.scala?

2014-07-16 Thread Chester Chen
Hi Sandy, We do have some issues with this. The difference is between Yarn-Alpha and Yarn-Stable (I noticed that in the latest build, the module names have changed: yarn-alpha --> yarn, yarn --> yarn-stable). For example, in MRJobConfig.class, the field "DEFAULT_MAPREDUCE_APPLICATION_CLASSPAT

Re: [brainstorming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread andy petrella
Indeed, these two cases are tightly coupled (the first one is a special case of the second). Actually, these "outliers" could be handled by a dedicated function that I named outliersManager -- I was not so much inspired ^^, but we could name these outliers "outlaws" and thus the function would be

Re: [brainstorming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread Tathagata Das
I think it makes sense, though without a concrete implementation it's hard to be sure. Applying sorting to each RDD makes sense, but I can think of two kinds of fundamental problems. 1. How do you deal with ordering across RDD boundaries? Say two consecutive RDDs in the DStream
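The first problem -- ordering across RDD boundaries -- is easy to demonstrate in miniature (my own sketch): sorting each batch independently does not yield a globally sorted stream:

```scala
object BatchOrderingSketch {
  def main(args: Array[String]): Unit = {
    // Two consecutive "RDDs" (batches) in a stream, each sorted on its own.
    val batch1 = Seq(5, 1, 9).sorted // Seq(1, 5, 9)
    val batch2 = Seq(3, 7).sorted    // Seq(3, 7)

    // Concatenating per-batch sorted output is NOT globally sorted:
    val stream = batch1 ++ batch2    // 1, 5, 9, 3, 7
    println(stream)
    println(stream == stream.sorted) // false -- the boundary problem
  }
}
```

Fixing this would require either buffering across batches or accepting only per-batch ordering guarantees.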

Re: Resource allocations

2014-07-16 Thread Kay Ousterhout
Hi Karthik, The resourceOffer() method is invoked from a class implementing the SchedulerBackend interface; in the case of a standalone cluster, it's invoked from a CoarseGrainedSchedulerBackend (in the makeOffers() method). If you look in TaskSchedulerImpl.submitTasks(), it calls backend.reviveO
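Kay's call chain can be sketched with stubbed-out classes (a simplified illustration with made-up class bodies, NOT actual Spark source; only the method names mirror the real code):

```scala
import scala.collection.mutable.ArrayBuffer

// A free-resource offer from one executor, as in the real scheduler.
case class WorkerOffer(executorId: String, cores: Int)

trait SchedulerBackend { def reviveOffers(): Unit }

// Plays the role of TaskSchedulerImpl.
class TaskSchedulerSketch {
  var backend: SchedulerBackend = _
  val pending  = ArrayBuffer[String]()
  val launched = ArrayBuffer[String]()

  // submitTasks queues the tasks and pokes the backend ...
  def submitTasks(tasks: Seq[String]): Unit = {
    pending ++= tasks
    backend.reviveOffers()
  }

  // ... which calls back here with the cluster's free resources.
  def resourceOffers(offers: Seq[WorkerOffer]): Unit =
    for (offer <- offers; _ <- 1 to offer.cores if pending.nonEmpty)
      launched += s"${pending.remove(0)} on ${offer.executorId}"
}

// Plays the role of CoarseGrainedSchedulerBackend and its makeOffers().
class BackendSketch(scheduler: TaskSchedulerSketch) extends SchedulerBackend {
  def reviveOffers(): Unit = makeOffers()
  def makeOffers(): Unit =
    scheduler.resourceOffers(Seq(WorkerOffer("exec-1", 2)))
}

object SchedulerChainDemo {
  def main(args: Array[String]): Unit = {
    val sched = new TaskSchedulerSketch
    sched.backend = new BackendSketch(sched)
    sched.submitTasks(Seq("task-0", "task-1"))
    println(sched.launched) // both tasks placed on exec-1
  }
}
```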

Resource allocations

2014-07-16 Thread rapelly kartheek
Hi, I am trying to understand how resource allocation happens in Spark. I understand the resourceOffer method in the TaskScheduler. This method takes the locality factor into account while allocating resources. This resourceOffer method gets invoked by the corresponding cluster manager. I am working o

Re: [brainstorming] Generalization of DStream, a ContinuousRDD ?

2014-07-16 Thread andy petrella
Heya TD, Thanks for the detailed answer! Much appreciated. Regarding order among elements within an RDD, you're definitely right: it'd kill the parallelism and would require synchronization, which is completely avoided in a distributed env. That's why I won't push this constraint to the RDDs themselve

Re: on shark, is tachyon less efficient than memory_only cache strategy ?

2014-07-16 Thread qingyang li
Let me describe my scenario: -- I have 8 machines (24 cores, 16G memory per machine) in my Spark cluster and Tachyon cluster. On Tachyon, I created one table which contains 800M of data; when I run a query sql on Shark, it costs 2.43s, but when I create the same table on spark m