Using scala-2.11 when making changes to spark source
The dev/change-scala-version.sh 2.11 script modifies the pom.xml files in place across all of the modules. This is a git-visible change, so if we wish to make changes to Spark source in our own forks while developing with Scala 2.11, we would end up conflating those pom.xml updates with our own. One possible approach would be to add pom.xml to .gitignore, but I cannot get that to work: .gitignore is tricky. Suggestions appreciated.
Re: Using scala-2.11 when making changes to spark source
Maybe the Maven archetype plugin can be used for changing the Scala version: http://maven.apache.org/archetype/maven-archetype-plugin/ I played with it a little bit but didn't get far. FYI
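Another angle on the original problem: .gitignore only applies to untracked files, so adding the already-tracked pom.xml files to it has no effect. A possible workaround (a sketch, untested against a real fork) is git's skip-worktree bit, which hides local modifications to tracked files:

    # after running dev/change-scala-version.sh 2.11, hide the pom.xml edits from git
    git ls-files -m | grep 'pom.xml$' | xargs git update-index --skip-worktree

    # later, surface the pom.xml files again (e.g. before switching back to 2.10)
    git ls-files | grep 'pom.xml$' | xargs git update-index --no-skip-worktree

Note that git checkout and git pull may still refuse to proceed if an upstream change touches a skip-worktree file, so the bit has to be cleared before syncing with upstream.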
Re: RDD: Execution and Scheduling
Concerning answers 1 and 2:

1) How does Spark determine that a node is a "slow node", and how slow is that?

2) How does an RDD choose a location as a preferred location, and by which criteria?

Could you please also include links to the relevant source files for the two questions above?
Re: RDD: Execution and Scheduling
On Sun, Sep 20, 2015 at 3:58 PM, gsvic wrote:

> 1) How does Spark determine that a node is a "slow node", and how slow is that?

There are two cases here:

1. If a node is busy (e.g. all of its slots are already occupied), the scheduler cannot schedule anything on it. See the paper "Delay Scheduling: A Simple Technique for Achieving Locality and Fairness in Cluster Scheduling" for how locality-aware scheduling is done.

2. Within the same stage, if a task is slower than the other tasks, a copy of it can be launched speculatively in order to mitigate stragglers. Search for "speculation" in the code base to find out more.

> 2) How does an RDD choose a location as a preferred location, and by which criteria?

This is part of the RDD definition: the RDD interface itself defines locality. The Spark NSDI paper talks about this as well. Why don't you do a little bit of code reading yourself?
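To make both answers concrete: straggler mitigation is switched on through configuration, and preferred locations are just a method on the RDD interface. Below is a minimal, self-contained Scala sketch; the host names are made up for illustration, and the speculation values shown are simply the documented defaults:

    import org.apache.spark.{Partition, SparkConf, SparkContext, TaskContext}
    import org.apache.spark.rdd.RDD

    // Speculative execution: re-launch copies of tasks that run well behind
    // their siblings within the same stage (values shown are the defaults).
    val conf = new SparkConf()
      .setAppName("locality-sketch")
      .setMaster("local[2]")
      .set("spark.speculation", "true")
      .set("spark.speculation.multiplier", "1.5")  // slower than 1.5x the median
      .set("spark.speculation.quantile", "0.75")   // checked once 75% of tasks finish
    val sc = new SparkContext(conf)

    // A toy RDD whose partitions each report a preferred host. The scheduler
    // consults getPreferredLocations when placing each partition's task.
    class LocalityRDD(sc: SparkContext) extends RDD[Int](sc, Nil) {
      override def getPartitions: Array[Partition] =
        Array.tabulate[Partition](2)(i => new Partition { override def index: Int = i })

      override def getPreferredLocations(split: Partition): Seq[String] =
        Seq(s"host${split.index}.example.com")  // hypothetical host names

      override def compute(split: Partition, context: TaskContext): Iterator[Int] =
        Iterator(split.index)
    }

    println(new LocalityRDD(sc).collect().toSeq)
    sc.stop()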
Join operation on DStreams
Hi Spark Experts,

I'm trying to use join(otherStream, [numTasks]) on DStreams, and it requires being called on two DStreams of (K, V) and (K, W) pairs.

Usually on a plain RDD we could use keyBy(f) to build the (K, V) pairs; however, I could not find it on DStream.

My question is: what is the expected way to build (K, V) pairs on a DStream?

Thanks
Shawn
Re: Join operation on DStreams
stream.map(record => (keyFunction(record), record))

For future reference, this question should go to the user list, not the dev list.
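Expanding that one-liner into a runnable sketch: the record types, key function, and queue-backed input streams below are all made up for illustration, but the map-then-join pattern is exactly the one described above:

    import scala.collection.mutable
    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    case class Order(userId: String, amount: Double)  // hypothetical record types
    case class Click(userId: String, page: String)

    val conf = new SparkConf().setMaster("local[2]").setAppName("dstream-join-sketch")
    val ssc = new StreamingContext(conf, Seconds(1))

    // Queue-backed streams stand in for real sources (Kafka, sockets, ...).
    val orders = ssc.queueStream(mutable.Queue(ssc.sparkContext.makeRDD(Seq(Order("u1", 9.99)))))
    val clicks = ssc.queueStream(mutable.Queue(ssc.sparkContext.makeRDD(Seq(Click("u1", "/home")))))

    // The keyBy equivalent on DStreams: map each record to a (K, V) pair...
    val keyedOrders = orders.map(o => (o.userId, o))
    val keyedClicks = clicks.map(c => (c.userId, c))

    // ...and join on the key, yielding DStream[(String, (Order, Click))].
    keyedOrders.join(keyedClicks).print()

    ssc.start()
    ssc.awaitTerminationOrTimeout(5000)
    ssc.stop()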