Using scala-2.11 when making changes to spark source

2015-09-20 Thread Stephen Boesch
The dev/change-scala-version.sh 2.11 script modifies the pom.xml files
across all of the modules in place. This is a git-visible change, so if we
make changes to the Spark source in our own forks while developing with
Scala 2.11, we end up conflating those pom.xml updates with our own.

A possible approach would be to add pom.xml to .gitignore. However, I
cannot get that to work: .gitignore is tricky here (it does not apply to
files that are already tracked).

Suggestions appreciated.


Re: Using scala-2.11 when making changes to spark source

2015-09-20 Thread Ted Yu
Maybe the following can be used for changing the Scala version:
http://maven.apache.org/archetype/maven-archetype-plugin/

I played with it a little bit but didn't get far.

FYI

On Sun, Sep 20, 2015 at 6:18 AM, Stephen Boesch  wrote:

>
> The dev/change-scala-version.sh 2.11 script modifies the pom.xml files
> across all of the modules in place. This is a git-visible change, so if we
> make changes to the Spark source in our own forks while developing with
> Scala 2.11, we end up conflating those pom.xml updates with our own.
>
> A possible approach would be to add pom.xml to .gitignore. However, I
> cannot get that to work: .gitignore is tricky here (it does not apply to
> files that are already tracked).
>
> Suggestions appreciated.
>


Re: RDD: Execution and Scheduling

2015-09-20 Thread gsvic
Concerning answers 1 and 2:

1) How does Spark determine that a node is a "slow node", and how slow is that?

2) How does an RDD choose a location as a preferred location, and by which
criteria?

Could you please also include links to the relevant source files for the
two questions above?






Re: RDD: Execution and Scheduling

2015-09-20 Thread Reynold Xin
On Sun, Sep 20, 2015 at 3:58 PM, gsvic  wrote:

> Concerning answers 1 and 2:
>
> 1) How does Spark determine that a node is a "slow node", and how slow is that?
>

There are two cases here:

1. If a node is busy (e.g. all of its slots are already occupied), the
scheduler cannot schedule anything on it. See the "Delay Scheduling: A Simple
Technique for Achieving Locality and Fairness in Cluster Scheduling" paper
for how locality-aware scheduling is done.

2. Within the same stage, if a task is slower than other tasks, a copy of
it can be launched speculatively in order to mitigate stragglers. Search
for speculation in the code base to find out more.
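
To make both points concrete, here is a minimal sketch of the standard
configuration knobs involved: spark.locality.wait for delay scheduling, and
spark.speculation and friends for speculative execution. The values and app
name are illustrative only; the scheduler logic itself lives mainly in
TaskSetManager under core/src/main/scala/org/apache/spark/scheduler/.

import org.apache.spark.{SparkConf, SparkContext}

// Illustrative values only -- tune for your own cluster.
val conf = new SparkConf()
  .setAppName("locality-and-speculation-sketch")
  .setMaster("local[*]")                      // just so the sketch runs locally
  // Delay scheduling: how long a task waits for a free slot at a better
  // locality level before falling back to a less local one.
  .set("spark.locality.wait", "3s")
  // Speculative execution: re-launch copies of tasks that run much slower
  // than the median task in the same stage.
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100ms") // how often to check for stragglers
  .set("spark.speculation.multiplier", "1.5") // "slow" = 1.5x the median task time
  .set("spark.speculation.quantile", "0.75")  // only after 75% of tasks have finished

val sc = new SparkContext(conf)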



> 2) How does an RDD choose a location as a preferred location, and by which
> criteria?
>

This is part of the RDD definition. The RDD interface itself defines
locality. The Spark NSDI paper already talks about this.

Why don't you just do a little bit of code reading yourself?
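
As a starting point for that reading: the hook is RDD.getPreferredLocations,
defined in core/src/main/scala/org/apache/spark/rdd/RDD.scala. Each RDD
reports, per partition, the hosts on which that partition's data lives, and
the scheduler tries to honor those preferences. Below is a minimal sketch of
a custom RDD that pins each partition to a host; the class names and hosts
are made up for illustration.

import org.apache.spark.{Partition, SparkContext, TaskContext}
import org.apache.spark.rdd.RDD

// A toy partition that remembers which (hypothetical) host holds its data.
class HostPartition(val index: Int, val host: String) extends Partition

// A minimal custom RDD whose only purpose is to expose preferred locations.
class PinnedRDD(sc: SparkContext, hosts: Seq[String]) extends RDD[Int](sc, Nil) {

  override protected def getPartitions: Array[Partition] =
    hosts.zipWithIndex.map { case (h, i) => new HostPartition(i, h) }.toArray

  // The task scheduler consults this when deciding where to launch each task.
  override protected def getPreferredLocations(split: Partition): Seq[String] =
    Seq(split.asInstanceOf[HostPartition].host)

  override def compute(split: Partition, context: TaskContext): Iterator[Int] =
    Iterator(split.index)
}

Running any action on such an RDD makes the scheduler prefer those hosts,
subject to the delay-scheduling policy described above.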



>
> Could you please also include links to the relevant source files for the
> two questions above?


Join operation on DStreams

2015-09-20 Thread guoxu1231
Hi Spark Experts, 

I'm trying to use join(otherStream, [numTasks]) on DStreams, and it needs to
be called on two DStreams of (K, V) and (K, W) pairs.

With a plain RDD we can usually use keyBy(f) to build the (K, V) pair;
however, I could not find an equivalent on DStream.

My question is:
What is the expected way to build a (K, V) pair DStream?


Thanks
Shawn







Re: Join operation on DStreams

2015-09-20 Thread Reynold Xin
stream.map(record => (keyFunction(record), record))
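
Spelled out a bit more, a minimal sketch (the case classes and key choice are
made up for illustration; the join itself is the standard pair-DStream join
from PairDStreamFunctions):

import org.apache.spark.streaming.dstream.DStream

case class Event(id: String, payload: String)
case class Profile(id: String, name: String)

def joinById(events: DStream[Event],
             profiles: DStream[Profile]): DStream[(String, (Event, Profile))] = {
  // DStream has no keyBy, so build the (K, V) pairs with map...
  val keyedEvents   = events.map(e => (e.id, e))
  val keyedProfiles = profiles.map(p => (p.id, p))
  // ...and join the two pair DStreams; this happens per batch on the
  // underlying RDDs.
  keyedEvents.join(keyedProfiles)
}

On recent Spark versions the pair-DStream implicits are picked up
automatically; on older ones you may also need
import org.apache.spark.streaming.StreamingContext._.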


For future reference, this question should go to the user list, not the dev
list.


On Sun, Sep 20, 2015 at 11:47 PM, guoxu1231  wrote:

> Hi Spark Experts,
>
> I'm trying to use join(otherStream, [numTasks]) on DStreams, and it needs
> to be called on two DStreams of (K, V) and (K, W) pairs.
>
> With a plain RDD we can usually use keyBy(f) to build the (K, V) pair;
> however, I could not find an equivalent on DStream.
>
> My question is:
> What is the expected way to build a (K, V) pair DStream?
>
>
> Thanks
> Shawn