Re: Ways to partition the RDD

2014-08-14 Thread ssb61
You can try something like this,

val kvRdd = sc.textFile("rawdata/").map( m => {
    val pfUser = m.split("\t", 2)
    (pfUser(0) -> pfUser(1))
  }).partitionBy(new org.apache.spark.HashPartitioner(8))

This gives you kvRdd with pageName as the key and userId as the value.

Thanks








Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks, will give that a try. 

I see the number of partitions requested is 8 (through HashPartitioner(8)).
If I have a 40-node cluster, what's the recommended number of partitions?






Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
First, I think you might have a misconception about partitioning. ALL RDDs
are partitioned (even if they consist of only a single partition). When reading
from HDFS, the number of partitions depends on how the data is stored in HDFS.
After data is shuffled (generally caused by operations like reduceByKey), the
number of partitions will be whatever is set as the default parallelism
(see https://spark.apache.org/docs/latest/configuration.html). Most of the
methods that cause a shuffle also let you specify a different number of
partitions as a parameter.

The number of partitions == the number of tasks. So generally you want enough
partitions to keep all your cores busy (what matters is the number of cores,
not the number of nodes).
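
For example, a rough sizing sketch (the executor and core counts are made-up
placeholders, not recommendations - plug in your own cluster shape):

val numExecutors = 40      // e.g. one executor per node (assumption)
val coresPerExecutor = 4   // whatever YARN allocates per executor (assumption)
// A common rule of thumb is 2-4 partitions per core
val targetPartitions = numExecutors * coresPerExecutor * 2
// someRdd stands in for whatever RDD you want spread across more tasks
val rebalanced = someRdd.repartition(targetPartitions)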

Also, to create a key you can use the map method to return a Tuple2 as
Santosh showed. You can also use keyBy.

If you just want to get the number of unique users, I would do something
like this:

val userIdColIndex: Int = ...
val pageIdColIndex: Int = ...
val inputPath: String = ...
// I'm assuming the user and page ID are both strings
val pageVisitorRdd: RDD[(String, String)] = sc.textFile(inputPath).map { record =>
   val colValues = record.split('\t')
   // You might want error handling in here - I'll assume all records are valid
   val userId = colValues(userIdColIndex)
   val pageId = colValues(pageIdColIndex)
   (pageId, userId) // Here's your key-value pair
}
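
For comparison, the keyBy route mentioned above would look roughly like this
(reusing the same column indexes; keyBy keeps the whole record as the value,
so mapValues trims it down to the user ID):

val pageVisitorRdd2: RDD[(String, String)] = sc.textFile(inputPath)
   .keyBy(record => record.split('\t')(pageIdColIndex))
   .mapValues(record => record.split('\t')(userIdColIndex))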

So now that you have your pageId -> userId mappings, what to do with them?
Maybe the most obvious would be:

val uniquePageVisits: RDD[(String, Int)] =
   pageVisitorRdd.groupByKey().mapValues(_.toSet.size)

But groupByKey will be a bad choice if you have many visits per page
(you'll end up with a large collection in each record). It might be better
to start with a distinct, then map to (pageId, 1) and reduceByKey(_+_) to
get the sums.
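
A minimal sketch of that alternative (same pageVisitorRdd as above, same result):

val uniquePageVisits: RDD[(String, Int)] = pageVisitorRdd
   .distinct()                                // drop repeat visits by the same user
   .map { case (pageId, _) => (pageId, 1) }   // one count per unique (page, user) pair
   .reduceByKey(_ + _)                        // sum the counts per page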

I hope that helps.


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Ways to partition the RDD

2014-08-14 Thread bdev
Thanks Daniel for the detailed information. Since the RDD is already
partitioned, there is no need to worry about repartitioning. 






Re: Ways to partition the RDD

2014-08-14 Thread Daniel Siegmann
There may be cases where you want to adjust the number of partitions or
explicitly call RDD.repartition or RDD.coalesce. However, I would start
with the defaults and then adjust if necessary to improve performance (for
example, if cores are idling because there aren't enough tasks you may want
more partitions).
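
For reference, both calls look like this (the partition counts are arbitrary;
someRdd stands in for any existing RDD):

val wider = someRdd.repartition(160)   // full shuffle up to the new partition count
val narrower = someRdd.coalesce(40)    // shrink partition count, avoiding a full shuffle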

Looks like PairRDDFunctions has a countByKey (though you'd still need to
distinct first). You might also look at combineByKey and foldByKey as
alternatives to reduceByKey or groupByKey.
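
Roughly, with the same pageId/userId pairs as before (note countByKey returns a
Map to the driver, so it only suits a modest number of distinct pages):

val visitCounts: scala.collection.Map[String, Long] =
   pageVisitorRdd.distinct().countByKey()

// foldByKey as a distributed alternative (like reduceByKey with a zero value)
val visitCountsRdd: RDD[(String, Long)] = pageVisitorRdd
   .distinct()
   .map { case (pageId, _) => (pageId, 1L) }
   .foldByKey(0L)(_ + _)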


-- 
Daniel Siegmann, Software Developer
Velos
Accelerating Machine Learning

440 NINTH AVENUE, 11TH FLOOR, NEW YORK, NY 10001
E: daniel.siegm...@velos.io W: www.velos.io


Re: Ways to partition the RDD

2014-08-13 Thread bdev
Forgot to mention: I'm using Spark 1.0.0 and running against a 40-node
yarn-cluster.


