RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks Imran for the very detailed explanations and options. I think T-Digest 
is what I want for now. 


RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks Kohler, that's a very interesting approach. I have never used Spark SQL 
and am not sure whether my cluster is configured well for it, but I will 
definitely give it a try. 


Re: Percentile example

2015-02-17 Thread Imran Rashid
(trying to repost to the list w/out URLs -- rejected as spam earlier)

Hi,

Using take() is not a good idea; as you have noted, it will pull a lot of
data down to the driver, so it's not scalable.  Here are some more scalable
alternatives:

1. Approximate solutions

1a. Sample the data.  Just sample some of the data to the driver, sort that
data in memory, and take the 66th percentile of that sample.
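A minimal sketch of 1a in Java, assuming your values sit in an existing
JavaRDD<Double> named data (the name and the sample size are illustrative):

import java.util.Collections;
import java.util.List;

// Pull a fixed-size sample down to the driver (without replacement).
List<Double> sample = data.takeSample(false, 10000);
// Sort the sample in driver memory and index into it.
Collections.sort(sample);
double approxP66 = sample.get((int) (0.66 * (sample.size() - 1)));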

1b.  Make a histogram with pre-determined buckets.  E.g., if you know your
data ranges from 0 to 1 and is uniform-ish, you could make buckets every
0.01.  Then count how many data points go into each bucket.  Or if you only
care about relative error and you have integers (often the case if your
data is counts), then you can span the full range of integers with a
relatively small number of buckets.  E.g., you only need 200 buckets for 5%
error.  See the Histogram class in Twitter's Ostrich library.
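A sketch of 1b, assuming the values form a JavaDoubleRDD named data and are
known to lie in [0, 1] (names are illustrative):

// 101 boundaries give 100 buckets of width 0.01 over [0, 1].
double[] buckets = new double[101];
for (int i = 0; i <= 100; i++) buckets[i] = i / 100.0;
long[] counts = data.histogram(buckets);  // one distributed pass

// Walk the cumulative counts to find the bucket holding the 66th percentile.
long total = 0;
for (long c : counts) total += c;
long seen = 0;
for (int i = 0; i < counts.length; i++) {
  seen += counts[i];
  if (seen >= 0.66 * total) {
    System.out.println("p66 lies in [" + buckets[i] + ", " + buckets[i + 1] + ")");
    break;
  }
}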

The problem is, if you have no idea what the distribution of your data is,
it's very hard to come up with good buckets; you could have an arbitrary
amount of data going to one bucket, and thus tons of error.

1c.  Use a TDigest, a compact, scalable data structure for approximating
distributions that performs reasonably across a wide range of
distributions.  You would make one TDigest for each partition (with
mapPartitions), and then merge all of the TDigests together.  I wrote up a
little more detail on this earlier; you can search the spark-user list on
Nabble for tdigest.
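A sketch of 1c, assuming the com.tdunning:t-digest library and the same
JavaRDD<Double> data as above. The factory and merge calls vary across
t-digest versions, so treat them as assumptions to verify; likewise, newer
Spark Java APIs expect the mapPartitions function to return an Iterator,
while 1.x expected an Iterable (return the List itself there):

import java.util.Collections;
import com.tdunning.math.stats.TDigest;
import org.apache.spark.api.java.JavaRDD;

// Build one digest per partition; each partition emits a single TDigest.
JavaRDD<TDigest> digests = data.mapPartitions(iter -> {
  TDigest digest = TDigest.createDigest(100);  // 100 = compression parameter
  while (iter.hasNext()) digest.add(iter.next());
  return Collections.singletonList(digest).iterator();
});

// Merge the per-partition digests on the driver and query the quantile.
TDigest merged = TDigest.createDigest(100);
for (TDigest d : digests.collect()) merged.add(d);  // merge method: check your t-digest version
double approxP66 = merged.quantile(0.66);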

2. Exact solutions.  There are also a few options here, but I'll give one
that is a variant of what you suggested.  Start out by doing a sortByKey.
Then figure out how many records you have in each partition (with
mapPartitions).  Figure out which partition the 66th percentile would be
in.  Then just read the one partition you want, and go down to the Nth
record in that partition.
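A sketch of the bookkeeping, again assuming a JavaRDD<Double> named data
(all names illustrative); reading the chosen partition is sketched after
the next paragraph:

import java.util.Collections;
import java.util.List;
import org.apache.spark.api.java.JavaPairRDD;
import scala.Tuple2;

// Sort by the value itself.
JavaPairRDD<Double, Long> sorted =
    data.mapToPair(x -> new Tuple2<>(x, 1L)).sortByKey();

// Emit one (partitionIndex, recordCount) pair per partition.
List<Tuple2<Integer, Long>> counts = sorted.mapPartitionsWithIndex(
    (idx, iter) -> {
      long n = 0;
      while (iter.hasNext()) { iter.next(); n++; }
      return Collections.singletonList(new Tuple2<>(idx, n)).iterator();
    }, true).collect();

// Locate the partition holding the 66th-percentile record, and the offset within it.
long total = 0;
for (Tuple2<Integer, Long> t : counts) total += t._2();
long target = (long) (0.66 * total);
int targetPartition = -1;
long offset = target;
for (Tuple2<Integer, Long> t : counts) {
  if (offset < t._2()) { targetPartition = t._1(); break; }
  offset -= t._2();
}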

To read the one partition you want, you can either (a) use
mapPartitionsWithIndex, and just ignore every partition that isn't the one
you want, or (b) use PartitionPruningRDD.  PartitionPruningRDD will avoid
launching empty tasks on the other partitions, so it will be slightly more
efficient, but it's also a developer API, so perhaps not worth going to that
level of detail.
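Continuing the sketch above, option (a) looks like this: every partition
except the target returns an empty iterator, and the target partition is
walked down to the offset computed earlier:

// Read only the target partition and step to the Nth record.
final int p = targetPartition;
final long k = offset;
List<Double> hit = sorted.keys().mapPartitionsWithIndex(
    (idx, iter) -> {
      if (idx != p) return Collections.<Double>emptyIterator();
      for (long i = 0; i < k; i++) iter.next();  // skip the records before the target
      return Collections.singletonList(iter.next()).iterator();
    }, true).collect();
double exactP66 = hit.get(0);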

Note that internally, sortByKey will sample your data to get an approximate
distribution, to figure out what data to put in each partition.  However,
you're still getting an exact answer this way -- the approximation is only
important for distributing work among all the executors.  Even if the
approximation is inaccurate, you'll still correct for it; you will just
have unequal partitions.

Imran







Re: Percentile example

2015-02-17 Thread Kohler, Curt E (ELS-STL)


The best approach I've found to calculate percentiles in Spark is to leverage 
Spark SQL.  If you use the Hive Query Language support, you can use the UDAFs 
for percentiles (as of Spark 1.2).

Something like this (note: syntax is not guaranteed to run, but it should give 
you the gist of what you need to do):


// Spark 1.2-era imports (an assumption -- adjust to your Spark version):
import java.util.List;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.api.java.JavaSchemaRDD;
import org.apache.spark.sql.api.java.Row;
import org.apache.spark.sql.hive.api.java.JavaHiveContext;

JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaHiveContext hsc = new JavaHiveContext(sc);

// Get your data into a SchemaRDD and register the table

// Query it
String hql = "SELECT FIELD1, FIELD2, percentile(FIELD3, 0.05) AS ptile5 "
    + "FROM TABLE-NAME GROUP BY FIELD1, FIELD2";

JavaSchemaRDD result = hsc.hql(hql);
List<Row> grp = result.collect();

for (int z = 0; z < grp.size(); z++) {
    Row row = grp.get(z);
    // Do something with the results
}

Curt





Percentile example

2015-02-15 Thread SiMaYunRui
Hello, 

I am a newbie to Spark and am trying to figure out how to get a percentile 
against a big data set. I googled this topic but did not find any very useful 
code examples or explanations. It seems that I can use the sortByKey 
transformation to get my data set in order, but I am not quite sure how I can 
get the value of, for example, the 66th percentile. 

Should I use take() to pick up the value of the 66th percentile? I don't 
believe any machine can load my data set in memory. I believe there must be 
more efficient approaches. 

Can anyone shed some light on this problem? 

Regards