[jira] [Commented] (DATAFU-16) weighted reservoir sampling with exponential jumps UDF

2014-02-08 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-16?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895791#comment-13895791
 ] 

jian wang commented on DATAFU-16:
---------------------------------

Matt, do you think we should go ahead and implement the exponential jump only 
for the accumulate-based model? For the algebraic model, we would keep using 
weighted reservoir sampling without the exponential jump. 

The upside of introducing the exponential jump: it could improve job 
performance, especially when there is a lot of data to process, without 
sacrificing much sampling precision (the per-item sampling probability stays 
close to w/sum(w)). 

The downside: the accumulate-based model is likely to be used less often than 
the algebraic one, so is it worthwhile to introduce this enhancement?
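
For reference, the exponential-jump idea can be sketched as follows. This is a minimal Python sketch of the scheme usually called A-ExpJ (Efraimidis and Spirakis), not the attached ScoredExpJmpReservoir.java. The function and helper names are hypothetical, and it assumes positive weights of moderate magnitude. Instead of drawing a random key for every item, it draws one random "jump" of cumulative weight and skips items until that weight is used up, which is where the performance gain would come from.

```python
import heapq
import math
import random


def _open_unit(rng):
    """Draw from the open interval (0, 1) so that log() is always defined."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u


def _next_jump(rng, min_key):
    """Cumulative weight to skip before the current minimum key is replaced."""
    if min_key >= 1.0:  # no later item can beat a key of 1
        return math.inf
    if min_key <= 0.0:  # key underflowed to 0: the next item replaces it
        return 0.0
    return math.log(_open_unit(rng)) / math.log(min_key)


def weighted_reservoir_exp_jump(items, k, rng=None):
    """Sample up to k items from an iterable of (item, weight) pairs in one pass."""
    if k <= 0:
        return []
    if rng is None:
        rng = random.Random()
    heap = []  # min-heap of (key, arrival index, item); index breaks ties
    jump = 0.0
    for index, (item, weight) in enumerate(items):
        if weight <= 0:
            raise ValueError("weights must be positive")
        if len(heap) < k:
            # Fill phase: every item gets the key u ** (1 / w).
            key = _open_unit(rng) ** (1.0 / weight)
            heapq.heappush(heap, (key, index, item))
            if len(heap) == k:
                jump = _next_jump(rng, heap[0][0])
            continue
        # Jump phase: consume weight without drawing a key per item.
        jump -= weight
        if jump <= 0:
            # This item enters; its key is drawn above the evicted minimum.
            threshold = heap[0][0] ** weight
            key = rng.uniform(threshold, 1.0) ** (1.0 / weight)
            heapq.heapreplace(heap, (key, index, item))
            jump = _next_jump(rng, heap[0][0])
    return [item for _, _, item in heap]
```

With k = 1 and weights 1 and 9, the second item should be returned about 90% of the time, matching w/sum(w). Extreme weights would need the keys kept in log space, which this sketch does not do.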

 weighted reservoir sampling with exponential jumps UDF
 ------------------------------------------------------

 Key: DATAFU-16
 URL: https://issues.apache.org/jira/browse/DATAFU-16
 Project: DataFu
 Issue Type: New Feature
 Environment: Mac, Linux, pig-0.11
 Reporter: jian wang
 Priority: Minor
 Attachments: ScoredExpJmpReservoir.java, ScoredReservoir.java, 
 WeightedSamplingCorrectnessTests.java


 Create a weightedReservoirSampleWithExpJump UDF to implement the weighted 
 reservoir sampling algorithm with exponential jumps. Investigation is tracked 
 in https://github.com/linkedin/datafu/issues/80. This task is part of an 
 experiment with different weighted sampling algorithms.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (DATAFU-28) Tests are too slow

2014-02-08 Thread jian wang (JIRA)

[ 
https://issues.apache.org/jira/browse/DATAFU-28?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895817#comment-13895817
 ] 

jian wang commented on DATAFU-28:
---------------------------------

Matt, do you have timing stats for the individual test cases in 
datafu.test.pig.stats.entropy and datafu.test.pig.sampling?

Which Ant option or tool do you use to measure the running time of the test 
cases?

 Tests are too slow
 ------------------

 Key: DATAFU-28
 URL: https://issues.apache.org/jira/browse/DATAFU-28
 Project: DataFu
 Issue Type: Bug
 Reporter: Matthew Hayes

 I ran the tests on my laptop and it took nearly 2 hours.
 The worst offenders are {{datafu.test.pig.sampling}}, 
 {{datafu.test.pig.stats}}, and {{datafu.test.pig.stats.entropy}}.
 ||Package                     ||Tests||Failures||Duration ||Success rate||
 |datafu.test.pig.bags          |27    |0        |1m10.72s  |100%         |
 |datafu.test.pig.geo           |1     |0        |9.757s    |100%         |
 |datafu.test.pig.hash          |4     |0        |41.039s   |100%         |
 |datafu.test.pig.linkanalysis  |5     |0        |32.677s   |100%         |
 |datafu.test.pig.random        |1     |0        |11.789s   |100%         |
 |datafu.test.pig.sampling      |25    |0        |38m25.81s |100%         |
 |datafu.test.pig.sessions      |7     |0        |2m50.67s  |100%         |
 |datafu.test.pig.sets          |9     |0        |5m46.70s  |100%         |
 |datafu.test.pig.stats         |52    |0        |26m11.98s |100%         |
 |datafu.test.pig.stats.entropy |40    |0        |31m30.97s |100%         |
 |datafu.test.pig.urls          |1     |0        |1m35.24s  |100%         |
 |datafu.test.pig.util          |21    |0        |4m51.64s  |100%         |
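
The durations in the table can be tallied with a short script to check the "nearly 2 hours" figure. This is only a sketch: the duration strings are copied from the table, and the helper name to_seconds is hypothetical. The total comes to about 114 minutes, and the three packages named as the worst offenders account for roughly 84% of it.

```python
import re

# Package durations copied from the table in the issue description.
DURATIONS = {
    "datafu.test.pig.bags": "1m10.72s",
    "datafu.test.pig.geo": "9.757s",
    "datafu.test.pig.hash": "41.039s",
    "datafu.test.pig.linkanalysis": "32.677s",
    "datafu.test.pig.random": "11.789s",
    "datafu.test.pig.sampling": "38m25.81s",
    "datafu.test.pig.sessions": "2m50.67s",
    "datafu.test.pig.sets": "5m46.70s",
    "datafu.test.pig.stats": "26m11.98s",
    "datafu.test.pig.stats.entropy": "31m30.97s",
    "datafu.test.pig.urls": "1m35.24s",
    "datafu.test.pig.util": "4m51.64s",
}


def to_seconds(text):
    """Convert a duration such as '38m25.81s' or '9.757s' to seconds."""
    match = re.fullmatch(r"(?:(\d+)m)?(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unrecognized duration: {text!r}")
    minutes = int(match.group(1) or 0)
    return 60 * minutes + float(match.group(2))


seconds = {name: to_seconds(text) for name, text in DURATIONS.items()}
total = sum(seconds.values())

# The three slowest packages and their share of the total run time.
slowest = sorted(seconds, key=seconds.get, reverse=True)[:3]
share = sum(seconds[name] for name in slowest) / total

print(f"total: {total / 60:.1f} min")
print(f"slowest three: {slowest} ({share:.0%} of total)")
```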


