[ 
https://issues.apache.org/jira/browse/MAHOUT-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040732#comment-13040732
 ] 

Lance Norskog edited comment on MAHOUT-676 at 5/29/11 3:32 AM:
---------------------------------------------------------------

I got interested again :)

This includes full unit tests and a new sampler. The sampler interface has
changed: you can add samples, iterate over the current list, and check whether
a sample would be dropped. Note that this check advances the state machine
inside the sampler.

The major point of interest is a brute-force implementation of "Slice
Sampling": you supply a function on your samples, and the sampler keeps samples
based on the "area" under the function. Example: a user who watches 2 movies is
more interesting than a user who watches one, and so on up to 20 movies. After
that, who cares? Let's say the user's "influence score" is the square root of
the number of movies he has watched, capped at 20 movies.

Slice sampling requires two functions: one that maps a user to an X value, and
one that maps an X value to a Y value. The first gives the raw influence of the
user, and the second compresses that influence. Slice sampling pulls a subset
of the original samples whose density matches the area under the second
function.
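
To make the two-function idea concrete, here is a rough sketch in Java. It is
not the code in the attached patch: the class name CurveSampler, the maxY
parameter, and the accept/reject test are all assumptions made for
illustration. It keeps each sample with probability curve(x) / maxY, which is
plain rejection against the curve and may differ from what the patch does.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch, not the sampler from MAHOUT-676.patch.
class CurveSampler<T> {
  private final ToDoubleFunction<T> toX;    // first function: sample -> raw X value
  private final DoubleUnaryOperator curve;  // second function: X -> Y, expected in [0, maxY]
  private final double maxY;                // assumed upper bound of the curve
  private final Random random;
  private final List<T> kept = new ArrayList<>();

  CurveSampler(ToDoubleFunction<T> toX, DoubleUnaryOperator curve, double maxY, long seed) {
    this.toX = toX;
    this.curve = curve;
    this.maxY = maxY;
    this.random = new Random(seed);
  }

  // Offers a sample; returns true if it was kept, false if it was dropped.
  boolean add(T sample) {
    double y = curve.applyAsDouble(toX.applyAsDouble(sample));
    // Draw a height uniformly in [0, maxY) and keep the sample if it falls under the curve.
    boolean keep = random.nextDouble() * maxY < y;
    if (keep) {
      kept.add(sample);
    }
    return keep;
  }

  // The samples kept so far, in insertion order.
  List<T> samples() {
    return kept;
  }
}
```

For the movie example, toX would return the number of movies watched, curve
would be x -> Math.sqrt(Math.min(x, 20.0)), and maxY would be Math.sqrt(20.0).
A user with 20 or more movies is then always kept, and a user with one movie is
kept about 22% of the time (1 / sqrt(20)).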

This is interesting because it lets you shape a set of samples according to a 
fixed curve. If your categorizer has problems with the class of inputs you are 
most interested in, you can use slice sampling to trim down the less 
interesting samples.

      was (Author: lancenorskog):
    I got interested again :)

This includes full unit tests and a new sampler. The sampler interface is 
changed: you can add samples, iterate the current list, and check whether the 
sample would be dropped. This kicks forward the state machine inside the 
sampler.

The major point of interest is a brute-force implementation of "Slice 
Sampling": you supply a function on your samples, and the sampler keeps samples 
based on the "area" under the function. Example: it doesn't matter how many 
movies a user watched above 20 movies. So, a function on the sample returns the 
number of movies. 
  
> Random samplers in a modular library
> ------------------------------------
>
>                 Key: MAHOUT-676
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-676
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: MAHOUT-676.patch, Sampler.patch
>
>
> This is a modular suite of samplers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
