[ https://issues.apache.org/jira/browse/MAHOUT-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13040732#comment-13040732 ]
Lance Norskog edited comment on MAHOUT-676 at 5/29/11 3:32 AM:
---------------------------------------------------------------
I got interested again :)
This includes full unit tests and a new sampler. The sampler interface has
changed: you can add samples, iterate over the current list, and check whether
a given sample would be dropped. Note that this check has a side effect: it
advances the state machine inside the sampler.
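As a rough illustration of that interface shape, here is a minimal sketch. All
names in it (SamplerSketch, EveryNthSampler, addSample, getSamples, isDropped)
are hypothetical and are not taken from the attached patch. It also assumes
that adding a sample runs the drop check internally, and it uses a trivial
keep-every-n-th rule only to make the state change visible.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical interface: names are illustrative, not the patch's API.
interface SamplerSketch<T> {
  void addSample(T sample);      // offer a sample; it may or may not be kept
  Iterator<T> getSamples();      // iterate the samples kept so far
  boolean isDropped(T sample);   // side effect: advances the sampler's state
}

// Assumed toy rule: keep every n-th sample that is offered.
final class EveryNthSampler<T> implements SamplerSketch<T> {
  private final int n;
  private long seen;                          // the "state machine": a counter
  private final List<T> kept = new ArrayList<>();

  EveryNthSampler(int n) {
    if (n < 1) {
      throw new IllegalArgumentException("n must be >= 1");
    }
    this.n = n;
  }

  @Override
  public boolean isDropped(T sample) {
    seen++;                                   // every check moves the state forward
    return seen % n != 0;
  }

  @Override
  public void addSample(T sample) {
    if (!isDropped(sample)) {
      kept.add(sample);
    }
  }

  @Override
  public Iterator<T> getSamples() {
    return kept.iterator();
  }

  public static void main(String[] args) {
    EveryNthSampler<Integer> sampler = new EveryNthSampler<>(3);
    for (int i = 1; i <= 9; i++) {
      sampler.addSample(i);
    }
    for (Iterator<Integer> it = sampler.getSamples(); it.hasNext();) {
      System.out.println(it.next());
    }
  }
}
```

Because the check mutates state, calling isDropped twice for the same sample
gives two independent answers in this sketch, so a caller should check each
sample only once.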
The major point of interest is a brute-force implementation of "Slice
Sampling": you supply a function on your samples, and the sampler keeps samples
based on the "area" under the function. Example: a user who has watched 2
movies is more interesting than a user who has watched one, and so on up to 20
movies. After that, who cares? Let's say the user's "influence score" is the
square root of the number of movies he has watched.
Slice sampling requires two functions: one that maps a user to an X value, and
one that maps an X value to a Y value. The first gives the raw influence of the
user, and the second compresses that influence. Slice sampling pulls a subset
of the original samples whose density matches the area under the second
function.
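One way to read the two-function description is sketched below. This is not
the code in the patch: the class and method names are hypothetical, and the
acceptance rule is a plain accept/reject test (keep a sample with probability
Y / maxY), which is one plausible meaning of "brute-force" here. The movie
example is also an assumption about how the pieces fit together: X is the
number of movies watched, and Y is the square root of X with X capped at 20.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;
import java.util.function.ToDoubleFunction;

// Hypothetical sketch: none of these names come from the MAHOUT-676 patch.
final class SliceSamplerSketch<T> {
  private final ToDoubleFunction<T> toX;   // sample -> raw influence (X)
  private final DoubleUnaryOperator toY;   // X -> compressed influence (Y)
  private final double maxY;               // assumed upper bound on Y
  private final Random random;
  private final List<T> kept = new ArrayList<>();

  SliceSamplerSketch(ToDoubleFunction<T> toX, DoubleUnaryOperator toY,
                     double maxY, long seed) {
    if (maxY <= 0.0) {
      throw new IllegalArgumentException("maxY must be positive");
    }
    this.toX = toX;
    this.toY = toY;
    this.maxY = maxY;
    this.random = new Random(seed);
  }

  // Keeps the sample with probability Y / maxY; returns true if it was kept.
  // If Y exceeds maxY the sample is always kept.
  boolean add(T sample) {
    double y = toY.applyAsDouble(toX.applyAsDouble(sample));
    boolean keep = random.nextDouble() < y / maxY;
    if (keep) {
      kept.add(sample);
    }
    return keep;
  }

  List<T> samples() {
    return kept;
  }

  // Assumed movie example: a sample is a user's movie count,
  // X is that count, and Y is sqrt(min(X, 20)).
  static SliceSamplerSketch<Integer> movieSampler(long seed) {
    return new SliceSamplerSketch<Integer>(
        count -> count.doubleValue(),
        x -> Math.sqrt(Math.min(x, 20.0)),
        Math.sqrt(20.0),
        seed);
  }

  public static void main(String[] args) {
    SliceSamplerSketch<Integer> sampler = movieSampler(42L);
    int[] movieCounts = {0, 1, 5, 20, 100};
    for (int round = 0; round < 1000; round++) {
      for (int count : movieCounts) {
        sampler.add(count);
      }
    }
    System.out.println("kept " + sampler.samples().size() + " of 5000 samples");
  }
}
```

Under these assumptions a user with 20 or more movies is always kept, a user
with no movies is never kept, and a user with 5 movies is kept about half the
time, since sqrt(5) / sqrt(20) = 0.5.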
This is interesting because it lets you shape a set of samples according to a
fixed curve. If your categorizer has problems with the class of inputs you are
most interested in, you can use slice sampling to trim down the less
interesting samples.
was (Author: lancenorskog):
I got interested again :)
This includes full unit tests and a new sampler. The sampler interface is
changed: you can add samples, iterate the current list, and check whether the
sample would be dropped. This kicks forward the state machine inside the
sampler.
The major point of interest is a brute-force implementation of "Slice
Sampling": you supply a function on your samples, and the sampler keeps samples
based on the "area" under the function. Example: it doesn't matter how many
movies a user watched above 20 movies. So, a function on the sample returns the
number of movies.
> Random samplers in a modular library
> ------------------------------------
>
> Key: MAHOUT-676
> URL: https://issues.apache.org/jira/browse/MAHOUT-676
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: MAHOUT-676.patch, Sampler.patch
>
>
> This is a modular suite of samplers.