[ 
https://issues.apache.org/jira/browse/MAHOUT-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021947#comment-13021947
 ] 

Lance Norskog commented on MAHOUT-676:
--------------------------------------

The Iterator semantics are too limiting for a modular library of samplers. 

There are two main types of sampling: Bernoulli/binomial and reservoir. 
Bernoulli does not save anything that goes past, and emits a substream of the 
inputs. It can keep state about what it sees. Reservoir sampling maintains an 
array of items that have gone past. The basic idea is that, at any time, the 
contents of the reservoir are a random subset of the entire previous stream of 
items. Bernoulli is pretty simplistic but is good enough in many applications. 
Reservoir sampling is allegedly better at being random, and allows various 
options like deleting items, weighting different kinds of items, and 
(my favorite!) correctly subsampling the input legs of a join. 
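
To make the distinction concrete, here is a minimal sketch of both kinds. 
The class names BernoulliSketch and ReservoirSketch are made up for this 
illustration and are not the classes in Sampler.patch; the reservoir version 
is the textbook "Algorithm R", which may differ from what the patch does. 

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Bernoulli sampler (hypothetical name): keeps each item with probability p, stores nothing. */
final class BernoulliSketch<T> {
  private final double p;
  private final Random rng;

  BernoulliSketch(double p, Random rng) {
    this.p = p;
    this.rng = rng;
  }

  /** Returns true if the item should be emitted into the output substream. */
  boolean accept(T item) {
    return rng.nextDouble() < p;
  }
}

/** Reservoir sampler (hypothetical name), Algorithm R: holds at most k items. */
final class ReservoirSketch<T> {
  private final int k;
  private final Random rng;
  private final List<T> reservoir;
  private long seen = 0;

  ReservoirSketch(int k, Random rng) {
    this.k = k;
    this.rng = rng;
    this.reservoir = new ArrayList<>(k);
  }

  /** Offers one item; afterwards the reservoir is a uniform sample of everything seen so far. */
  void add(T item) {
    seen++;
    if (reservoir.size() < k) {
      // The first k items always go in.
      reservoir.add(item);
    } else {
      // Pick a slot j in [0, seen); the new item replaces slot j only if j < k,
      // so it is kept with probability k / seen.
      long j = (long) (rng.nextDouble() * seen);
      if (j < k) {
        reservoir.set((int) j, item);
      }
    }
  }

  /** Returns a copy of the current reservoir contents. */
  List<T> sample() {
    return new ArrayList<>(reservoir);
  }
}
```

The Bernoulli version answers yes/no per item and has an output size that 
varies with the stream length; the reservoir version has a fixed output size 
and can be asked for its sample at any point in the stream. 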

SamplingIterator and SamplingLongPrimitiveIterator are Bernoulli samplers. 
FixedSizeSamplingIterator and StableFixedSSI are reservoir samplers. 
OnlineSummary is a more complex kind of reservoir sampler.

These Sampler classes are not iterators: they have the simplest possible API to 
represent a queue/stream. They need to be able to add items from different 
sources in the same stream. The iterators are not useful in map/reduce: mappers 
get one object at a time; reducers get an iterator. Combiners cannot share the 
common state. 
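
As a rough illustration of what a push-style API could look like (the 
interface name PushSampler and its methods add/drain are assumptions made up 
for this sketch, not the API in Sampler.patch), the caller hands over one item 
at a time, which is the shape a mapper needs: 

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical push-style sampler API; not the interface from Sampler.patch. */
interface PushSampler<T> {
  /** Called once per incoming item, from any source. */
  void add(T item);

  /** Returns the items kept so far and clears the internal buffer. */
  List<T> drain();
}

/** Bernoulli implementation of the hypothetical interface above. */
final class BernoulliPushSampler<T> implements PushSampler<T> {
  private final double p;
  private final Random rng;
  private List<T> kept = new ArrayList<>();

  BernoulliPushSampler(double p, Random rng) {
    this.p = p;
    this.rng = rng;
  }

  @Override
  public void add(T item) {
    // Keep the item with probability p; otherwise let it go past.
    if (rng.nextDouble() < p) {
      kept.add(item);
    }
  }

  @Override
  public List<T> drain() {
    List<T> out = kept;
    kept = new ArrayList<>();
    return out;
  }
}
```

Because add() takes a single item, the same sampler instance can be fed from 
several sources in turn, or from repeated map() calls, without wrapping any of 
them in an Iterator. 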

Sampling is _really_ important in map/reduce, and Mahout supports it only in 
reducers.  


> Random samplers in a modular library
> ------------------------------------
>
>                 Key: MAHOUT-676
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-676
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Math
>            Reporter: Lance Norskog
>            Priority: Minor
>         Attachments: Sampler.patch
>
>
> This is a modular suite of samplers.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
