[
https://issues.apache.org/jira/browse/MAHOUT-676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13021947#comment-13021947
]
Lance Norskog commented on MAHOUT-676:
--------------------------------------
The Iterator semantic is too limiting for a modular library of samplers.
There are two main types of sampling: Bernoulli/binomial and reservoir.
Bernoulli does not save anything that goes past, and emits a substream of the
inputs. It can have state about what it sees. Reservoir sampling maintains an
array of items that have gone past. The basic idea is that, at any time, the
contents of the reservoir are a random subset of the entire previous stream of
items. Bernoulli is pretty simplistic but is good enough in many applications.
Reservoir sampling is allegedly better at being random, and allows various
options like deleting items, weighting different kinds of items, and
(my favorite!) correctly subsampling the input legs of a join.
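
To make the reservoir idea concrete, here is a minimal sketch of the classic
uniform reservoir scheme (often called Algorithm R). It is not the code in
Sampler.patch: the class name ReservoirSampler and its methods add(), sample()
and seen() are hypothetical, chosen only for illustration. After n items have
been offered, each one should be in the reservoir with probability capacity/n.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; not a class from Sampler.patch or Mahout.
final class ReservoirSampler<T> {
  private final int capacity;
  private final Random random;
  private final List<T> reservoir;
  private long seen; // number of items offered so far

  ReservoirSampler(int capacity, Random random) {
    if (capacity <= 0) {
      throw new IllegalArgumentException("capacity must be positive");
    }
    this.capacity = capacity;
    this.random = random;
    this.reservoir = new ArrayList<>(capacity);
  }

  // Offer one item to the sampler; the caller pushes, the sampler never pulls.
  void add(T item) {
    seen++;
    if (reservoir.size() < capacity) {
      // The first `capacity` items are always kept.
      reservoir.add(item);
      return;
    }
    // Pick a slot in [0, seen). The new item replaces an old one only if the
    // slot falls inside the reservoir, which happens with probability
    // capacity / seen.
    long slot = (long) (random.nextDouble() * seen);
    if (slot < capacity) {
      reservoir.set((int) slot, item);
    }
  }

  // Read-only snapshot of the current sample.
  List<T> sample() {
    return Collections.unmodifiableList(new ArrayList<>(reservoir));
  }

  long seen() {
    return seen;
  }
}
```

This sketch covers only the plain uniform case. The deletion, weighting and
join-leg options mentioned above would need more state than this.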
SamplingIterator and SamplingLongPrimitiveIterator are Bernoulli samplers.
FixedSizeSamplingIterator and StableFixedSSI are reservoir samplers.
OnlineSummary is a more complex kind of reservoir sampler.
These Sampler classes are not iterators: they have the simplest possible API to
represent a queue/stream. They need to be able to add items from different
sources in the same stream. The iterators are not useful in map/reduce: mappers
get one object at a time; only reducers get an iterator. Combiners cannot share
the common state. Sampling is _really_ important in map/reduce, and Mahout only
supports it in reducers.
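
As a rough sketch of what a push-style API could look like (an assumption for
illustration, not the interface in Sampler.patch), the hypothetical PushSampler
below takes one item per call, so a mapper can feed it from map() without
having an iterator. BernoulliPushSampler is an equally hypothetical Bernoulli
implementation that forwards each item to a sink with a fixed probability.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.Consumer;

// Hypothetical push-style API; the names are illustrative, not from the patch.
interface PushSampler<T> {
  void add(T item);
}

// Bernoulli sampling: each item is kept independently with probability `rate`.
// Nothing is stored; accepted items go straight to the sink.
final class BernoulliPushSampler<T> implements PushSampler<T> {
  private final double rate;
  private final Random random;
  private final Consumer<T> sink;

  BernoulliPushSampler(double rate, Random random, Consumer<T> sink) {
    if (rate < 0.0 || rate > 1.0) {
      throw new IllegalArgumentException("rate must be in [0, 1]");
    }
    this.rate = rate;
    this.random = random;
    this.sink = sink;
  }

  @Override
  public void add(T item) {
    // nextDouble() is uniform in [0, 1), so this accepts with probability rate.
    if (random.nextDouble() < rate) {
      sink.accept(item);
    }
  }
}

final class PushSamplerDemo {
  public static void main(String[] args) {
    List<String> kept = new ArrayList<>();
    PushSampler<String> sampler =
        new BernoulliPushSampler<>(0.5, new Random(7L), kept::add);
    // Two independent sources feed the same sampler, one item per call.
    for (String s : new String[] {"a1", "a2", "a3"}) {
      sampler.add(s);
    }
    for (String s : new String[] {"b1", "b2", "b3"}) {
      sampler.add(s);
    }
    System.out.println("kept " + kept.size() + " of 6 items");
  }
}
```

Because the caller pushes items in, the same sampler instance can be fed from
several sources in turn, which an Iterator wrapper around a single source
cannot do.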
> Random samplers in a modular library
> ------------------------------------
>
> Key: MAHOUT-676
> URL: https://issues.apache.org/jira/browse/MAHOUT-676
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: Sampler.patch
>
>
> This is a modular suite of samplers.
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira