[
https://issues.apache.org/jira/browse/MAHOUT-771?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070300#comment-13070300
]
Lance Norskog commented on MAHOUT-771:
--------------------------------------
Performance tests: results on my laptop for RandomProjectorBenchmark:
||Algorithm|Density|Time (ms)|
|JDK|dense|658|
|JDK|sparse|6399|
|2of6|dense|198|
|2of6|sparse|13018|
|+1/-1|dense|68|
|+1/-1|sparse|2401|
The ++1/-1+ implementation (RandomProjectPlusMinus) is much faster, both dense
and sparse, as the "2 out of 6" implementation. ++1/-1+ uses 64 samples from
one MurmurHash cycle, whereas 2of6 only uses 24 samples (6^24 < 2^63 < 2^64 <
6^25). I can see why ++1/-1+ is faster on dense, but with sparse data both
algorithms generate a new cycle for each sample and I don't see why it is 5+
times slower than ++1/-1+.
However, 2of6 can create 0-valued vectors. If your sparse vectors have 5
samples, 12% of the output vectors will be 0. With 10 samples, it's down to 1%.
> Random Projection using sampled values
> --------------------------------------
>
> Key: MAHOUT-771
> URL: https://issues.apache.org/jira/browse/MAHOUT-771
> Project: Mahout
> Issue Type: New Feature
> Components: Math
> Reporter: Lance Norskog
> Priority: Minor
> Attachments: RandomProjector.patch, RandomProjectorBenchmark.java
>
>
> Random Projection implementation which follows two deterministic guarantees:
> # The same data projected multiple times produces the same output
> # Dense and sparse data with the same contents produce the same output
> Custom class that does Random Projection based on Johnson-Lindenstrauss. This
> implementation uses Achlioptas's results, which allow using method other than
> a full-range random multiplier per sample:
> * use 1 random bit to add or subtract a sample to a row sum
> * use a random value from 1/6 to add (1/6), subtract (1/6), or ignore (4 out
> of 6) a sample to a row sum
> Custom implementations for both dense and sparse vectors are included. The
> sparse vector implementation assumes the active values will fit in memory.
> An implementation using full-range random multipliers made by
> java.util.Random is included for reference/research.
> *Database-friendly random projections: Johnson-Lindenstrauss with binary
> coins*
> _Dimitris Achlioptas_
> [http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.84.4546&rep=rep1&type=pdf]
--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira