Hi Khurrum,

To expand upon what Dmitriy was saying regarding k-means|| sketching in the GitHub repo for the Samsara book, please see:

https://github.com/andrewpalumbo/mahout-samsara-book/blob/master/myMahoutApp/src/main/scala/myMahoutApp/BahmaniSketch.scala#L48

Mahout has a sampling API based on the underlying engine's sampling methods, in this case Spark's, as described by Dmitriy below. See:

https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/drm/package.scala#L138

and its implementation in the Spark module:

https://github.com/apache/mahout/blob/master/spark/src/main/scala/org/apache/mahout/sparkbindings/SparkEngine.scala#L247

With these sampling methods, along with the statistics available for DRMs, most of the tools are available to implement Monte Carlo style algorithms, and we would have interest in including some implementations in the upcoming releases.

Andy

________________________________________
From: Khurrum Nasim <[email protected]>
Sent: Monday, May 2, 2016 12:47:17 PM
To: [email protected]
Subject: Re: stochastic nature

Thanks for the insight Dimitri. I will look further into spark to understand how it handles parallelization and distributed processing.

> On May 2, 2016, at 12:39 PM, Dmitriy Lyubimov <[email protected]> wrote:
>
> by probabilistic algorithms i mostly mean inference involving monte carlo
> type mechanisms (Gibbs sampling LDA which i think might still be part of
> our MR collection might be an example, as well as its faster counterpart,
> variational Bayes inference).
>
> the parallelization strategies are just standard spark mechanisms (in
> case of spark), mostly using their standard hash samplers (which in
> math speak are uniform multinomial samplers really).
>
> On Mon, May 2, 2016 at 9:25 AM, Khurrum Nasim <[email protected]>
> wrote:
>
>> Hey Dimitri -
>>
>> Yes I meant probabilistic algorithms. If mahout doesn’t use probabilistic
>> algos then how does it accomplish a degree of optimal parallelization?
>> Wouldn’t you need randomization to spread out the processing of tasks?
>>
>>> On May 2, 2016, at 12:13 PM, Dmitriy Lyubimov <[email protected]> wrote:
>>>
>>> yes mahout has stochastic svd and pca which are described at length in the
>>> samsara book. The book examples in Andrew Palumbo's github also contain an
>>> example of computing k-means|| sketch.
>>>
>>> if you mean _probabilistic_ algorithms, although i have done some things
>>> outside the public domain, nothing has been contributed.
>>>
>>> You are very welcome to try something if you don't have big constraints on
>>> oss contribution.
>>>
>>> -d
>>>
>>> On Mon, May 2, 2016 at 7:49 AM, Khurrum Nasim <[email protected]>
>>> wrote:
>>>
>>>> Hey All,
>>>>
>>>> I’d like to know if Mahout uses any randomized algorithms. I’m thinking
>>>> it probably does. Can somebody point me to the packages that utilized
>>>> randomized algos.
>>>>
>>>> Thanks,
>>>>
>>>> Khurrum
