On distributions, I did not find anything multivariate and Mahout-Matrix-based. Hopefully I just did not look hard enough. Everything univariate seems pretty spotty as well. Aside from that, I need Scala traits, and I find it extremely inelegant (un-Scala, if you like) to write something like `new MultivariateUniformDistribution(mu, sigma).sample()`, so for the most part I just DSL-bridged. There are enough third-party choices that it is not worth filling the gaps ourselves.
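To illustrate the DSL-bridging pattern (in Python rather than Scala, and with made-up names; this is not Mahout code, just a sketch of hiding constructor-plus-`sample()` ceremony behind a plain function):

```python
import random

class MultivariateUniformDistribution:
    """Hypothetical distribution: box uniform over
    [mu_i - sigma_i, mu_i + sigma_i] in each coordinate."""
    def __init__(self, mu, sigma, rng=None):
        self.mu, self.sigma = mu, sigma
        self.rng = rng or random.Random(0)

    def sample(self):
        # draw each coordinate independently from its interval
        return [self.rng.uniform(m - s, m + s)
                for m, s in zip(self.mu, self.sigma)]

# the DSL bridge: sampling reads as a function call,
# no `new ...Distribution(...).sample()` at use sites
def mvuniform(mu, sigma, rng=None):
    return MultivariateUniformDistribution(mu, sigma, rng).sample()

x = mvuniform([0.0, 10.0], [1.0, 2.0])
```

In Scala the same effect would come from a function or implicit in the DSL package object, so call sites stay one-liners.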
On step-recorded evolutionary search: after my literature search on the topic, it does not look like even a distant third-best choice, particularly in big-data training settings.

First, I did not find any head-to-head comparisons of it against the top choices. It is not included in the AMPLab survey of top search choices, and GP-EI is Netflix's choice, for example. So there is very little convincing data to go on to begin with; given the lack of such comparisons, the next best thing is to copy what others do here.

Second, in big-data settings every (training) data point is precious. In Spark specifically, unlike MR, since we want to retain as much data in RAM as possible and avoid spills, the best performance is usually achieved by semaphoring trainings sequentially rather than throwing a whole batch of them out at once. This is especially true when companies are, for whatever reason, extremely anemic in provisioning the needed hardware. In that sense, exploration algorithms that can make better inferences after each new data point, and arrive at a reasonably performing model in ~20..30 sequential trainings, are infinitely preferable to those that require a whole batch of trainings before they can even begin to figure out the next centroid of trials. I am not even sure step-recorded search has ever been tried outside SGD, where data points are abundant, albeit incomplete.

On Tue, Aug 26, 2014 at 8:32 AM, Ted Dunning <[email protected]> wrote:

> On Mon, Aug 25, 2014 at 2:40 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > This work is obviously also interesting in that it
> > establishes probabilistic framework in Mahout (distributions & gaussian
> > process).
>
> We already have that.
>
> (distributions not GP)
>
> Note that we also have an implementation of recorded step evolutionary
> programming that works really well for hyper-parameter search. I don't
> like the way that the API turned out (too hard to understand).
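As a footnote, the sequential-exploration argument above can be sketched as a loop (a toy stand-in in Python; the incumbent-plus-shrinking-radius proposal is a placeholder for a real model update such as GP-EI, and all names are made up):

```python
import random

def sequential_search(objective, lo, hi, budget=30, seed=0):
    """One training per step; each new observation immediately
    narrows the proposal region around the incumbent, so a usable
    model can emerge within ~20..30 sequential trainings."""
    rng = random.Random(seed)
    dim = len(lo)
    best_x = [rng.uniform(lo[i], hi[i]) for i in range(dim)]
    best_y = objective(best_x)          # first (expensive) training
    radius = 1.0
    history = [best_y]
    for _ in range(budget - 1):
        # propose near the incumbent; shrink as evidence accumulates
        x = [min(hi[i], max(lo[i],
                 best_x[i] + rng.gauss(0, radius * (hi[i] - lo[i]) / 4)))
             for i in range(dim)]
        y = objective(x)                # next sequential training
        history.append(y)
        if y < best_y:
            best_x, best_y = x, y
        radius *= 0.9
    return best_x, best_y, history
```

A batch evolutionary scheme would instead fire off a whole generation of trainings before updating anything, which is exactly what hurts when trainings must be serialized to keep data resident in RAM.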
