On Wed, Mar 13, 2013 at 5:39 AM, Sean Owen <[email protected]> wrote:

> On Wed, Mar 13, 2013 at 12:06 PM, Dmitriy Lyubimov <[email protected]>
> wrote:
>
> > Also. I still have an impression as i mentioned that adaptive version of
> > algorithm is not available and specifying lambda for als-wr is left to
> > operator's intuition? This is probably a bigger issue even than the
>
> (What's the adaptive version? I don't know of an implementation that
> dynamically chooses lambda, but you can always choose it with
> cross-validation. And that could be done in-line with iterations I guess.)
Mm, sort of. The way I understand it, the purpose of cross-validation is to find the expectation of a cost function (error), because of course we are not interested in just a point estimate; such an estimate would exhibit far too large a standard error itself to be reliable. We do run K folds of training, but the parameters of the training stay unchanged. So we still need some sort of search for the argmin of the cost over lambda.

Indeed, some R implementations do it "fold" style, where they train for, say, 20 different lambdas on an exponential scale within reasonable bounds and then pick the one that yields the best expected cost. A slightly less computationally intensive approach, as Rafael suggested, was to do iterative improvement based on far fewer estimates (say 3) fit to a second-degree curve, with its single minimum taken as the best guess for the next iteration. That would require significantly fewer total flops, with significant precision benefits. But again, you have k-fold runs, each requiring quite a few iterations for ALS itself to converge (say 20), multiplied sequentially by the number of lambda search steps. (Btw, the number of iterations is itself ideally a parameter to optimize, since too few will result in unacceptable underfit and muddy the waters even more -- but luckily this has a monotonic effect on cost, so we generally ignore it.)

Next, we discussed whether we could bootstrap lambda on a subset. Ted said that this approach had insurmountable problems: the optimum would not be the same, and projecting it onto a large dataset is complicated. This is also the reason it cannot be done "inline" -- you need to run through the entire dataset to get a reliable cost estimate for a given lambda.

That was it. Brute-force adaptivity was thought to be a costly multiplication of flops and thus was not implemented, and the iterative approach was even less approachable because of the explosive growth in iteration count. So it had been left at that. Or so is my interpretation.
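To make the two strategies above concrete, here is a minimal sketch -- emphatically not Mahout code. It uses plain ridge regression as a cheap stand-in for one full per-lambda ALS-WR training run, and all function names and the synthetic data are my own illustration: a brute-force k-fold grid over an exponential lambda scale, and the cheaper 3-point second-degree-curve refinement.

```python
import numpy as np

def kfold_cv_cost(X, y, lam, k=5, seed=0):
    """Expected held-out squared error at a given lambda, averaged over
    k folds. Ridge regression stands in for one per-lambda training run."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    d = X.shape[1]
    costs = []
    for i in range(k):
        test = folds[i]
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        # ridge normal equations: (X'X + lam*I) w = X'y
        w = np.linalg.solve(X[train].T @ X[train] + lam * np.eye(d),
                            X[train].T @ y[train])
        resid = X[test] @ w - y[test]
        costs.append(float(np.mean(resid ** 2)))
    return float(np.mean(costs))

def grid_search_lambda(X, y, lambdas):
    """Brute-force 'fold style' search: one full k-fold CV per lambda,
    then take the argmin of the expected cost."""
    costs = [kfold_cv_cost(X, y, lam) for lam in lambdas]
    return lambdas[int(np.argmin(costs))], costs

def parabolic_refine(lams, costs):
    """Cheaper iterative step: fit a second-degree curve through three
    (log-lambda, cost) estimates and return its vertex as the next guess."""
    a, b, _ = np.polyfit(np.log(lams), costs, 2)
    if a <= 0:  # curve opens downward: no interior minimum, keep best point
        return float(lams[int(np.argmin(costs))])
    return float(np.exp(-b / (2 * a)))

# Synthetic demo data and a 20-point exponential-scale grid.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
y = X @ rng.normal(size=8) + 0.5 * rng.normal(size=200)
lambdas = list(np.exp(np.linspace(np.log(1e-4), np.log(10.0), 20)))
best, costs = grid_search_lambda(X, y, lambdas)

# Refine around the grid winner using its two neighbors.
i = min(max(int(np.argmin(costs)), 1), len(lambdas) - 2)
next_lam = parabolic_refine(lambdas[i - 1:i + 2], costs[i - 1:i + 2])
```

Note the flop accounting the text describes: the grid variant pays 20 lambdas x k folds x (one full training run each), while the parabolic variant spends only 3 cost estimates per outer step -- but each outer step is still sequential, which is exactly the iteration-count explosion mentioned above.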
Please correct me if I am wrong. One way or another, I was under the impression that the current version forces my hand into manually managing a bisection-like search for the lambda optimum.

> > pieces we know about, with some sweat and tears could be solved with a
> > more constraining technology B as well as more naturally with its
> > superset technology A, what is the merit of making such a choice in
> > favor of B, debatable maturity issues of either choice aside?
>
> I'm speaking for myself but the huge reason is that technology B is widely
> used and mature, and rightly or wrongly in demand, and customers are trying
> to make use of idle resources exposed via B. If using A is only easier for
> the product developer, that's great (and going to lead to better results
> long-term) but not something the customer is interested in. I say
> "customer" but this goes for consumers of open source code.

In other words, maturity arguments. As I said, I have already considered and accepted those. Maturity arguments are debatable, though. The maturity of a product doesn't necessarily improve with age. In fact, the 0.18-0.20 revisions were, by my stats, much more accident-free than CDH3 and on. MapR built a business around fixing operational issues in the Apache version. At some point the tasktrackers in CDH3 had a memory leak, and we had to round-boot them every couple of days or so. Just this last long weekend our sysops were firehosing namenode problems again. YARN is fresh out of the oven. Need I say more?

I accept the maturity argument in the sense that Hadoop is _commercially_ mature (having specifically the MapR distro in mind), or as EMR. As a long-term, production-grade redundant store -- well, I guess we can agree to disagree here :)

The customer-affinity argument, like a lot of the other arguments presented, goes along the lines of "it covers some problems". That doesn't mean it covers anywhere close to 100%. Not the case here; not the case there.
> > And finally, on the side of pragmatic project management, why even
> > artificially favor either choice if we only rely on non-commercial
> > contributions? Why do we even want to oppose any diversification
> > attempts on any ground, as long as we manage them incubator-style along
> > with established, safe graduation policies to ensure chaos control?
> > Viable things will find their use and adoption. (Well, maybe I am a
> > little bit optimistic here. Nonviable tech seems to thrive for years as
> > well, just on the pitch alone.) If they don't find their way into
> > Mahout, they will eventually flourish elsewhere (assuming their
> > viability).
>
> I think this leads to a jumble of half-baked code. A playground of bits of
> code is fine, but why push it together into a project that implies it's
> going to be coherent, supported?

Sorry, I never suggested that. Just collaborate on GitHub.

> Any effort is just tacking on more bits and pieces that are ever less
> related to the other bits. This is excellent -- on Github. Why not stick a
> fork in it?

Ah, but that's exactly what I meant by "incubator mode". GitHub, or a contrib project, whichever. Except I am saying we will probably be better off encouraging and helping new contributors to do so: by suggesting best practices and advising on approaches; by being aware of where they are and what they are doing; by communicating with them; by exploring mutual interests; by exchanging ideas; by putting the mainline merging criteria up front. Maybe some of this will merge into the mainstream if its value is demonstrated. Who knows. Doing Crunch adaptations? Cool! GitHub/sidekick project -- show what you mean there. Doing a Scala mapping? Yay! Show us what you mean. Etc.

Keep in mind: this discussion is not about new methods and bits. This discussion is about new environments. And the motto of Mahout has been declared to be conscious of big data but agnostic of environment. Are you saying you are not in support of that statement?
Why is SGD deemed a valuable contribution, but adaptive ALS on Spark would not be? Neither relies on Hadoop. What technically sets those choices so far apart?

OK, I have probably already laid out my case.
