On Mar 13, 2013 3:00 AM, "Sean Owen" <[email protected]> wrote:
>
> On Wed, Mar 13, 2013 at 2:04 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > Yeah. The stuck point for me is page-rankish finding of stationary
> > distributions and extremely popular ALS-based stuff. We've beaten the heck
> > out of it a year ago and Sebastian conclusively stated Giraph ALS knocks
> > the socks off the MR version. Add to that a bisect search for a good
>
> This keeps being said, but I thought Sebastian just said that the M/R
> version he mentioned being much slower was a different version, deleted
> from this project? See my other email. The current version is similar to
> the one I just benchmarked, and that appeared to be about as fast as
> GraphLab (still not clear if the same amount of work is being compared,
> though).
Let's see what is being said just in this thread.

On Mar 11, 2013 1:39 PM, "Sebastian Schelter" <[email protected]> wrote:
>
> > Anyway, what I am
> > saying, isn't it more or less truthful to say that in pragmatic ways the ALS
> > stuff in Mahout is lagging for the very reason of Mahout being constrained
> > to MR?
>
> Definitely.

By how much is difficult to say; as you say, it is hard to be sure, and at the very least it seems we have entered contradicting statements (at least with the GraphLab paper).

Also, "Sebastian Schelter" <[email protected]> wrote:

> It's not a bad way per se, it's a repartition join. Side-loading means
> doing a broadcast join, which is more performant but only works as long
> as the broadcasted side fits into the receiver's memory. You're right, we
> removed the repartition variant.

That is, it works for ALS with caveats, as opposed to a simpler model with no such constraints, but we are not talking strictly about ALS here. There is a whole class of things, meaning there's still something in the future, unless we declare a perpetual code freeze. Sometimes relying on clever tricks is just not cost-efficient in the long run.

Also, I still have the impression, as I mentioned, that an adaptive version of the algorithm is not available, and that specifying lambda for ALS-WR is left to the operator's intuition. This is probably an even bigger issue than the performance per se, because it directly affects the quality of results, and the whole premise is that at that volume we cannot really confirm whether the regularization rate we chose was optimal. If an adaptive version is there, though, please let me know, as I would very much like to figure out how it managed to be efficient in a batch setting, to clue me in on similar things in my other work.

> I point it out in case this is underpinning many people's logic for
> rebuilding a bunch of stuff because it will be a *lot* faster. Surely some
> stuff can be done more naturally in a graph paradigm but not everything, or
> most?
I'm worried about the conclusion because of cases like this. I don't think I would undertake something that is already solved within requirements, so I don't think I would jump on ALS or anything else if any off-the-shelf solution fits the running time and hardware specs I am handed. In that sense I am totally technologically secular. Again, the concrete non-adaptive (or even adaptive) ALS case is not what motivates this for me today.

Suppose for a moment that Mahout were a commercial project with a lot of things in the roadmap, and we had to make strategic decisions about something we don't yet know. Even if we could demonstrate that some of the vital pieces we know about could, with some sweat and tears, be solved with a more constraining technology B as well as more naturally with its superset technology A, what would be the merit of choosing B, debatable maturity issues of either choice aside?

Next, wasn't it declared quite some time ago that the Mahout philosophy is about learning at scale, not about learning with Hadoop or MR etc.? As long as the underpinning structures and data formats are reused, there is no such exclusivity of any sort, as officially stated.

And finally, on the side of pragmatic project management, why artificially favor either choice if we only rely on non-commercial contributions? Why would we even want to oppose any diversification attempts on any ground, as long as we manage them incubator-style, along with established safe graduation policies to ensure chaos control? Viable things will find their use and adoption. (Well, maybe I am a little bit optimistic here. Nonviable tech seems to be thriving for years as well, just on the pitch alone.) If they don't find their way into Mahout, they will eventually flourish elsewhere (assuming their viability). Why miss any chance at striking gold?
Why say nay to even the ugliest of ducklings, as long as we can vote on its rank of ugliness or beauty, or check its popularity with the community via something like a GitHub star rating? Or even by counting mentions on the user list, for that matter? I assume our goal is to optimize for adoption rate within the stated problem spectrum. In that sense I like the GitHub model: it doesn't discourage anything, incoherent mumbling just dies on its own, and useful things get attention. Once things start getting consistent attention (or consistently not), any critical opinion on their goodness or ugliness becomes increasingly irrelevant.

Yes, we can have our personal intuitions, based on experience, about how things are going to work out. Yes, we can advise potential contributors by pointing out potential problems, best desired practices, or a perceived lack of new ground. But I will be the first to admit I don't know anything for sure. Over the years I figured a reality check is the only proven reward function. Just like evolution, the only way to evolve so far has been to keep pressing that "check" button.
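P.S. To make the "adaptive lambda" point above concrete, here is a minimal sketch of what automatic regularization selection could look like in principle: hold out some ratings, fit the model for each candidate lambda, and keep the lambda with the best held-out RMSE. The rank-1 ALS below is a toy stand-in, not Mahout's ALS-WR implementation, and all names here are hypothetical.

```python
# Toy sketch of "adaptive" lambda selection for ALS (NOT Mahout's code):
# search candidate lambdas and keep the one minimizing held-out RMSE.
import math

def als_rank1(train, n_users, n_items, lam, iters=50):
    """Alternating least squares for a rank-1 model r_ij ~ u[i] * v[j]."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for i, j, r in train:
        by_user[i].append((j, r))
        by_item[j].append((i, r))
    for _ in range(iters):
        # Closed-form ridge update per user factor, then per item factor.
        for i in range(n_users):
            num = sum(v[j] * r for j, r in by_user[i])
            den = sum(v[j] ** 2 for j, _ in by_user[i]) + lam
            u[i] = num / den
        for j in range(n_items):
            num = sum(u[i] * r for i, r in by_item[j])
            den = sum(u[i] ** 2 for i, _ in by_item[j]) + lam
            v[j] = num / den
    return u, v

def rmse(model, ratings):
    u, v = model
    se = sum((u[i] * v[j] - r) ** 2 for i, j, r in ratings)
    return math.sqrt(se / len(ratings))

def pick_lambda(train, held_out, n_users, n_items, grid):
    """Return the candidate lambda with the lowest held-out RMSE."""
    scored = [(rmse(als_rank1(train, n_users, n_items, lam), held_out), lam)
              for lam in grid]
    return min(scored)[1]
```

With lambda chosen this way, "reg rate left to operator's intuition" becomes an ordinary (if expensive) search problem; the same shape accommodates a bisection-style search when held-out error is roughly unimodal in lambda. The expensive part, of course, is that each candidate costs a full ALS run, which is exactly why doing it efficiently in a batch setting is the interesting question.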
