On Mar 13, 2013 3:00 AM, "Sean Owen" <[email protected]> wrote:
>
> On Wed, Mar 13, 2013 at 2:04 AM, Dmitriy Lyubimov <[email protected]> wrote:
>
> > Yeah. The stuck point for me is page-rankish finding of stationary
> > distributions and extremely popular ALS-based stuff. We've beaten the heck
> > out of it a year ago and Sebastian conclusively stated Giraph ALS knocks
> > the socks off the MR version. Add to that a bisect search for a good
>
> This keeps being said, but I thought Sebastian just said that the M/R
> version he mentioned being much slower was a different version, deleted
> from this project? See my other email. The current version is similar to
> the one I just benchmarked, and that appeared to be about as fast as
> GraphLab (still not clear if the same amount of work is being compared,
> though).
Let's see what is being said just in this thread.

On Mar 11, 2013 1:39 PM, "Sebastian Schelter" <[email protected]> wrote:
>
> > Anyway, what I am
> > saying, isn't it more or less truthful to say that in pragmatic ways the ALS
> > stuff in Mahout is lagging for the very reason of Mahout being constrained
> > to MR?
>
> Definitely.

By how much is difficult to say; as you say, it is hard to be sure, and at the very least it seems we have entered contradicting statements (at least with the GraphLab paper).

Also, "Sebastian Schelter" <[email protected]> wrote:

> It's not a bad way per se, it's a repartition join. Side-loading means
> doing a broadcast join, which is more performant but only works as long
> as the broadcasted side fits into the receiver's memory. You're right, we
> removed the repartition variant.

That is, it works for ALS with caveats, as opposed to a simpler model with no such constraints, but we are not talking strictly about ALS here. There is a whole class of things, meaning there's still something in the future, unless we declare a perpetual code freeze. Sometimes relying on clever tricks is just not cost-efficient in the long run.

Also, I still have the impression, as I mentioned, that an adaptive version of the algorithm is not available, and that specifying lambda for ALS-WR is left to the operator's intuition. This is probably an even bigger issue than the performance per se, because it directly affects the quality of results, and the whole premise is that at that volume we cannot really confirm whether the regularization rate we chose was optimal. If an adaptive version is there, though, please let me know, as I would very much like to figure out how it managed to be efficient in a batch setting, to clue me in on similar things in my other work.

> I point it out in case this is underpinning many people's logic for
> rebuilding a bunch of stuff because it will be a *lot* faster. Surely some
> stuff can be done more naturally in a graph paradigm but not everything, or
> most?
I'm worried about the conclusion because of cases like this. I don't think I would undertake something that is already solved within requirements, so I don't think I would jump on ALS or anything else if any off-the-shelf solution fits the running time and hardware specs I am handed. In that sense I am totally technologically secular. Again, the concrete non-adaptive (or even adaptive) ALS case is not what motivates this for me today.

Suppose for a moment that Mahout were a commercial project with a lot of things in the roadmap, and we had to make strategic decisions about something we don't yet know. Even if we could demonstrate that some of the vital pieces we know about could, with some sweat and tears, be solved with a more constraining technology B as well as more naturally with its superset technology A, what would be the merit of choosing B, debatable maturity issues of either choice aside?

Next, wasn't it declared quite some time ago that the Mahout philosophy is about learning at scale, not about learning with Hadoop or MR etc.? As long as the underpinning structures and data formats are reused, there is no such exclusivity of any sort, as officially stated.

And finally, on the side of pragmatic project management, why artificially favor either choice if we only rely on non-commercial contributions? Why would we even want to oppose any diversification attempts on any ground, as long as we manage them incubator-style, along with established safe graduation policies to ensure chaos control? Viable things will find their use and adoption. (Well, maybe I am a little bit optimistic here. Nonviable tech seems to be thriving for years as well, just on the pitch alone.) If they don't find their way into Mahout, they will eventually flourish elsewhere (assuming their viability). Why miss any chance at striking gold?
Why say nay to even the ugliest of ducklings, as long as we can vote on its rank of ugliness or beauty, or check its popularity with the community via something like a GitHub star rating? Or even by counting mentions on the user list, for that matter? I assume our goal is to optimize for adoption rate within the stated problem spectrum. In that sense I like the GitHub model: it doesn't discourage anything, incoherent mumbling just dies on its own, and useful things get attention. Once things start getting consistent attention (or consistently not), any critical opinion on their goodness or ugliness becomes increasingly irrelevant.

Yes, we can have our personal intuitions, based on experience, about how things are going to work out. Yes, we can advise potential contributors by pointing out potential problems, best desired practices, or a perceived lack of new ground. But I will be the first to admit I don't know anything for sure. Over the years I figured a reality check is the only proven reward function. Just like evolution, the only way to evolve so far has been to keep pressing that "check" button.
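P.S. To make the "adaptive lambda" point above concrete, here is a minimal sketch of what automatic regularization selection could look like in principle: hold out some ratings, fit the model for each candidate lambda, and keep the lambda with the best held-out RMSE. The rank-1 ALS below is a toy stand-in, not Mahout's ALS-WR implementation, and all names here are hypothetical.

```python
# Toy sketch of "adaptive" lambda selection for ALS (NOT Mahout's code):
# search candidate lambdas and keep the one minimizing held-out RMSE.
import math

def als_rank1(train, n_users, n_items, lam, iters=50):
    """Alternating least squares for a rank-1 model r_ij ~ u[i] * v[j]."""
    u = [1.0] * n_users
    v = [1.0] * n_items
    by_user = [[] for _ in range(n_users)]
    by_item = [[] for _ in range(n_items)]
    for i, j, r in train:
        by_user[i].append((j, r))
        by_item[j].append((i, r))
    for _ in range(iters):
        # Closed-form ridge update per user factor, then per item factor.
        for i in range(n_users):
            num = sum(v[j] * r for j, r in by_user[i])
            den = sum(v[j] ** 2 for j, _ in by_user[i]) + lam
            u[i] = num / den
        for j in range(n_items):
            num = sum(u[i] * r for i, r in by_item[j])
            den = sum(u[i] ** 2 for i, _ in by_item[j]) + lam
            v[j] = num / den
    return u, v

def rmse(model, ratings):
    u, v = model
    se = sum((u[i] * v[j] - r) ** 2 for i, j, r in ratings)
    return math.sqrt(se / len(ratings))

def pick_lambda(train, held_out, n_users, n_items, grid):
    """Return the candidate lambda with the lowest held-out RMSE."""
    scored = [(rmse(als_rank1(train, n_users, n_items, lam), held_out), lam)
              for lam in grid]
    return min(scored)[1]
```

With lambda chosen this way, "reg rate left to operator's intuition" becomes an ordinary (if expensive) search problem; the same shape accommodates a bisection-style search when held-out error is roughly unimodal in lambda. The expensive part, of course, is that each candidate costs a full ALS run, which is exactly why doing it efficiently in a batch setting is the interesting question.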
