I'm concerned that the diff (see https://github.com/andytwigg/mahout) is now becoming quite large against trunk, and I don't want the first patch to be a scary one. I put some refactoring effort into separating the existing in-memory and the new streaming implementations, while trying to retain a shared interface (e.g. for tree building, bagging, data loading, classifying, etc.). This is mostly fine, but it does mean there are some significant '---'s in the diff. I don't have a lot more time to spend on this right now, but should I try to pull a patch out early, and if so, should it be a refactoring patch or one contributing the implementation?
re iteration: it does feel like Mahout is perhaps ready for a spring clean. Does anyone use the existing DF classifier in production?

Andy

On 8 March 2013 13:39, Sebastian Schelter <[email protected]> wrote:
> Well, this is certainly possible and is an approach that is used in our
> ALS code. But the startup latency and the need to rescan
> iteration-invariant data typically induce an overhead of an order of
> magnitude compared to approaches specialized for distributed iterations.
>
> Best,
> Sebastian
>
> On 08.03.2013 14:36, Marty Kube wrote:
>> What about using one map-reduce job per iteration? The models you load
>> into the distributed cache are the model from the last round, and the
>> reducer can emit the expanded model. We are presumably working with
>> large data sets, so I would not expect start-up latency to be an issue.
>>
>> On 03/07/2013 04:56 PM, Ted Dunning wrote:
>>> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
>>>
>>>> ... Right now what we have is a single-machine procedure for
>>>> scanning through some data, building a set of histograms, combining
>>>> histograms and then expanding the tree. The next step is to decide
>>>> the best way to distribute this. I'm not an expert here, so any
>>>> advice or help here is welcome.
>>>>
>>> That sounds good so far.
>>>
>>>> I think the easiest approach would be to use the mappers to
>>>> construct the set of histograms, and then send all histograms for a
>>>> given leaf to a reducer, which decides how to expand that leaf. The
>>>> code I have can almost be ported as-is to a mapper and reducer in
>>>> this way. Would using the distributed cache to send the updated tree
>>>> be wise, or is there a better way?
>>>>
>>> The distributed cache is a very limited thing. You can only put
>>> things in at program launch, and they must remain constant throughout
>>> the program's run.
>>>
>>> The problem here is that iterated map-reduce is pretty heinously
>>> inefficient.
>>>
>>> The best candidate approaches for avoiding that are to use a BSP sort
>>> of model (see the Pregel paper at
>>> http://kowshik.github.com/JPregel/pregel_paper.pdf) or to use an
>>> unsynchronized model-update cycle, the way that Vowpal Wabbit does
>>> with all-reduce or the way that Google's deep learning system does.
>>>
>>> Running these approaches on Hadoop without YARN or Mesos requires a
>>> slight perversion of the map-reduce paradigm, but is quite doable.

--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
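[Editor's note: the mapper/reducer scheme Andy describes above — mappers build per-leaf histograms, a reducer merges the histograms for each leaf and decides how to expand it — can be sketched in plain Java without Hadoop. This is not Mahout code; the `Histogram` class, the fixed-width binning, and all names are illustrative assumptions, not the thread's actual implementation.]

```java
/**
 * Sketch of the per-leaf histogram idea from the thread. Each "mapper"
 * builds a fixed-width histogram of a feature's values for instances
 * reaching a leaf; the "reducer" merges histograms keyed by leaf id
 * bin-wise before choosing a split. Binning strategy is illustrative only.
 */
class Histogram {
    final double min, width;
    final long[] counts;

    Histogram(double min, double max, int bins) {
        this.min = min;
        this.width = (max - min) / bins;
        this.counts = new long[bins];
    }

    void add(double value) {
        int bin = (int) ((value - min) / width);
        if (bin < 0) bin = 0;                       // clamp underflow
        if (bin >= counts.length) bin = counts.length - 1; // clamp overflow
        counts[bin]++;
    }

    /** Reducer-side combine: histograms with identical binning merge bin-wise. */
    void merge(Histogram other) {
        for (int i = 0; i < counts.length; i++) {
            counts[i] += other.counts[i];
        }
    }

    long total() {
        long t = 0;
        for (long c : counts) t += c;
        return t;
    }
}

public class HistogramSketch {
    public static void main(String[] args) {
        // Two "mappers" see different shards of the same leaf's data.
        Histogram h1 = new Histogram(0.0, 10.0, 5);
        Histogram h2 = new Histogram(0.0, 10.0, 5);
        for (double v : new double[]{1.0, 2.5, 7.0}) h1.add(v);
        for (double v : new double[]{3.0, 9.9}) h2.add(v);

        // "Reducer": merge all histograms emitted under the same leaf key.
        h1.merge(h2);
        System.out.println(h1.total()); // 5
    }
}
```

The merge is associative and commutative, which is what makes it safe to combine partial histograms in any order (e.g. in a Hadoop combiner before the reducer).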
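[Editor's note: the BSP alternative Ted mentions — long-lived workers iterating over barrier-synchronized supersteps rather than relaunching a map-reduce job per iteration — can be illustrated with a minimal threads-and-barrier sketch. This is not Pregel or Mahout code; the shared-counter "model" and all names are illustrative assumptions.]

```java
import java.util.concurrent.BrokenBarrierException;
import java.util.concurrent.CyclicBarrier;
import java.util.concurrent.atomic.AtomicLongArray;

/**
 * Minimal BSP-style sketch: workers stay alive across supersteps and
 * synchronize at a barrier, avoiding per-iteration job launch overhead.
 * The shared model is just a counter; a real system would merge
 * histograms and expand the tree at each superstep boundary.
 */
public class BspSketch {
    /** Runs `workers` threads for `supersteps` barrier-synchronized rounds. */
    static long run(int workers, int supersteps) throws InterruptedException {
        final AtomicLongArray model = new AtomicLongArray(1);
        final CyclicBarrier barrier = new CyclicBarrier(workers); // superstep boundary
        Thread[] threads = new Thread[workers];
        for (int w = 0; w < workers; w++) {
            threads[w] = new Thread(() -> {
                try {
                    for (int s = 0; s < supersteps; s++) {
                        model.incrementAndGet(0); // local contribution this round
                        barrier.await();          // wait for all workers, then next round
                    }
                } catch (InterruptedException | BrokenBarrierException e) {
                    Thread.currentThread().interrupt();
                }
            });
            threads[w].start();
        }
        for (Thread t : threads) t.join();
        return model.get(0);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(run(4, 3)); // 12: 4 workers x 3 supersteps, no job relaunch
    }
}
```

The point of the sketch is structural: state survives across iterations in the workers themselves, which is exactly what iterated map-reduce (with its per-job startup and rescan of iteration-invariant data) cannot offer.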
