I'm concerned that the diff (see https://github.com/andytwigg/mahout)
is now becoming quite large against trunk, and I don't want the first
patch to be a scary one. I put some refactoring effort into separating
the existing in-memory and the new streaming implementations, while
trying to retain a shared interface (e.g. for tree building, bagging,
data loading, classifying, etc.). This is mostly fine, but it does mean
the diff contains some significant runs of '---' (deleted) lines. I
don't have a lot more time to spend on this right now, but should I try
to pull a patch out early, and if so, should it be a refactoring patch
or one contributing the implementation?

Re iteration: it does feel like Mahout is perhaps ready for a spring
clean. Does anyone use the existing DF classifier in production?
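For concreteness, the histogram step under discussion in the quoted thread below (mappers build per-leaf class histograms over their data shards, a reducer merges them and scores splits) could be sketched roughly as follows. This is a minimal, illustrative sketch, not Mahout's API; all names and the fixed-bin scheme are assumptions:

```python
# Hypothetical sketch: fixed-bin class histograms for one feature at one
# leaf. Because merging is just bin-wise addition of counts, partial
# histograms from mappers can be combined associatively in a reducer.
from collections import defaultdict
import math

NUM_BINS = 8  # illustrative; a real implementation would tune this

def build_histogram(values_and_labels, lo, hi):
    """Build a histogram of class counts per value bin (one mapper's shard)."""
    hist = defaultdict(lambda: defaultdict(int))  # bin -> label -> count
    width = (hi - lo) / NUM_BINS
    for value, label in values_and_labels:
        b = min(int((value - lo) / width), NUM_BINS - 1)
        hist[b][label] += 1
    return hist

def merge_histograms(h1, h2):
    """Merge two partial histograms by adding counts bin-by-bin."""
    out = defaultdict(lambda: defaultdict(int))
    for h in (h1, h2):
        for b, labels in h.items():
            for label, c in labels.items():
                out[b][label] += c
    return out

def entropy(label_counts):
    """Entropy of a label-count dict; the reducer would use this (or Gini)
    to score candidate split points on the merged histogram."""
    n = sum(label_counts.values())
    return -sum((c / n) * math.log2(c / n) for c in label_counts.values() if c)
```

Since merging commutes and associates, the histograms could equally be combined in a Hadoop combiner before the reduce.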

Andy



On 8 March 2013 13:39, Sebastian Schelter <[email protected]> wrote:
> Well, this is certainly possible and is an approach that is used in our
> ALS code. But the startup latency and the need to rescan
> iteration-invariant data typically induce an overhead of an order of
> magnitude compared to approaches specialized for distributed
> iterations.
>
> Best,
> Sebastian
>
> On 08.03.2013 14:36, Marty Kube wrote:
>> What about using one map-reduce job per iteration?  The model you load
>> into the distributed cache is the model from the last round, and the
>> reducer can emit the expanded model.  We are presumably working with
>> large data sets, so I would not expect start-up latency to be an issue.
>>
>> On 03/07/2013 04:56 PM, Ted Dunning wrote:
>>> On Thu, Mar 7, 2013 at 6:25 AM, Andy Twigg <[email protected]> wrote:
>>>
>>>> ... Right now what we have is a
>>>> single-machine procedure for scanning through some data, building a
>>>> set of histograms, combining histograms and then expanding the tree.
>>>> The next step is to decide the best way to distribute this. I'm not an
>>>> expert here, so any advice or help here is welcome.
>>>>
>>> That sounds good so far.
>>>
>>>
>>>> I think the easiest approach would be to use the mappers to construct
>>>> the set of histograms, and then send all histograms for a given leaf
>>>> to a reducer, which decides how to expand that leaf. The code I have
>>>> can almost be ported as-is to a mapper and reducer in this way.
>>>> Would using the distributed cache to send the updated tree be wise, or
>>>> is there a better way?
>>>>
>>> Distributed cache is a very limited thing.  You can only put things in at
>>> program launch and they must remain constant throughout the program's
>>> run.
>>>
>>> The problem here is that iterated map-reduce is pretty heinously
>>> inefficient.
>>>
>>> The best candidate approaches for avoiding that are to use a BSP sort of
>>> model (see the Pregel paper at
>>> http://kowshik.github.com/JPregel/pregel_paper.pdf ) or use an
>>> unsynchronized model update cycle the way that Vowpal Wabbit does with
>>> all-reduce or the way that Google's deep learning system does.
>>>
>>> Running these approaches on Hadoop without Yarn or Mesos requires a
>>> slight perversion of the map-reduce paradigm, but is quite doable.
>>>
>>
>



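For reference, the "one map-reduce job per iteration" pattern Marty describes above could be simulated outside Hadoop like this. Everything here is an illustrative stand-in (the toy `StumpModel` and function names are hypothetical, not Mahout or Hadoop APIs):

```python
# Illustrative simulation of one iteration: mappers route each record
# through the current tree to a leaf and emit (leaf_id, label) pairs;
# the shuffle groups by leaf; the reducer aggregates per-leaf counts.
# The model passed in plays the role of the one read from the
# distributed cache at the start of each round.
from collections import defaultdict

class StumpModel:
    """Toy one-split tree standing in for the model from the last round."""
    def __init__(self, threshold):
        self.threshold = threshold

    def leaf_for(self, x):
        return "left" if x < self.threshold else "right"

def map_phase(model, shard):
    """Mapper: emit (leaf_id, label) for each record in this shard."""
    for x, y in shard:
        yield model.leaf_for(x), y

def run_iteration(model, shards):
    """One simulated map-reduce round; a real reducer would score splits
    on the aggregated counts and emit the expanded leaves as the new model."""
    grouped = defaultdict(lambda: defaultdict(int))
    for shard in shards:                      # mappers run in parallel
        for leaf, label in map_phase(model, shard):
            grouped[leaf][label] += 1         # shuffle groups by leaf id
    return {leaf: dict(counts) for leaf, counts in grouped.items()}
```

This also makes Sebastian's point concrete: every round re-reads all the shards, even though only the per-leaf routing changes between iterations.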
--
Dr Andy Twigg
Junior Research Fellow, St John's College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538
