Hey Andy,

What is the use case that is driving your question?
Are you looking at the training phase? I didn't realise that one needed to keep the data in memory there. I have a use case where even keeping the trees, much less the data, in memory during classification is an issue.

Ted, do you have some references on the algorithms below? I've done some work on using mmap trees during classification. I'd be happy to contribute along those lines.
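
For concreteness, here's a toy sketch of the mmap-tree idea (my own illustrative layout, not Mahout's serialisation format): the tree is flattened into fixed-size node records, and classification walks the memory-mapped file directly, so the OS pages nodes in on demand instead of the whole forest living on the heap.

```python
# Toy sketch of mmap'd tree classification. The 16-byte node layout
# (feature, threshold, left, right) is illustrative, not Mahout's format.
import mmap
import struct

NODE = struct.Struct("<ifii")  # feature index, threshold, left child, right child
LEAF = -1                      # feature == -1 marks a leaf; threshold holds the label

def write_tree(path, nodes):
    """Serialise a tree as a flat array of fixed-size node records."""
    with open(path, "wb") as f:
        for node in nodes:
            f.write(NODE.pack(*node))

def classify(mm, x):
    """Walk the memory-mapped tree; only the nodes touched are paged in."""
    i = 0
    while True:
        feat, thresh, left, right = NODE.unpack_from(mm, i * NODE.size)
        if feat == LEAF:
            return int(thresh)  # leaf: the threshold slot stores the class label
        i = left if x[feat] < thresh else right
```

With a root record `(0, 0.5, 1, 2)` and leaves at indices 1 and 2, `classify` returns one label for `x[0] < 0.5` and the other otherwise; a whole forest could be concatenated into one file with per-tree offsets.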

On 01/25/2013 04:52 PM, Ted Dunning wrote:
Hey Andy,

There are no plans for this.  You are correct that multiple passes aren't
too difficult, but they do go against the standard map-reduce paradigm a
bit if you want to avoid iterative map-reduce.

It definitely would be nice to have a really competitive random forest
implementation that uses the global accumulator style plus long-lived
mappers.  The basic idea would be to use the same sort of tricks that
Vowpal Wabbit or Giraph use to get a bunch of long-lived mappers and then
have them asynchronously talk to a tree repository.
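
Not a Mahout patch, but the shape of that pattern in miniature (all names here are illustrative): long-lived workers stand in for the mappers, each keeps its data shard across trees, and finished trees are handed off asynchronously to a shared repository through a queue, rather than re-launching a job per pass.

```python
# Minimal sketch of long-lived workers feeding a shared tree repository.
# Tree induction is a placeholder; the point is the hand-off pattern.
import queue
import threading

def worker(shard, repo_queue, n_trees):
    """A long-lived 'mapper': keeps its shard, emits trees as they finish."""
    for t in range(n_trees):
        tree = ("tree", shard, t)   # placeholder for real tree induction
        repo_queue.put(tree)        # asynchronous hand-off to the repository

def run(shards, n_trees_per_worker):
    """Start one long-lived worker per shard; collect trees afterwards.
    (A real repository would consume concurrently; draining after join
    keeps the sketch short.)"""
    repo_queue, forest = queue.Queue(), []
    threads = [threading.Thread(target=worker, args=(s, repo_queue, n_trees_per_worker))
               for s in shards]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    while not repo_queue.empty():
        forest.append(repo_queue.get())
    return forest
```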

On Fri, Jan 25, 2013 at 6:58 PM, Andy Twigg <[email protected]> wrote:

Hi,

I'm new to this list, so I apologise if this is covered elsewhere (but
I couldn't find it).

I'm looking at the Random Forests implementations, both mapreduce
("partial") and non-distributed. Both appear to require the data to be
loaded into memory. Random forests should be straightforward to
construct with multiple passes through the data without storing the
data in memory. Is there such an implementation in Mahout? If not, is
there a ticket/plan?

Thanks,
Andy


--
Dr Andy Twigg
Junior Research Fellow, St Johns College, Oxford
Room 351, Department of Computer Science
http://www.cs.ox.ac.uk/people/andy.twigg/
[email protected] | +447799647538

