> As you point out, the problem (that I see) with Mahout is that it offers
> no reason to stay within the framework; for example, there is no
> consistent way of representing datasets, unlike say numpy or pandas.
> Even something like that - load some data, have it automatically passed
> into a consistent tabular format for regression/classification - would
> be very helpful.
+1. It would indeed be great if there was some consistency in the way the
classifiers handle their inputs. It could be SequenceFiles containing
Vectors, with a separation of concerns between the code preprocessing the
inputs (from Lucene indices, text files on the filesystem, computation of
weights, etc.) and the classifiers themselves.

Julien wrote:

> On Fri, Feb 15, 2013 at 1:06 AM, Marty Kube <
> [email protected]> wrote:
>
>> On 01/28/2013 02:33 PM, Ted Dunning wrote:
>>
>>> I think I was suggesting something weaker.
>>>
>>> I was suggesting that trees get built against a portion of the data
>>> and each node builds some number of trees against just the data it
>>> sees. This is in the spirit of random forests, but not the letter.
>>
>> I'm looking at the Mahout partial implementation. It appears to me that
>> some number of trees get built against the portion of the data each
>> node sees, based on the map-reduce input split.
>> Am I wrong here? If not, Mahout already has an out-of-core
>> implementation.
>
> Indeed, each mapper will grow a subset of the forest using only the data
> passed to it. Actually, this was based on a suggestion by Ted:
>
> "modify the original algorithm to build multiple trees for different
> portions of the data. That loses some of the solidity of the original
> method, but could actually do better if the splits exposed
> non-stationary behavior."
>
>> BTW - where did the name Partial Implementation come from? Is this
>> partially completed work :-)
>
> Ha ha ha, the name came from the fact that each mapper/node has a
> "partial" view of the dataset.
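To make the "partial" scheme concrete, here is a toy, Hadoop-free sketch of my understanding of it: each mapper grows its share of the forest from its own split alone, and the final forest is just the union of all the local trees, classifying by majority vote. All the names below are made up, and the "tree" grown here is a trivial majority-class stump; the point is the data flow, not the learner.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class PartialForestSketch {

    // A grown "tree" is just a classifier from feature vector to label.
    interface Tree { int classify(double[] features); }

    // Stand-in for one mapper: grow nTrees trees from this split only,
    // bootstrap-sampling within the split as the real mappers do. The
    // "tree" grown here is a trivial majority-class stump.
    static List<Tree> growPartial(List<double[]> split, List<Integer> labels,
                                  int nTrees, Random rng) {
        List<Tree> trees = new ArrayList<>();
        for (int t = 0; t < nTrees; t++) {
            int[] counts = new int[2];
            for (int i = 0; i < split.size(); i++) {
                counts[labels.get(rng.nextInt(split.size()))]++;
            }
            final int majority = counts[1] > counts[0] ? 1 : 0;
            trees.add(features -> majority);
        }
        return trees;
    }

    // The final forest is the concatenation of every mapper's trees;
    // classification is a majority vote over the whole forest.
    static int vote(List<Tree> forest, double[] features) {
        int ones = 0;
        for (Tree t : forest) ones += t.classify(features);
        return 2 * ones > forest.size() ? 1 : 0;
    }
}
```

A real version would grow proper decision trees and take its split from the map-reduce input split; concatenating every mapper's `List<Tree>` gives the forest.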
> But I do agree that it's partially completed :P I guess there was never
> enough user interest in Mahout's Decision Forest implementation to
> motivate me to keep working on it, but I do see more and more questions
> about this on the mailing lists, so maybe I will get motivated enough to
> work on it again.
>
> Some time ago I did some research on how to grow the trees on a cluster
> using all the data, when the dataset is too big to fit in memory. Two
> papers got my attention:
>
> *RainForest - A Framework for Fast Decision Tree Construction of Large
> Datasets [1]*
> This paper describes how to grow the trees without loading all the data
> in memory. Using this, you need to copy the dataset to all computing
> nodes.
>
> *PLANET: Massively Parallel Learning of Tree Ensembles with MapReduce
> [2]*
> Although it uses map-reduce to grow the trees, it uses lots of hacks and
> tricks to make it work.
>
> It shouldn't be too difficult to use [1] to improve the current
> implementation; that way the forest can be grown on the "full" large
> dataset on a computing cluster. One question though: do we really need
> Hadoop for this? I mean, as long as each node has access to the dataset
> files, this should be enough to build the forest on any number of nodes.
>
> [1] http://www.cs.cornell.edu/johannes/papers/1998/vldb1998-rainforest.pdf
> [2] http://dejanseo.com.au/research/google/36296.pdf

>>> Another option is that all the nodes start some trees based on their
>>> own data and spit the set of all such trees to a central (possibly
>>> distributed) store. They also grab some number of trees back down from
>>> the central store and try elaborating those trees using their own
>>> data. These will also be sent back to the central store. This is
>>> closer to random forests.
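If I read Ted's central-store option right, it boils down to something like the sketch below: workers seed a shared store with trees grown from their own data, then repeatedly pull random trees back down, elaborate them against their local slice, and push them back up. Everything here is invented for illustration; the store is an in-process list (a real one would be HDFS files, HBase, or similar) and "elaborating" a tree is faked as bumping a pass counter.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class TreeStoreSketch {

    // A tree plus a record of how many elaboration passes it received.
    static final class Tree {
        final int owner;   // worker that started the tree
        int passes = 0;    // local-data passes applied so far
        Tree(int owner) { this.owner = owner; }
    }

    // The "central (possibly distributed) store"; in-process list here.
    private final List<Tree> store =
            Collections.synchronizedList(new ArrayList<>());

    // Phase 1: a worker seeds the store with trees from its own slice.
    void seed(int worker, int nTrees) {
        for (int i = 0; i < nTrees; i++) store.add(new Tree(worker));
    }

    // Phase 2: a worker pulls k random trees, elaborates each against
    // its local data (faked here as a counter bump) and pushes it back
    // (the trees are shared objects, so mutating them is the "push").
    void elaborate(int k, Random rng) {
        synchronized (store) {
            for (int i = 0; i < k; i++) {
                store.get(rng.nextInt(store.size())).passes++;
            }
        }
    }

    int size() { return store.size(); }

    int totalPasses() {
        synchronized (store) {
            int sum = 0;
            for (Tree t : store) sum += t.passes;
            return sum;
        }
    }
}
```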
>>> A third option is like option two, but the trees that get pulled down
>>> and spit back up collect statistics until enough nodes report such
>>> stats, at which point a new branch is added and the stats are reset.
>>> This is very close to real random forests. It should also be efficient
>>> because each node is passing many trees across its own data in each
>>> pass. If that data fits in memory (even in file cache) then the pass
>>> will be very fast. If it doesn't fit in memory, then it will run as
>>> fast as the disk will let it run. Allowing a tree to commit with only
>>> 90% coverage should allow good robustness against laggards while doing
>>> almost exactly the same thing as the real random forest algorithm.
>>>
>>> On Mon, Jan 28, 2013 at 8:56 AM, Andy Twigg <[email protected]>
>>> wrote:
>>>
>>>> Ted,
>>>>
>>>> Sorry, I don't understand. Are you suggesting that a single decision
>>>> tree can be built efficiently in a distributed fashion?
>>>>
>>>> The following two ways seem like the naive ways of doing this:
>>>> 1) each machine constructs one node of the tree, by scanning all the
>>>> data, filtering the records that don't reach its node (this is
>>>> expensive), computing the split and then writing the split out. This
>>>> requires a pass through the data for each node of the tree.
>>>> 2) as 1), except that each machine writes out the filtered data after
>>>> its node in the tree. This requires less scanning (as much as the
>>>> local case I described earlier), but lots of data movement.
>>>>
>>>> Do you have an alternative method?
>>>>
>>>> Andy
>>>>
>>>> On 28 January 2013 16:42, Ted Dunning <[email protected]> wrote:
>>>>
>>>>> IF we have a step which permutes the data (once) then I doubt that
>>>>> redistribution is necessary.
>>>>> At that point the randomness consists of building trees based on
>>>>> different variable subsets and data subsets. The original random
>>>>> forests only split on variable subsets. How much this matters is an
>>>>> open question.
>>>>>
>>>>> On Mon, Jan 28, 2013 at 2:36 AM, Sean Owen <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> It sounds OK to me. Yes, I don't think you want to try to
>>>>>> parallelize the building of each tree just to build it off all the
>>>>>> data. It will be too slow.
>>>>>>
>>>>>> I imagine the game is to parallelize such that each worker has
>>>>>> 1/Nth of the data, where N is as low as you can manage. Each worker
>>>>>> is building one or more trees from its slice. Then iterate, and
>>>>>> split up the data differently. Each worker can now cross-validate
>>>>>> previous trees based on the new slice, or even build new ones, and
>>>>>> iterate again.
>>>>>>
>>>>>> I'm guessing this process will tend to need to build more trees
>>>>>> than usual. I think it's also possible that it will want to try to
>>>>>> discard trees with high out-of-bag error.
>>>>>>
>>>>>> Just riffing. I think this is some potential green-field territory
>>>>>> for really thinking about how you build trees when the data doesn't
>>>>>> nearly fit in memory, and what shape of computation gives a really
>>>>>> nice tradeoff between speed and accuracy. I don't think the
>>>>>> conventional approach hits that point; it's more like the above.
>>>>>>
>>>>>> On Mon, Jan 28, 2013 at 10:14 AM, Andy Twigg <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> Here's my thought for what might work well. Comments very welcome.
>>>>>>>
>>>>>>> We construct k decision trees (where k might be 20-100). Each tree
>>>>>>> has depth roughly 10-30.
>>>>>>> Rather than construct each tree in a distributed fashion,
>>>>>>> construct each tree locally by using multiple passes over the
>>>>>>> bagged sample for that tree (construct it in BFS manner, assuming
>>>>>>> the tree can fit in memory). Do this in parallel k times. I know
>>>>>>> this is perhaps not the "mapreduce" way, but I'd be interested to
>>>>>>> hear alternatives. It seems a bad idea to shuffle the data around
>>>>>>> on every BFS iteration per tree.
>>>>>>>
>>>>>>> I'm not planning to work on this right now, but would like to
>>>>>>> figure out if Mahout is a good platform to work with if I want
>>>>>>> such an algorithm.
>>>>>>>
>>>>>>> Andy
>>>>
>>>> --
>>>> Dr Andy Twigg
>>>> Junior Research Fellow, St Johns College, Oxford
>>>> Room 351, Department of Computer Science
>>>> http://www.cs.ox.ac.uk/people/andy.twigg/
>>>> [email protected] | +447799647538

--
Open Source Solutions for Text Engineering
http://digitalpebble.blogspot.com/
http://www.digitalpebble.com
http://twitter.com/digitalpebble
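P.S. Sean's suggestion of discarding trees with high out-of-bag error could be as simple as the sketch below (hypothetical names; the "out-of-bag" error here is plain held-out error on a slice the tree never saw, which is exactly what the iterate-over-new-slices scheme provides):

```java
import java.util.ArrayList;
import java.util.List;

public class OobPruneSketch {

    // A grown tree is just a classifier from feature vector to label.
    interface Tree { int classify(double[] features); }

    // Fraction of the held-out slice that the tree misclassifies.
    static double heldOutError(Tree t, List<double[]> slice,
                               List<Integer> labels) {
        int wrong = 0;
        for (int i = 0; i < slice.size(); i++) {
            if (t.classify(slice.get(i)) != labels.get(i)) wrong++;
        }
        return (double) wrong / slice.size();
    }

    // Keep only the trees whose held-out error is at most maxError.
    static List<Tree> prune(List<Tree> forest, List<double[]> slice,
                            List<Integer> labels, double maxError) {
        List<Tree> kept = new ArrayList<>();
        for (Tree t : forest) {
            if (heldOutError(t, slice, labels) <= maxError) kept.add(t);
        }
        return kept;
    }
}
```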
