Ted,

Please keep in mind that I only downloaded the Mahout code today; my knowledge comes from a single presentation and Cloudera's Hadoop tutorial.

My goal is to have two stages. Training takes a sequence of vectors and classifications and creates a large HDFS file of vector-classification pairs. This file is then streamed to each node that is classifying an incoming set of vectors. Each training vector is compared against the vector to be classified, and a table of the k best matches is built from these comparisons. Majority vote among the k matches wins, resulting in key-classification or classification-key output.

Because the training file is streamed, only k+2 vectors need to be held in memory at any time, giving O(1) memory use (constant in the size of the training set) and embarrassingly parallel execution.
Daniel

On Sat, Mar 26, 2011 at 5:25 PM, Ted Dunning <[email protected]> wrote:
> Daniel,
>
> This is a bit confusing.
>
> How do you do KNN with O(1) memory? Sampling?
> Or does each mapper find n nearest neighbors in a slice of the training data
> and then pass that on to the reducer, which keeps the k best from all the
> mappers for a particular input?
>
> Also, what is the standard input to the mapper? The training data, with the
> instances to classify on the side to be read by all mappers?
>
> On Sat, Mar 26, 2011 at 1:40 PM, Daniel McEnnis <[email protected]> wrote:
>>
>> Josh,
>>
>> The initial plan is to keep it quite simple. No ball trees, no
>> enhancements. Ball trees are likely to require each node to have the
>> ground truth in memory - too high a memory requirement. The goal is
>> a simple KNN that uses O(c) memory in a map stage that assigns the
>> class. Not very interesting.
>>
>> Daniel.
>>
>> On Sat, Mar 26, 2011 at 4:23 PM, Josh Patterson <[email protected]> wrote:
>> > What kind of approach would you use? I've done one of these before
>> > with a ball tree, which was effective. I'd be interested in working on
>> > spatial trees in Mahout.
>> >
>> > Josh
>> >
>> > On Saturday, March 26, 2011, Daniel McEnnis <[email protected]> wrote:
>> >> Dear Mahout developers,
>> >>
>> >> While I'm learning the code, I thought I'd ask whether there was any
>> >> objection to me working on a KNN classifier module as my learning
>> >> project. I should be able to make this at worst O(n) space over the
>> >> training set and O(c) space over the input set using MapReduce. It's
>> >> something I'm quite familiar with and fills a gap in the classifier
>> >> portfolio.
>> >>
>> >> Sincerely,
>> >>
>> >> Daniel McEnnis.
>> >>
>> >
>> > --
>> > Twitter: @jpatanooga
>> > Solution Architect @ Cloudera
>> > hadoop: http://www.cloudera.com
>> > blog: http://jpatterson.floe.tv
>> >
>
>
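For what it's worth, here is a minimal sketch of the streaming k-best idea described above: training pairs arrive one at a time, only the k best matches are retained in a bounded heap, and a majority vote decides the class. This is plain Java with hypothetical names (StreamingKnn, offer, classify), not Mahout or Hadoop API; it just illustrates why memory stays bounded by k while the training file streams past.

```java
import java.util.*;

// Sketch of the streaming k-NN scheme: compare each streamed training
// vector against one query, keep only the k closest, majority-vote.
public class StreamingKnn {
    // One labelled training vector together with its distance to the query.
    record Match(String label, double distance) {}

    private final int k;
    // Max-heap on distance: the root is the worst of the current k best,
    // so it can be evicted cheaply when a closer vector streams in.
    private final PriorityQueue<Match> best;

    public StreamingKnn(int k) {
        this.k = k;
        this.best = new PriorityQueue<>(
                Comparator.comparingDouble(Match::distance).reversed());
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Called once per streamed training pair (vector, classification).
    // Memory never exceeds the k retained matches, regardless of how
    // many training vectors stream through.
    public void offer(double[] query, double[] trainVec, String label) {
        double d = distance(query, trainVec);
        if (best.size() < k) {
            best.add(new Match(label, d));
        } else if (d < best.peek().distance()) {
            best.poll();
            best.add(new Match(label, d));
        }
    }

    // Majority vote over the k retained matches.
    public String classify() {
        Map<String, Integer> votes = new HashMap<>();
        for (Match m : best) {
            votes.merge(m.label(), 1, Integer::sum);
        }
        return Collections.max(votes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }
}
```

In the MapReduce setting, the same bounded-heap logic would live in each mapper (or in a reducer merging per-slice candidates); the sketch deliberately leaves that wiring out.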
