On Sat, Mar 26, 2011 at 4:29 PM, Daniel McEnnis <[email protected]> wrote:
> Ted,
>
> Please keep in mind, I just downloaded the Mahout code today.  My
> knowledge is from a single presentation and Cloudera's Hadoop
> tutorial.

That shouldn't be a problem.  There are many who can help you state
things concisely in fashionable terminology.

> My goal is to have two stages - training takes a sequence
> of vectors and classifications and creates a large hdfs file of the
> form vector-classification.

"Training" is a big word here; "reformatting" might be as accurate,
since no transformation to speak of occurs.

> This file is streamed to each node
> classifying an incoming set of vectors.

A key here is that the training data are streamed to each node.

In map-reduce, there is typically an asymmetry between the function
represented by the map and the sequence of values to which the map
function is applied.  Often, the map function is parametrized by some
second kind of data that is distinct from the sequence of values
presented to it.  In some computations there are two kinds of input,
and we must decide which represents parameters to the map function and
which represents streaming input.  Commonly, the larger of the two is
best streamed and the smaller best serves as parameters.  It is also
helpful to consider which of the two kinds of input can be split most
easily.

In your case, the larger will commonly be the input (training) data.
Also, both kinds can be split: the test cases split trivially, and the
training data can be split if you allow the k best matches from each
split to be merged later in a reduce function.

> Each vector is compared
> against the vector to be classified and the table of k best matches is
> created from this.  Majority wins, resulting in key-classification or
> classification-key output.  With streaming of the training file, only
> k+2 vectors are needed in memory, achieving O(1) memory use and
> embarrassingly parallel execution.

Almost true.  Each mapper only needs the k best elements so far, plus
the latest training example, plus the vector being classified.  But two
things increase this slightly:

- First, since each mapper only sees a subset of all input, the output
  from the mappers must be combined.  This typically requires 2k
  vectors in the reducer.

- Secondly, since it is very expensive to run an entire map-reduce
  program, it is nice to classify many vectors at once.  If we classify
  p vectors in a single execution, then we need p(k+1) + 1 vectors in
  the mapper and 2k vectors in the reducer.

If p is very large, then we might like to read batches of training
vectors and iterate explicitly through batches of test vectors in the
mapper.  The iteration can also be moved outside the map-reduce program
entirely, so that we run the program several times with p small enough
in each run for memory to suffice.
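
To make the bookkeeping concrete, here is a rough sketch of what such a
mapper might look like.  This is not Mahout code: the class name, the
record formats, and the way the test vectors reach the mapper (via the
job configuration here; the distributed cache would be more realistic)
are all invented for illustration.  It keeps one size-k heap per test
vector, so exactly p(k+1) + 1 vectors are resident at once, as counted
above.

import java.io.IOException;
import java.util.PriorityQueue;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical sketch, not Mahout code.  Input records are assumed to
// be lines of the form "label TAB comma-separated-coordinates".  The
// training data stream through map(); the p test vectors are held as
// parameters, with one size-k heap of best matches per test vector.
public class KnnMapper extends Mapper<LongWritable, Text, Text, Text> {
  private int k;
  private double[][] test;              // p test vectors, loaded in setup()
  private PriorityQueue<Match>[] best;  // k best training matches per test vector

  // A candidate match, ordered worst-first so peek() is the evictee.
  private static class Match implements Comparable<Match> {
    final double d;
    final String label;
    Match(double d, String label) { this.d = d; this.label = label; }
    @Override
    public int compareTo(Match o) { return Double.compare(o.d, d); }
  }

  @Override
  @SuppressWarnings("unchecked")
  protected void setup(Context context) {
    k = context.getConfiguration().getInt("knn.k", 10);
    // Assumption: a small batch of test vectors is shipped in the job
    // configuration as "v1,v2,...;v1,v2,...".  A real job would more
    // likely use the distributed cache.
    String[] rows = context.getConfiguration().get("knn.test.vectors").split(";");
    test = new double[rows.length][];
    best = new PriorityQueue[rows.length];
    for (int i = 0; i < rows.length; i++) {
      test[i] = parse(rows[i]);
      best[i] = new PriorityQueue<Match>(k + 1);
    }
  }

  @Override
  protected void map(LongWritable offset, Text line, Context context) {
    String[] parts = line.toString().split("\t");
    double[] train = parse(parts[1]);
    for (int i = 0; i < test.length; i++) {
      double d = euclidean(test[i], train);
      if (best[i].size() < k) {
        best[i].add(new Match(d, parts[0]));
      } else if (d < best[i].peek().d) {  // better than the current worst
        best[i].poll();
        best[i].add(new Match(d, parts[0]));
      }
    }
  }

  @Override
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // Emit each partial k-best list keyed by test vector so that the
    // reducer can merge the lists from all splits.
    for (int i = 0; i < test.length; i++) {
      for (Match m : best[i]) {
        context.write(new Text("test-" + i), new Text(m.d + ":" + m.label));
      }
    }
  }

  private static double[] parse(String csv) {
    String[] s = csv.split(",");
    double[] v = new double[s.length];
    for (int i = 0; i < v.length; i++) {
      v[i] = Double.parseDouble(s[i]);
    }
    return v;
  }

  private static double euclidean(double[] a, double[] b) {
    double sum = 0;
    for (int i = 0; i < a.length; i++) {
      double diff = a[i] - b[i];
      sum += diff * diff;
    }
    return Math.sqrt(sum);
  }
}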

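And the matching reduce side, equally hypothetical.  Merging the
per-split lists incrementally means the reducer never holds more than
k+1 candidates at a time for any one test vector, comfortably within
the 2k figure above, and a majority vote over the surviving k labels
produces the final classification.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Companion sketch to the mapper above.  Each call merges the partial
// k-best lists for one test vector and emits the majority label.
public class KnnReducer extends Reducer<Text, Text, Text, Text> {
  @Override
  protected void reduce(Text testId, Iterable<Text> candidates, Context context)
      throws IOException, InterruptedException {
    int k = context.getConfiguration().getInt("knn.k", 10);
    // Worst-first heap of "distance:label" pairs, as emitted by the mapper.
    PriorityQueue<Match> best = new PriorityQueue<Match>(k + 1);
    for (Text c : candidates) {
      String[] parts = c.toString().split(":");
      Match m = new Match(Double.parseDouble(parts[0]), parts[1]);
      if (best.size() < k) {
        best.add(m);
      } else if (m.d < best.peek().d) {
        best.poll();
        best.add(m);
      }
    }
    // Majority vote over the surviving k labels.
    Map<String, Integer> votes = new HashMap<String, Integer>();
    String winner = null;
    int max = 0;
    for (Match m : best) {
      Integer n = votes.get(m.label);
      int count = (n == null) ? 1 : n + 1;
      votes.put(m.label, count);
      if (count > max) {
        max = count;
        winner = m.label;
      }
    }
    context.write(testId, new Text(winner));
  }

  private static class Match implements Comparable<Match> {
    final double d;
    final String label;
    Match(double d, String label) { this.d = d; this.label = label; }
    @Override
    public int compareTo(Match o) { return Double.compare(o.d, d); }
  }
}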