2010/1/14 Robin Anil <[email protected]>:
> Some issues I am encountering.
>
> I use a chunk of the dictionary on every map/reduce pass to create partial
> vectors.
>
>    - If I do the vectorization in the reducer, a lot of data (the entire
>    dataset) gets thrown around the network during the shuffle.
>    - If I do the vectorization in the mapper, the input split size is capped
>    at 64/128MB, so it's not efficient to read a 300MB distributed cache on
>    setup of every mapper. If I decrease the chunk size, that would cause too
>    many map/reduce passes for partial vectorization.
Have you tried setting the Reducer as a Combiner too?

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
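
To illustrate the combiner suggestion: in Hadoop the combiner is registered with setCombinerClass (on JobConf in the old API, on Job in the new one), and it runs the reduce logic on each mapper's local output before the shuffle. That is only safe when the reduce operation is associative and commutative and its output type matches its input type, which should hold for summing partial vectors. Below is a minimal sketch in plain Java, with no Hadoop dependency, of what that pre-merge does to the number of records sent over the network. The class, the Record type, and the method names are hypothetical and exist only for this illustration; whether it helps in practice depends on how many records per key a single mapper emits.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only: simulates a combiner without using Hadoop.
class CombinerSketch {

    /** One (docId, partial vector) record as a mapper might emit it. */
    static final class Record {
        final String docId;
        final Map<Integer, Double> partial; // feature index -> weight

        Record(String docId, Map<Integer, Double> partial) {
            this.docId = docId;
            this.partial = partial;
        }
    }

    /** Reduce logic: merge partial vectors by summing weights per feature index. */
    static Map<Integer, Double> merge(List<Map<Integer, Double>> partials) {
        Map<Integer, Double> out = new HashMap<>();
        for (Map<Integer, Double> p : partials) {
            for (Map.Entry<Integer, Double> e : p.entrySet()) {
                out.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return out;
    }

    /** Combiner step: apply the same merge to one mapper's local output, grouped by key. */
    static List<Record> combine(List<Record> mapOutput) {
        // LinkedHashMap keeps the first-seen order of document ids.
        Map<String, List<Map<Integer, Double>>> byDoc = new LinkedHashMap<>();
        for (Record r : mapOutput) {
            byDoc.computeIfAbsent(r.docId, k -> new ArrayList<>()).add(r.partial);
        }
        List<Record> out = new ArrayList<>();
        for (Map.Entry<String, List<Map<Integer, Double>>> e : byDoc.entrySet()) {
            out.add(new Record(e.getKey(), merge(e.getValue())));
        }
        return out;
    }

    public static void main(String[] args) {
        // Output of a single mapper: two partial vectors for doc1, one for doc2.
        List<Record> mapOutput = new ArrayList<>();
        mapOutput.add(new Record("doc1", Map.of(0, 1.0)));
        mapOutput.add(new Record("doc1", Map.of(3, 2.0)));
        mapOutput.add(new Record("doc2", Map.of(5, 1.0)));

        List<Record> combined = combine(mapOutput);
        System.out.println("records shuffled without combiner: " + mapOutput.size());
        System.out.println("records shuffled with combiner:    " + combined.size());
    }
}
```

The reducer still sees one merged vector per document either way; the combiner only changes how many records leave each mapper.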
