2010/1/14 Robin Anil <[email protected]>:
> Some issues I am encountering.
>
> I use a chunk of the dictionary on every map/reduce pass to create partial
> vectors.
>
>
>   - If I do the vectorization in the reducer, a lot of data (the entire
>   dataset) gets thrown around the network during the shuffle.
>   - If I do the vectorization in the mapper, the input split size is capped
>   at 64/128MB, so it's not efficient to read a 300MB distributed cache on
>   setup of every mapper. If I decrease the chunk size, that would cause too
>   many map/reduce passes for partial vectorization.

Have you tried setting the Reducer as a Combiner too?
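
To illustrate what I mean, here is a rough sketch in plain Python (not Hadoop
code) of how a combiner could shrink the shuffle. It assumes the mappers emit
several (key, partial vector) records per document and that merging partial
vectors is a plain sum, which is associative and commutative; whether that
holds for your job is an assumption on my part. The function names and the
sample data are made up for the example. In Hadoop itself the equivalent step
would be registering the reducer class through the job's setCombinerClass.

```python
from collections import defaultdict


def merge_partial_vectors(vectors):
    """Sum sparse partial vectors (dicts of feature index -> weight)."""
    merged = defaultdict(float)
    for vec in vectors:
        for index, weight in vec.items():
            merged[index] += weight
    return dict(merged)


def group_by_key(records):
    """Group (key, vector) records by key, as the framework would."""
    grouped = defaultdict(list)
    for key, vec in records:
        grouped[key].append(vec)
    return grouped


def combine(map_output):
    """Map-side merge: runs on one mapper's output before the shuffle."""
    return [(key, merge_partial_vectors(vecs))
            for key, vecs in group_by_key(map_output).items()]


def reduce_all(shuffled):
    """Reduce-side merge: the same summation, applied after the shuffle."""
    return {key: merge_partial_vectors(vecs)
            for key, vecs in group_by_key(shuffled).items()}


# Hypothetical output of two mappers (illustrative data only).
mapper_outputs = [
    [("doc1", {0: 1.0}), ("doc1", {3: 2.0}), ("doc2", {1: 1.0}), ("doc1", {0: 1.0})],
    [("doc2", {1: 1.0}), ("doc2", {4: 0.5})],
]

# Records crossing the network without and with a combiner.
without_combiner = [rec for out in mapper_outputs for rec in out]
with_combiner = [rec for out in mapper_outputs for rec in combine(out)]

print(len(without_combiner), "records shuffled without a combiner")
print(len(with_combiner), "records shuffled with a combiner")
```

The final vectors are the same either way; only the number of records sent
through the shuffle changes. If each mapper emits a single record per key, the
combiner has nothing to merge and will not help.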

-- 
Olivier
http://twitter.com/ogrisel - http://code.oliviergrisel.name
