Hi Sean,

In the Map/Reduce pass formed by RowWeightMapper and
WeightedOccurrencesPerColumnReducer, two things are accomplished:

 a) a measure-specific "weight" is computed for each row (when you use
the Tanimoto coefficient as the similarity measure, for example, the
"weight" of a row is its number of nonzero entries; see the sketch
below)
 b) the matrix is transposed to create an inverted index from columns to
row entries
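
To make a) concrete, here is a minimal sketch of such a weight
computation. This is not the actual Mahout code: the method name is
mine, and it uses a plain double[] instead of Mahout's Vector classes:

 // sketch only: for the Tanimoto coefficient, the weight of a row
 // is simply its number of nonzero entries
 static double tanimotoRowWeight(double[] row) {
   int numNonZeroEntries = 0;
   for (double entry : row) {
     if (entry != 0.0) {
       numNonZeroEntries++;
     }
   }
   return numNonZeroEntries;
 }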

Each resulting WeightedOccurrenceArray is an entry in that inverted
index: it contains all entries of a column vector of the original
matrix, with the weight of the corresponding row attached to each
element.
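
Conceptually, each element of such an array carries three things. As an
illustration (a sketch of the data only, not the exact Mahout class):

 // illustrative only: the data a WeightedOccurrence carries
 class WeightedOccurrence {
   int row;       // row index of the original matrix entry
   double value;  // the matrix entry itself
   double weight; // the precomputed weight of that row
 }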

Slide 13 of http://www.slideshare.net/sscdotopen/mahoutcf shows what's
going on in this M/R pass (using a simplified example, as the weight
computation is left out there).

In WeightedOccurrencesPerColumnReducer the Set is used to buffer all the
entries for a column, and we will see as many WeightedOccurrences as
there are nonzero entries in the corresponding column vector.

All of this is a bit tricky and hard to see from the code directly, so
maybe a little example helps:

Example matrix:

 1 - 3
 2 4 -

RowWeightMapper with Tanimoto:

 Weight of row (1,-,3) is 2 (two nonzero entries)
 Weight of row (2,4,-) is 2 (two nonzero entries)

 We map out the following column-(row,value,weight) tuples
(as VarIntWritable/WeightedOccurrence pairs):

 1 (1,1,2)
 3 (1,3,2)
 1 (2,2,2)
 2 (2,4,2)

WeightedOccurrencesPerColumnReducer now receives:

 1 (1,1,2),(2,2,2)
 2 (2,4,2) 
 3 (1,3,2)
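
If you want to replay this outside of Hadoop: the combined effect of the
map step and the shuffle is equivalent to the following sketch (plain
java.util collections instead of Writables, reusing the hypothetical
tanimotoRowWeight from above; note that rows and columns are 0-based
here while the example counts from 1):

 // sketch: build the inverted index in memory, 0 meaning "no entry";
 // each double[] tuple is (row, value, weight) as in the example above
 // (uses java.util.Map/HashMap/List/ArrayList)
 double[][] matrix = { { 1, 0, 3 }, { 2, 4, 0 } };
 Map<Integer,List<double[]>> invertedIndex =
     new HashMap<Integer,List<double[]>>();
 for (int row = 0; row < matrix.length; row++) {
   double weight = tanimotoRowWeight(matrix[row]);
   for (int column = 0; column < matrix[row].length; column++) {
     if (matrix[row][column] != 0.0) {
       if (!invertedIndex.containsKey(column)) {
         invertedIndex.put(column, new ArrayList<double[]>());
       }
       invertedIndex.get(column).add(
           new double[] { row, matrix[row][column], weight });
     }
   }
 }
 // invertedIndex now maps column 0 -> (0,1,2),(1,2,2)
 // column 1 -> (1,4,2) and column 2 -> (0,3,2)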

Hope that answers your question.

--sebastian

On 20.11.2010 12:14, Sean Owen wrote:
> I'm looking at RowWeightMapper as part of getting inside the
> distributed similarity computation. This bit raised a question:
>
>
>   /**
>    * collects all {@link WeightedOccurrence}s for a column and writes
>    * them to a {@link WeightedOccurrenceArray}
>    */
>   public static class WeightedOccurrencesPerColumnReducer extends
>       Reducer<VarIntWritable,WeightedOccurrence,VarIntWritable,WeightedOccurrenceArray> {
>
>     @Override
>     protected void reduce(VarIntWritable column,
>         Iterable<WeightedOccurrence> weightedOccurrences, Context ctx)
>         throws IOException, InterruptedException {
>
>       Set<WeightedOccurrence> collectedWeightedOccurrences =
>           new HashSet<WeightedOccurrence>();
>       for (WeightedOccurrence weightedOccurrence : weightedOccurrences) {
>         collectedWeightedOccurrences.add(weightedOccurrence.clone());
>       }
>
>       ctx.write(column, new WeightedOccurrenceArray(
>           collectedWeightedOccurrences.toArray(
>               new WeightedOccurrence[collectedWeightedOccurrences.size()])));
>     }
>   }
>
>
> Because WeightedOccurrence implements equals/hashCode based only on
> row, this seems to have the effect of throwing out all but one
> WeightedOccurrence for each column. Makes sense to me -- should have
> only ever generated 0 or 1 entries for a row/column pair of course.
>
> But then what's the point of de-duping with a Set, and writing out an
> array? Seems like it's always 0 or 1 entries?
>
> I bet I'll figure it out as I proceed but will be grateful for help as
> I step through this aspect.
