I'm looking at RowWeightMapper as part of digging into the distributed
similarity computation. This bit raised a question:
/**
 * collects all {@link WeightedOccurrence}s for a column and writes them
 * to a {@link WeightedOccurrenceArray}
 */
public static class WeightedOccurrencesPerColumnReducer
    extends Reducer<VarIntWritable,WeightedOccurrence,VarIntWritable,WeightedOccurrenceArray> {

  @Override
  protected void reduce(VarIntWritable column,
                        Iterable<WeightedOccurrence> weightedOccurrences,
                        Context ctx) throws IOException, InterruptedException {
    Set<WeightedOccurrence> collectedWeightedOccurrences = new HashSet<WeightedOccurrence>();
    for (WeightedOccurrence weightedOccurrence : weightedOccurrences) {
      // clone() so the collected instance survives Hadoop's reuse of the
      // Writable it hands back during iteration
      collectedWeightedOccurrences.add(weightedOccurrence.clone());
    }
    ctx.write(column, new WeightedOccurrenceArray(collectedWeightedOccurrences.toArray(
        new WeightedOccurrence[collectedWeightedOccurrences.size()])));
  }
}
Because WeightedOccurrence implements equals()/hashCode() based only on the
row, the HashSet seems to have the effect of throwing out all but one
WeightedOccurrence per row for each column. That makes sense to me -- upstream
should only ever have generated 0 or 1 entries for a given row/column pair.
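For concreteness, here's my mental model of that equality contract as a
minimal sketch -- the field names and the clone() body are my guesses, not
the actual Mahout source (which also implements Writable):

class WeightedOccurrence implements Cloneable {
  int row;
  double weight;

  WeightedOccurrence(int row, double weight) {
    this.row = row;
    this.weight = weight;
  }

  @Override
  public boolean equals(Object o) {
    // equality is decided by the row alone; the weight is ignored
    return o instanceof WeightedOccurrence && ((WeightedOccurrence) o).row == row;
  }

  @Override
  public int hashCode() {
    return row;
  }

  @Override
  public WeightedOccurrence clone() {
    return new WeightedOccurrence(row, weight);
  }
}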
But then what's the point of de-duping with a Set and writing out an array?
It seems like there would only ever be 0 or 1 entries per row/column pair
anyway.
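To make the question concrete: if two occurrences for the same row ever did
reach one reduce call, I'd expect java.util.HashSet to keep whichever arrived
first and silently drop the rest (using the sketched class above):

import java.util.HashSet;
import java.util.Set;

public class DedupDemo {
  public static void main(String[] args) {
    Set<WeightedOccurrence> collected = new HashSet<WeightedOccurrence>();
    collected.add(new WeightedOccurrence(7, 0.5));
    // same row, different weight: add() returns false, the existing
    // element stays, and the second weight is discarded
    boolean added = collected.add(new WeightedOccurrence(7, 0.9));
    System.out.println(added);            // false
    System.out.println(collected.size()); // 1
  }
}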
I expect I'll figure it out as I proceed, but I'd be grateful for any help as
I step through this part of the code.