I'm looking at RowWeightMapper as part of digging into the distributed
similarity computation. This bit raised a question:


  /**
   * collects all {@link WeightedOccurrence}s for a column and writes them
   * to a {@link WeightedOccurrenceArray}
   */
  public static class WeightedOccurrencesPerColumnReducer
      extends Reducer<VarIntWritable,WeightedOccurrence,VarIntWritable,WeightedOccurrenceArray> {

    @Override
    protected void reduce(VarIntWritable column, Iterable<WeightedOccurrence> weightedOccurrences, Context ctx)
        throws IOException, InterruptedException {

      Set<WeightedOccurrence> collectedWeightedOccurrences = new HashSet<WeightedOccurrence>();
      for (WeightedOccurrence weightedOccurrence : weightedOccurrences) {
        collectedWeightedOccurrences.add(weightedOccurrence.clone());
      }

      ctx.write(column, new WeightedOccurrenceArray(collectedWeightedOccurrences.toArray(
          new WeightedOccurrence[collectedWeightedOccurrences.size()])));
    }
  }


Because WeightedOccurrence implements equals/hashCode based only on the row,
this seems to have the effect of throwing out all but one WeightedOccurrence
per row within each column's group. That makes sense to me -- upstream should
only ever have generated 0 or 1 entries for a given row/column pair anyway.
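
For concreteness, here's a minimal standalone sketch of the collapse I mean.
The class name, field names, and constructor are my guesses for illustration,
not the actual Mahout source; the row-only equals/hashCode contract is the
part that matters:

  import java.util.HashSet;
  import java.util.Set;

  // Sketch only: payload fields are assumptions, not Mahout's real layout.
  final class WeightedOccurrenceSketch implements Cloneable {

    private final int row;
    private final double value;   // assumed payload, ignored by equals/hashCode
    private final double weight;  // assumed payload, ignored by equals/hashCode

    WeightedOccurrenceSketch(int row, double value, double weight) {
      this.row = row;
      this.value = value;
      this.weight = weight;
    }

    @Override
    public WeightedOccurrenceSketch clone() {
      return new WeightedOccurrenceSketch(row, value, weight);
    }

    // equality considers only the row...
    @Override
    public boolean equals(Object other) {
      return other instanceof WeightedOccurrenceSketch
          && ((WeightedOccurrenceSketch) other).row == row;
    }

    // ...and so does the hash code, so a HashSet collapses same-row entries
    @Override
    public int hashCode() {
      return row;
    }

    public static void main(String[] args) {
      Set<WeightedOccurrenceSketch> collected = new HashSet<WeightedOccurrenceSketch>();
      collected.add(new WeightedOccurrenceSketch(1, 0.5, 2.0));
      collected.add(new WeightedOccurrenceSketch(1, 0.9, 3.0)); // same row: dropped
      collected.add(new WeightedOccurrenceSketch(2, 0.1, 1.0));
      System.out.println(collected.size()); // prints 2, not 3
    }
  }

Running it prints 2: the second occurrence for row 1 is silently swallowed
by the HashSet.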

But then what's the point of de-duping with a Set before copying into the
array? If there are only ever 0 or 1 entries per row/column pair, the Set
never has anything to drop.

I expect I'll figure it out as I proceed, but I'd be grateful for any help
as I step through this part of the code.
