Hello friends, I wrote a reduce() that receives a large dataset as Text values from the map(). The purpose of the reduce() is to compute the distance between each pair of items in the values. When I run it, I run out of memory; I tried increasing the heap size, but that does not scale either. I am wondering whether there is a way to distribute the work of the reduce() itself so that it scales. If this is possible, can you kindly share your ideas? Please note that it is crucial for the values to be passed together in the fashion that I am doing, so they can be clustered into groups.
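For context, the map() side just emits each brand name under the key the brands should be grouped by, so all of them reach a single reduce() call together. This is a simplified sketch (my real parsing logic is more involved; assume tab-separated "productID <TAB> brandName" lines here):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class BrandExtractionMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text productID = new Text();
    private final Text brandName = new Text();

    @Override
    public void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Simplified: assume one "productID <TAB> brandName" pair per line.
        String[] fields = record.toString().split("\t");
        productID.set(fields[0]);
        brandName.set(fields[1]);

        // All brand names for one key end up in the same reduce() call.
        context.write(productID, brandName);
    }
}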
Here is what the reduce() looks like (MAX_DISTANCE and EDIT_DISTANCE are constants defined elsewhere in the enclosing class; the clusterer classes are LingPipe's):

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import com.aliasi.cluster.CompleteLinkClusterer;
import com.aliasi.cluster.HierarchicalClusterer;

public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {

    Text key = new Text("1");
    Set<String> inputSet = new HashSet<String>();
    StringBuilder clusterBuilder = new StringBuilder();
    Set<Set<String>> clClustering = null;
    Text group = new Text();

    // Complete-link clusterer
    HierarchicalClusterer<String> clClusterer =
            new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        // Hadoop reuses the same Reducer instance across calls, so the
        // per-call state has to be reset here.
        inputSet.clear();

        // Materialize every brand name for this key in memory; this is
        // where the heap blows up for large groups.
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }

        // Perform complete-link clustering on the whole set at once
        clClustering = clClusterer.cluster(inputSet);

        Iterator<Set<String>> itr = clClustering.iterator();
        while (itr.hasNext()) {
            Set<String> brandsSet = itr.next();
            clusterBuilder.append("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand).append(",");
            }
            clusterBuilder.append("]");
        }

        group.set(clusterBuilder.toString());
        clusterBuilder = new StringBuilder();
        context.write(key, group);
    }
}

Thanks,
-Ahmed