Hello friends, I wrote a reduce() that receives a large dataset as Text values from the map(). The purpose of the reduce() is to compute the distance between each pair of items in the values. When I run it, I run out of memory; I tried increasing the heap size, but that does not scale either. I am wondering whether there is a way to distribute the work of the reduce() itself so that it scales. If this is possible, can you kindly share your ideas? Please note that it is crucial for the values to be passed together in the fashion that I am doing, so they can be clustered into groups.
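For context, the map() side just emits each brand name under the key the brands should be grouped by, so all of them reach a single reduce() call together. This is a simplified sketch (my real parsing logic is more involved; assume tab-separated "productID <TAB> brandName" lines here):

import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public static class BrandExtractionMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text productID = new Text();
    private final Text brandName = new Text();

    @Override
    public void map(LongWritable offset, Text record, Context context)
            throws IOException, InterruptedException {
        // Simplified: assume one "productID <TAB> brandName" pair per line.
        String[] fields = record.toString().split("\t");
        productID.set(fields[0]);
        brandName.set(fields[1]);

        // All brand names for one key end up in the same reduce() call.
        context.write(productID, brandName);
    }
}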
Here is what the reduce() looks like (MAX_DISTANCE and EDIT_DISTANCE are constants defined elsewhere in the enclosing class; the clusterer classes are LingPipe's):

import java.io.IOException;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

import com.aliasi.cluster.CompleteLinkClusterer;
import com.aliasi.cluster.HierarchicalClusterer;

public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {

    Text key = new Text("1");
    Set<String> inputSet = new HashSet<String>();
    StringBuilder clusterBuilder = new StringBuilder();
    Set<Set<String>> clClustering = null;
    Text group = new Text();

    // Complete-link clusterer
    HierarchicalClusterer<String> clClusterer =
            new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        // Hadoop reuses the same Reducer instance across calls, so the
        // per-call state has to be reset here.
        inputSet.clear();

        // Materialize every brand name for this key in memory; this is
        // where the heap blows up for large groups.
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }

        // Perform complete-link clustering on the whole set at once
        clClustering = clClusterer.cluster(inputSet);

        Iterator<Set<String>> itr = clClustering.iterator();
        while (itr.hasNext()) {
            Set<String> brandsSet = itr.next();
            clusterBuilder.append("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand).append(",");
            }
            clusterBuilder.append("]");
        }

        group.set(clusterBuilder.toString());
        clusterBuilder = new StringBuilder();
        context.write(key, group);
    }
}

Thanks,
-Ahmed