You could try using more, smaller input splits, but with large datasets and distance thresholds that are too small, the mappers will still choke, because the number of canopies approaches the number of points each mapper sees. The single reducer will also choke unless the thresholds allow it to condense the mapper canopies. I think the out-of-memory error is just another (quicker) indication that your thresholds are wrong; getting several million clusters out of canopy is probably not very useful anyway.
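
To illustrate the threshold effect, here is a minimal single-pass sketch in Python. It is not Mahout's implementation; the function name `canopy_centers`, the 1-D sample points, and the absolute-difference distance are all assumptions made for the example. It only tracks canopy centers, since in this simplified form the tight threshold T2 alone decides whether a point starts a new canopy (the loose threshold T1 would only affect which points get observed by existing canopies).

```python
def canopy_centers(points, t2, distance):
    """Return the canopy centers chosen in one pass over `points`.

    A point starts a new canopy only if it is not within t2 of any
    existing center (i.e. it is not "strongly bound" to a canopy).
    """
    centers = []
    for p in points:
        strongly_bound = any(distance(p, c) < t2 for c in centers)
        if not strongly_bound:
            centers.append(p)
    return centers


def abs_distance(a, b):
    # Hypothetical 1-D distance measure, for illustration only.
    return abs(a - b)


points = [0.0, 0.1, 0.2, 5.0, 5.1, 10.0]

# A reasonable T2 condenses the six points into three canopies.
print(len(canopy_centers(points, 1.0, abs_distance)))   # 3

# A too-small T2 makes every point its own canopy, so memory use
# grows with the number of points rather than the number of clusters.
print(len(canopy_centers(points, 0.05, abs_distance)))  # 6
```

The same thing happens in the reducer: if the mapper canopy centers are all farther apart than T2, none of them get merged and the reducer has to hold every one of them.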

On 5/2/10 4:14 AM, Robin Anil wrote:
Keeping all canopies in memory is not making things scale. I frequently run
into out of memory errors when the distance thresholds are not good on
reuters. Any ideas on optimizing this?

Robin

