You could try using more, smaller input splits, but large datasets and
too-small distance thresholds will still choke the mappers, with the
number of canopies approaching the number of points each mapper sees.
The single reducer will also choke unless the thresholds let it condense
the mapper canopies. I think the out-of-memory error is just another
(quicker) indication that your thresholds are wrong; getting several
million clusters out of canopy is probably not very useful anyway.
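
To make the threshold effect concrete, here is a minimal single-pass
sketch in Python. It is not the Mahout implementation; the function name
canopy_centers, the 1-D points and the threshold values are all made up
for illustration. Only the tight threshold (T2) is modelled, since that
is what decides whether a point starts a new canopy.

```python
def canopy_centers(points, t2):
    """Simplified canopy pass over 1-D points (illustrative, not Mahout's code).

    A point closer than t2 to an existing center is absorbed by that canopy;
    any other point becomes a new center that has to be held in memory.
    """
    centers = []
    for p in points:
        if not any(abs(p - c) < t2 for c in centers):
            centers.append(p)
    return centers


# Hypothetical data: 1000 points spaced 1.0 apart.
points = [float(i) for i in range(1000)]

# t2 smaller than the spacing: every point becomes its own canopy.
tiny = canopy_centers(points, t2=0.5)

# t2 much larger than the spacing: the points condense into a few canopies.
wide = canopy_centers(points, t2=100.0)

print(len(tiny), len(wide))  # prints "1000 10"
```

With the tiny threshold the list of centers is as large as the input,
which is the situation that exhausts memory in a mapper or in the single
reducer.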
On 5/2/10 4:14 AM, Robin Anil wrote:
Keeping all canopies in memory does not scale. I frequently run into
out-of-memory errors on the Reuters dataset when the distance
thresholds are not good. Any ideas on optimizing this?
Robin