On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil <robin.a...@gmail.com> wrote:
> I was trying out SeqAccessSparseVector on Canopy Clustering using Manhattan
> distance. I found performance to be really bad, so I profiled it with
> YourKit (thanks a lot for providing us a free license).
>
> Since I was trying out Manhattan distance, there were a lot of A-B
> operations, which created a lot of clone operations: 5% of the total time.
> There were also many A+B operations for adding a point to the canopy to
> average. This was also creating a lot of clone operations: 90% of the
> total time.
>

SequentialAccessSparseVector should only be used in a read-only fashion.
If you are creating an average centroid which is sparse, but it is
mutating, then it should be a RandomAccessSparseVector. The points which
are being used to create it can be SequentialAccessSparseVector (if they
themselves never change), but then the method called should be
SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this
exploits the fast sequential iteration of SeqAcc, and the fast
random-access mutability of RandAcc.

> So we definitely need to improve that.
>
> As a small hack, I made the cluster centers RandomAccess vectors. Things
> are fast again. I don't know whether to commit or not. But something to
> look into in 0.4?
>

Yeah, cluster *centers* should indeed be RandomAccess. JIRA / patch so we
can see exactly what the change is?

-jake
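
As a rough illustration of the split described above (read-only sequential points, one mutable random-access centroid), here is a minimal, self-contained Java sketch. It does not use Mahout: `SeqVec`, `RandVec` and `centroid` are hypothetical toy stand-ins, not Mahout's classes or their real implementation. Only the `addTo` name mirrors the method mentioned above, and the sorted-array and hash-map layouts are assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseVectorSketch {

    /** Hypothetical read-only sparse vector: parallel arrays sorted by index. */
    static final class SeqVec {
        private final int[] indices;   // assumed strictly increasing
        private final double[] values; // values[i] belongs to indices[i]

        SeqVec(int[] indices, double[] values) {
            this.indices = indices.clone();
            this.values = values.clone();
        }

        /** Walks the non-zeros in index order and accumulates them into target. */
        void addTo(RandVec target) {
            for (int i = 0; i < indices.length; i++) {
                target.add(indices[i], values[i]);
            }
        }
    }

    /** Hypothetical mutable sparse vector backed by a hash map. */
    static final class RandVec {
        private final Map<Integer, Double> cells = new HashMap<>();

        /** In-place update of one cell; no copy of the vector is made. */
        void add(int index, double delta) {
            cells.merge(index, delta, Double::sum);
        }

        double get(int index) {
            return cells.getOrDefault(index, 0.0);
        }

        void scale(double factor) {
            cells.replaceAll((index, value) -> value * factor);
        }
    }

    /** Averages the points into one mutable centroid without cloning any point. */
    static RandVec centroid(SeqVec... points) {
        RandVec sum = new RandVec();
        for (SeqVec p : points) {
            p.addTo(sum);
        }
        if (points.length > 0) {
            sum.scale(1.0 / points.length);
        }
        return sum;
    }

    public static void main(String[] args) {
        SeqVec p1 = new SeqVec(new int[] {0, 5}, new double[] {1.0, 2.0});
        SeqVec p2 = new SeqVec(new int[] {5, 9}, new double[] {4.0, 6.0});
        RandVec c = centroid(p1, p2);
        System.out.println(c.get(0) + " " + c.get(5) + " " + c.get(9));
    }
}
```

The design point is that the accumulator is the only object that mutates. Each point is iterated once and never copied, so no temporary A+B vector is created per point.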