Re: Profiling SequentialAccessSparseVector

Jake Mannix Thu, 18 Feb 2010 12:26:35 -0800

On Thu, Feb 18, 2010 at 11:55 AM, Robin Anil <robin.a...@gmail.com> wrote:


> I was trying out SeqAccessSparseVector on Canopy Clustering using Manhattan
> distance. I found performance to be really bad, so I profiled it with
> YourKit (thanks a lot for providing us a free license).
>
> Since I was trying out Manhattan distance, there were a lot of A-B
> operations, which created a lot of clone operations: 5% of the total time.
> There were also many A+B operations for adding a point to the canopy to
> average. These also created a lot of clone operations: 90% of the total time.
>

SequentialAccessSparseVector should only be used in a read-only fashion.  If
you are building an average centroid that is sparse but is being mutated,
it should be a RandomAccessSparseVector.  The points being added into it
can be SequentialAccessSparseVector (if they themselves never change), but
then the method called should be
SequentialAccessSparseVector.addTo(RandomAccessSparseVector) - this exploits
the fast sequential iteration of SeqAcc and the fast random-access
mutability of RandAcc.
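
To make the pattern concrete, here is a minimal self-contained sketch.  It
does not use the Mahout classes: SeqVec and RandVec are hypothetical stand-ins
(sorted parallel arrays and a hash map, respectively), and their layout is
only an assumption made for illustration.  The point is that the read-only
vector is iterated in order and the accumulator is mutated in place, so
neither side is cloned.

```java
import java.util.HashMap;
import java.util.Map;

public class SparseAddSketch {

    /** Hypothetical read-only sparse vector: parallel arrays sorted by index. */
    static final class SeqVec {
        private final int[] indices;   // strictly increasing
        private final double[] values; // values[i] belongs to indices[i]

        SeqVec(int[] indices, double[] values) {
            this.indices = indices.clone();
            this.values = values.clone();
        }

        /** Adds this vector into the accumulator in place, one linear pass. */
        void addTo(RandVec accumulator) {
            for (int i = 0; i < indices.length; i++) {
                accumulator.add(indices[i], values[i]);
            }
        }
    }

    /** Hypothetical mutable sparse vector backed by a hash map. */
    static final class RandVec {
        private final Map<Integer, Double> entries = new HashMap<>();

        /** Adds delta to the entry at index, creating the entry if absent. */
        void add(int index, double delta) {
            entries.merge(index, delta, Double::sum);
        }

        /** Returns the stored value, or 0.0 for an index that was never set. */
        double get(int index) {
            return entries.getOrDefault(index, 0.0);
        }
    }

    public static void main(String[] args) {
        RandVec sum = new RandVec(); // plays the role of the mutable centroid
        new SeqVec(new int[] {0, 3}, new double[] {1.0, 2.0}).addTo(sum);
        new SeqVec(new int[] {3, 7}, new double[] {4.0, 6.0}).addTo(sum);
        // prints: 1.0 6.0 6.0
        System.out.println(sum.get(0) + " " + sum.get(3) + " " + sum.get(7));
    }
}
```

Each addTo call costs time proportional to the number of non-zero entries in
the point being added, regardless of how large the accumulated centroid has
grown.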


>
> So we definitely need to improve that.
>
> As a small hack, I made the cluster centers RandomAccess vectors. Things
> are fast again. I don't know whether to commit it or not. But something to
> look into in 0.4?
>

Yeah, cluster *centers* should indeed be RandomAccess.  JIRA / patch so we
can see exactly what the change is?

  -jake
