On Wed, Sep 21, 2011 at 8:56 AM, Robert Layton <[email protected]> wrote:

> I do really like the metric='precomputed' concept, which allows both
> implementing actual metrics (euclidean, manhattan) as well as passing a
> precomputed array in. If the algorithm doesn't allow it for whatever reason*
> throw an error. The same interface works with kernels as well.
> * k-means springs to mind - its only 'proven' for Euclidean distance, which
> means that it should error if anything else is passed to it. I have an
> implementation that works solely using a distance matrix, but I don't know
> if it retains the qualities that the base algorithm does.

I like metric="precomputed" too and I think we should continue to
provide it when it's easy, but its main problem is its O(n_samples^2)
space complexity. It would be better if we had an efficient way to
work with triangular matrices but it would only cut the memory
consumption by 2. Also, the point of algorithms such as SVMs is
precisely that they have sparse solutions so it seems like overkill to
precompute everything in advance.

The solution, for me, is to use a cache (in the spirit of a kernel
cache) and to give a way to the algorithm to recompute pairwise
similarities / distances on demand, for example with a callable
object. That would be slow in Python, which makes me think that we
should start thinking of providing a Cython API, in addition to our
Python API, when appropriate.

My 2c,
Mathieu

------------------------------------------------------------------------------
All the data continuously generated in your IT infrastructure contains a
definitive record of customers, application performance, security
threats, fraudulent activity and more. Splunk takes this data and makes
sense of it. Business sense. IT sense. Common sense.
http://p.sf.net/sfu/splunk-d2dcopy1
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general

Reply via email to