Hi folks,
On and off for the last several months, I've been looking at the Ball Tree
code with a mind to revamping it for increased flexibility and
maintainability.
A few points I'm aiming for:
1) a more intuitive algorithm for walking the tree (the current code is
very confusing to read).
2) separating out some of the more complicated workhorse routines so that
they can be unit-tested individually (current code has everything in one
file, with no unit tests).
3) using typed memoryviews (cython 0.17+) rather than raw pointers for
ease of reading, maintaining, and debugging the code.
4) adding the ability to specify distance metrics other than euclidean &
minkowski.
I've been playing with this in various places (see [1], [2] below). The
reason I haven't done a PR yet is because I keep running into a fundamental
tradeoff between maintainability and performance.
Goals #1 and #2 above don't seem to be a problem. I could make these
changes now, and actually speed up the ball tree by about a factor of 2 or
so, while making it a bit easier to understand and debug.
Goals #3 and #4, however, end up leading to speed penalties: #3 because
there is some small overhead involved in creating and pushing around typed
memory views (I did some blog posts a while ago exploring this: see [3] &
[4] below). Goal #4 causes speed penalties because allowing compile-time
switching of the distance function means it can no longer be inlined at
compilation. The net result is that code using memory views and flexible
metrics is ~2-3 times slower than the current code, and ~4-6 times slower
than a more optimized version of the code.
So here's the question: should we prioritize flexibility (i.e. multiple
metrics) and maintainability (i.e. memory views rather than raw pointers)
even if the code becomes 2-3x slower than its present version? Or should
we prioritize the fastest possible code for the most common case, not
worrying about memory views & other metrics?
I'd love to hear people's thoughts on this -- once I've settled on what to
prioritize, I'll be able to (finally) start working on a PR!
Thanks
Jake
[1] https://github.com/jakevdp/pyDistances
[2] https://github.com/jakevdp/BinaryTree
[3] http://jakevdp.github.com/blog/2012/08/08/memoryview-benchmarks/
[4] http://jakevdp.github.com/blog/2012/08/16/memoryview-benchmarks-2/
------------------------------------------------------------------------------
Everyone hates slow websites. So do we.
Make your web apps faster with AppDynamics
Download AppDynamics Lite for free today:
http://p.sf.net/sfu/appdyn_d2d_jan
_______________________________________________
Scikit-learn-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/scikit-learn-general