See http://github.com/tdunning/knn
The algorithms definitely need more work, but figuring out what work they need will itself take more testing. To get that testing mileage, we need to make those algorithms available in a standard framework.

One thought that I have is that we should be able to build synthetic data sets that emulate the clustering and search performance of realistic data. If we can avoid looking at anything but a few generalization scores, then we have a very solid anonymization story because we won't even be generating the same *types* of data in the random generator. This alone would be an interesting thesis topic.

Again, however, we need runtime from current clustering users to get the scores.

On Fri, Oct 12, 2012 at 4:41 AM, Dan Filimon <[email protected]> wrote:

> > On my side:
> >
> > - I will provide mentor support for this project
> >
> > - I will help you write up the results by reviewing your write-ups and
> >   suggesting structure and content.
> >
> > The benefits to you will be deep knowledge of advanced clustering
> > algorithms as well as practical experience in how integration like this
> > can happen.
>
> Could you explain a bit what working on the integration would entail?
>
> I don't want to sound ungrateful here, I definitely want to work with
> you, but ideally, I'd like to work *on* these advanced clustering
> algorithms (helping improve them maybe? overambitious?), not just
> integrate them.
