On Oct 12, 2012, at 22:15, Ted Dunning <[email protected]> wrote:

> See http://github.com/tdunning/knn
> 
> The algorithms definitely need more work, but figuring out exactly what
> work they need requires more testing.
> 
> To get that testing mileage, we need to make those algorithms available in
> a standard framework.
> 
> One thought that I have is that we should be able to build synthetic data
> sets that emulate the clustering and search performance of realistic data.
> If we can avoid looking at anything but a few generalization scores, then
> we have a very solid anonymization story because we won't even be
> generating the same *types* of data in the random generator.  This alone
> would be an interesting thesis topic.
> 
> Again, however, we need runtime from current clustering users to get the
> scores.

Alright, let's do this.
I think we'll get more details clarified as we go.

Now, where do I start? What would a plan for the coming months look like?
Should I start by reading up on the theory? By learning more about Mahout?
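
Also, just to check that I understand the synthetic-data idea: is the plan
roughly something like the toy generator below? This is only a sketch to
confirm the concept; the cluster count, dimensionality, and spread are
numbers I made up, standing in for whatever summary scores a clustering
user would actually share, and it doesn't touch Mahout or the knn code yet.

import java.util.Random;

/**
 * Hypothetical sketch (not from the knn repo): emit a Gaussian mixture whose
 * cluster count, dimensionality, and within-cluster spread come from a few
 * assumed summary numbers rather than from any real user records.
 */
public class SyntheticClusterData {
    public static void main(String[] args) {
        // Assumed summary scores a clustering user might share; purely illustrative.
        int clusters = 20;            // reported number of clusters
        int dimension = 10;           // feature dimensionality
        double clusterSpread = 0.3;   // average within-cluster standard deviation
        int pointsPerCluster = 1000;

        Random rand = new Random(42);

        // Draw cluster centers from a unit Gaussian.
        double[][] centers = new double[clusters][dimension];
        for (int c = 0; c < clusters; c++) {
            for (int d = 0; d < dimension; d++) {
                centers[c][d] = rand.nextGaussian();
            }
        }

        // Emit synthetic points: each point is its cluster center plus Gaussian noise.
        for (int c = 0; c < clusters; c++) {
            for (int i = 0; i < pointsPerCluster; i++) {
                StringBuilder row = new StringBuilder();
                for (int d = 0; d < dimension; d++) {
                    double value = centers[c][d] + clusterSpread * rand.nextGaussian();
                    row.append(d == 0 ? "" : ",").append(value);
                }
                System.out.println(row);
            }
        }
    }
}

If that's the right general shape, I assume the real work would be in making
the generator match the reported clustering and search scores more closely.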

> On Fri, Oct 12, 2012 at 4:41 AM, Dan Filimon
> <[email protected]> wrote:
> 
>>> On my side:
>>> 
>>> - I will provide mentor support for this project
>>> 
>>> - I will help you write up the results by reviewing your write-ups and
>>> suggesting structure and content.
>>> 
>>> The benefits to you will be deep knowledge of advanced clustering
>>> algorithms as well as practical experience in how integration like this
>>> can happen.
>> 
>> Could you explain a bit what working on the integration would entail?
>> 
>> I don't want to sound ungrateful here; I definitely want to work with
>> you. But ideally, I'd like to work *on* these advanced clustering
>> algorithms (maybe helping improve them? overambitious?), not just
>> integrate them.
>> 
