On Oct 12, 2012, at 22:15, Ted Dunning <[email protected]> wrote:
> See http://github.com/tdunning/knn
>
> The algorithms definitely need more work, but figuring out exactly what
> work they need is something that requires more testing.
>
> To get that testing mileage, we need to make those algorithms available in
> a standard framework.
>
> One thought that I have is that we should be able to build synthetic data
> sets that emulate the clustering and search performance of realistic data.
> If we can avoid looking at anything but a few generalization scores, then
> we have a very solid anonymization story because we won't even be
> generating the same *types* of data in the random generator. This alone
> would be an interesting thesis topic.
>
> Again, however, we need run-time results from current clustering users to
> get the scores.
Alright, let's do this.
I think we'll get more details clarified as we go.
Now, where do I start? What would a plan for the coming months look like?
Should I start by reading the theory first, or by learning more about Mahout?
> On Fri, Oct 12, 2012 at 4:41 AM, Dan Filimon
> <[email protected]>wrote:
>
>>> On my side:
>>>
>>> - I will provide mentor support for this project
>>>
>>> - I will help you write up the results by reviewing your write-ups and
>>> suggesting structure and content.
>>>
>>> The benefits to you will be deep knowledge of advanced clustering
>>> algorithms as well as practical experience in how an integration like
>>> this can happen.
>>
>> Could you explain a bit what working on the integration would entail?
>>
>> I don't want to sound ungrateful here; I definitely want to work with
>> you. Ideally, though, I'd like to work *on* these advanced clustering
>> algorithms (maybe even helping improve them? is that overambitious?),
>> not just integrate them.
>>