[ https://issues.apache.org/jira/browse/LUCENE-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533689 ]

Karl Wettin commented on LUCENE-1025:
-------------------------------------

There is now new example output here: 
http://ginandtonique.org/~kalle/LUCENE-1025/

I recommend out_5.5.txt, but which number demonstrates the clusterer best will 
change as the tokenization and similarity algorithm change.

> Document clusterer
> ------------------
>
>                 Key: LUCENE-1025
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1025
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis, Term Vectors
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: LUCENE-1025.txt
>
>
> A two-dimensional decision tree in conjunction with cosine coefficient 
> similarity is the base of this document clusterer. It uses Lucene for 
> tokenization and length normalization. 
> Example output of 3,500 clustered news articles, dated the first three days of 
> January 2004 and drawn from a number of sources, can be found here: < 
> http://ginandtonique.org/~kalle/LUCENE-1025/out_4.0.txt >. One thing missing 
> is automatic calculation of cluster boundaries; that is not impossible to 
> implement, nor is it really needed. The 4.5 in the URL above is that distance.
> The example was calculated limited to the top 1,000 terms per instance, 
> divided with siblings and re-pruned all the way to the root. On my dual core 
> it took about 100 ms to insert a new document in the tree, whether it 
> contained 100 or 10,000 instances. 1 GB of RAM held about 10,000 news articles. 
> The next step for this code is persistence of the tree, using BDB or perhaps 
> even something similar to the Lucene segmented solution, possibly even 
> using a Lucene Directory. The plan is to keep this clusterer synchronized with 
> the index, allowing really speedy "more like this" features.
> Later on I'll introduce map/reduce for better training speed.
> This code is far from perfect, nor are the results as good as those of many 
> other products. Considering I didn't put in more than a handful of hours, it 
> works quite well.
> By displaying neighboring clusters (as in the example) one will definitely get 
> more related documents at a fairly low false-positive cost. Perhaps it would 
> be interesting to analyse user behavior to find out if any of them could be 
> merged. Perhaps some reinforcement learning?
> There are no ROC curves, precision/recall values, or tp/fp rates, as I have no 
> manually clustered corpus to compare with.
> I've been looking for an archive of the Lucene-users forum for 
> demonstration purposes, but could not find one. Any ideas on where I can find 
> that? It could, for instance, be neat to tweak this code to identify frequently 
> asked questions and match them with an answer in the Wiki, but perhaps an SVM, 
> NB, or similar implementation would be better suited for that.
> Don't hesitate to comment on this if you have an idea, request or question.
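For reference, the cosine coefficient similarity that the issue description builds on can be sketched over plain term-frequency maps. This is a minimal, self-contained illustration, not code from the attached patch; the class and method names are made up, and in the actual clusterer the weights would come from Lucene's tokenization and length normalization rather than raw counts:

```java
import java.util.HashMap;
import java.util.Map;

public class CosineSimilarity {

    // Cosine coefficient between two sparse term-weight vectors:
    // dot(a, b) / (|a| * |b|). Terms absent from a map have weight 0.
    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0, normA = 0.0, normB = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double w = b.get(e.getKey());
            if (w != null) {
                dot += e.getValue() * w; // only shared terms contribute
            }
            normA += e.getValue() * e.getValue();
        }
        for (double w : b.values()) {
            normB += w * w;
        }
        if (normA == 0.0 || normB == 0.0) {
            return 0.0; // empty vector: define similarity as 0
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    public static void main(String[] args) {
        Map<String, Double> doc1 = new HashMap<>();
        doc1.put("lucene", 2.0);
        doc1.put("cluster", 1.0);

        Map<String, Double> doc2 = new HashMap<>();
        doc2.put("lucene", 1.0);
        doc2.put("index", 3.0);

        System.out.println(cosine(doc1, doc2)); // similarity in [0, 1]
    }
}
```

A fixed similarity threshold over this value corresponds to the cluster-boundary distances (4.0, 5.5, ...) that distinguish the example output files.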

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

