[ https://issues.apache.org/jira/browse/LUCENE-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated LUCENE-1025:
--------------------------------

    Attachment: LUCENE-1025.txt

Introduced a half-baked Markov chain (not to be confused with the token filter 
in this patch with a similar name, which concatenates tokens) in order to 
determine a mean title from all instances in a cluster. It works a bit like 
MegaHAL, the talking bot, but usually makes more sense since all titles in a 
cluster are similar. It still needs work: the length should be limited, it 
should look further ahead than one link, and it is terribly unoptimized. Still, 
it already behaves quite well with the news article corpus I test with.
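
To make the idea concrete, here is a minimal sketch of such a chain: a 
first-order model over whitespace-split title tokens that greedily follows the 
most frequent transition. This is not the code in the attached patch (the class 
and method names below are made up), and a real version would need the smarter 
length limiting and longer look-ahead mentioned above.

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/**
 * Rough sketch of a first-order Markov chain over title tokens, used to
 * compose a "mean" title for a cluster. Illustrative only, not the patch code.
 */
public class TitleChainSketch {

  private static final String START = "<s>";
  private static final String END = "</s>";

  /** token -> (following token -> count), collected from all titles in a cluster. */
  private final Map<String, Map<String, Integer>> transitions = new HashMap<>();

  public void addTitle(String title) {
    String previous = START;
    for (String token : title.trim().split("\\s+")) {
      count(previous, token);
      previous = token;
    }
    count(previous, END);
  }

  private void count(String from, String to) {
    transitions.computeIfAbsent(from, k -> new HashMap<>())
               .merge(to, 1, Integer::sum);
  }

  /** Greedily follows the most frequent transition, capped at maxTokens. */
  public String meanTitle(int maxTokens) {
    StringBuilder title = new StringBuilder();
    String current = START;
    for (int i = 0; i < maxTokens; i++) {
      Map<String, Integer> next = transitions.get(current);
      if (next == null) {
        break;
      }
      current = Collections.max(next.entrySet(), Map.Entry.comparingByValue()).getKey();
      if (END.equals(current)) {
        break;
      }
      if (title.length() > 0) {
        title.append(' ');
      }
      title.append(current);
    }
    return title.toString();
  }
}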

Perhaps it would make more sense to simply select the title of the most central 
instance in a cluster, but that is not as fun.
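
For reference, that alternative boils down to something like the sketch below: 
given a pairwise similarity matrix (assumed here to come from the same cosine 
measure the clusterer already uses), pick the instance whose summed similarity 
to the other cluster members is highest and reuse its title. The names are 
hypothetical.

/**
 * Sketch of the "most central instance" alternative. sim[i][j] is assumed to
 * hold the pairwise similarity between instances i and j of one cluster, and
 * titles is the parallel array of their titles. Illustrative only.
 */
public class CentralTitleSketch {

  public static String centralTitle(double[][] sim, String[] titles) {
    int best = 0;
    double bestSum = Double.NEGATIVE_INFINITY;
    for (int i = 0; i < sim.length; i++) {
      double sum = 0;
      for (int j = 0; j < sim.length; j++) {
        if (i != j) {
          sum += sim[i][j];
        }
      }
      if (sum > bestSum) {
        bestSum = sum;
        best = i;
      }
    }
    return titles[best];
  }
}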

> Document clusterer
> ------------------
>
>                 Key: LUCENE-1025
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1025
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis, Term Vectors
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: LUCENE-1025.txt, LUCENE-1025.txt
>
>
> A two-dimensional decision tree in conjunction with cosine coefficient 
> similarity is the base of this document clusterer. It uses Lucene for 
> tokenization and length normalization; a rough sketch of the similarity side 
> follows after this quoted description.
> Example output of 3,500 clustered news articles dated the first three days of 
> January 2004 from a number of sources can be found here: 
> < http://ginandtonique.org/~kalle/LUCENE-1025/out_4.0.txt >. One thing missing 
> is automatic calculation of cluster boundaries. It is not impossible to 
> implement, nor is it really needed; 4.5 in the URL above is that distance.
> The example was calculated limited to the top 1,000 terms per instance, 
> divided with siblings and re-pruned all the way to the root. On my dual core 
> it took about 100 ms to insert a new document into the tree, no matter whether 
> the tree contained 100 or 10,000 instances. 1 GB of RAM held about 10,000 news 
> articles. 
> The next step for this code is persistence of the tree, using BDB or perhaps 
> even something similar to the Lucene segmented solution, possibly on top of a 
> Lucene Directory. The plan is to keep this clusterer synchronized with the 
> index, allowing really speedy "more like this" features.
> Later on I'll introduce map/reduce for better training speed.
> This code is far from perfect, nor are the results as good as those of many 
> other products. Considering I didn't put in more than a handful of hours, it 
> works quite well.
> By displaying neighboring clusters (as in the example) one will definitely get 
> more related documents at a fairly low false-positive cost. Perhaps it would 
> be interesting to analyse user behavior to find out whether any of them could 
> be merged. Perhaps some reinforcement learning?
> There are no ROC curves, precision/recall values or tp/fp rates, as I have no 
> manually clustered corpus to compare with.
> I've been looking for an archive of the Lucene-users forum for demonstration 
> purposes, but could not find one. Any ideas on where I can find that? It 
> could, for instance, be neat to tweak this code to identify frequently asked 
> questions and match them with answers in the Wiki, but perhaps an SVM, NB or 
> similar implementation would be better suited for that.
> Don't hesitate to comment on this if you have an idea, request or question.
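
Since the quoted description refers to cosine coefficient similarity over 
Lucene-tokenized term vectors, here is a rough sketch of that part: build term 
frequency vectors with a Lucene analyzer and compare them with the cosine 
coefficient. This is not the code in the attached patch, and it is written 
against a recent Lucene analysis API rather than the one current when the 
issue was filed.

import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

/** Illustrative sketch: Lucene tokenization plus cosine coefficient similarity. */
public class CosineSketch {

  /** Term frequencies of one document, produced by a Lucene analyzer. */
  static Map<String, Integer> termFrequencies(Analyzer analyzer, String text) throws IOException {
    Map<String, Integer> tf = new HashMap<>();
    try (TokenStream ts = analyzer.tokenStream("body", new StringReader(text))) {
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      while (ts.incrementToken()) {
        tf.merge(term.toString(), 1, Integer::sum);
      }
      ts.end();
    }
    return tf;
  }

  /** Cosine coefficient: dot product divided by the product of vector lengths. */
  static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    double dot = 0, normA = 0, normB = 0;
    for (Map.Entry<String, Integer> e : a.entrySet()) {
      Integer other = b.get(e.getKey());
      if (other != null) {
        dot += e.getValue() * (double) other;
      }
      normA += e.getValue() * (double) e.getValue();
    }
    for (int v : b.values()) {
      normB += (double) v * v;
    }
    return dot == 0 ? 0 : dot / (Math.sqrt(normA) * Math.sqrt(normB));
  }

  public static void main(String[] args) throws IOException {
    Analyzer analyzer = new StandardAnalyzer();
    Map<String, Integer> d1 = termFrequencies(analyzer, "lucene document clustering with term vectors");
    Map<String, Integer> d2 = termFrequencies(analyzer, "clustering documents by term vector similarity");
    System.out.println("cosine = " + cosine(d1, d2));
  }
}

In the actual clusterer the vectors would additionally be pruned to the top 
terms and length-normalized before they enter the tree, as the quoted 
description notes.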

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

