[
https://issues.apache.org/jira/browse/LUCENE-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Karl Wettin closed LUCENE-1025.
-------------------------------
Resolution: Won't Fix
MAHOUT-19 is a much better implementation.
> Document clusterer
> ------------------
>
> Key: LUCENE-1025
> URL: https://issues.apache.org/jira/browse/LUCENE-1025
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Analysis, Term Vectors
> Reporter: Karl Wettin
> Priority: Minor
> Attachments: LUCENE-1025.txt, LUCENE-1025.txt
>
>
> A two-dimensional decision tree in conjunction with cosine coefficient
> similarity is the base of this document clusterer. It uses Lucene for
> tokenization and length normalization.
> Example output of 3500 clustered news articles dated the first three days of
> January 2004 from a number of sources can be found here: <
> http://ginandtonique.org/~kalle/LUCENE-1025/out_4.0.txt >. One thing missing
> is automatic calculation of cluster boundaries. It is not impossible to
> implement, but neither is it really needed. The number in the URL above is
> that distance.
> The example was calculated with each instance limited to its top 1000 terms,
> divided with siblings and re-pruned all the way to the root. On my dual core
> it took about 100ms to insert a new document in the tree, regardless of
> whether the tree contained 100 or 10,000 instances. 1GB RAM held about
> 10,000 news articles.
> The next step for this code is persistence of the tree using BDB, or perhaps
> something similar to the Lucene segmented solution, perhaps even using the
> Lucene Directory. The plan is to keep this clusterer synchronized with
> the index, allowing really speedy "more like this" features.
> Later on I'll introduce map/reduce for better training speed.
> This code is far from perfect, nor are the results as good as those of many
> other products. Considering I didn't put in more than a handful of hours, it
> works quite well.
> By displaying neighboring clusters (as in the example) one will definitely get
> more related documents at a fairly low false-positive cost. Perhaps it would
> be interesting to analyse user behavior to find out whether any of the
> clusters could be merged. Perhaps some reinforcement learning?
> There are no ROC curves, precision/recall values or tp/fp rates, as I have no
> manually clustered corpus to compare with.
> I've been looking for an archive of the Lucene-users forum to use for
> demonstration, but could not find one. Any ideas on where I can find
> that? It could for instance be neat to tweak this code to identify frequently
> asked questions and match them with answers in the Wiki, but perhaps an SVM,
> NB or similar implementation would be better suited for that.
> Don't hesitate to comment on this if you have an idea, request or question.
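
For readers unfamiliar with the cosine coefficient and the top-N term limit
mentioned in the description, here is a minimal sketch of both. It is not the
code from the attached patch and it does not use any Lucene API: documents are
assumed to be plain maps from term to weight, and the names CosineSketch,
topTerms, cosine and norm are made up for illustration only.

```java
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CosineSketch {

    /** Hypothetical helper: keeps only the n highest-weighted terms of a vector. */
    public static Map<String, Double> topTerms(Map<String, Double> vector, int n) {
        return vector.entrySet().stream()
                .sorted(Map.Entry.<String, Double>comparingByValue(Comparator.reverseOrder()))
                .limit(n)
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (first, second) -> first, LinkedHashMap::new));
    }

    /** Cosine coefficient: dot(a, b) / (|a| * |b|); 0 if either vector is empty. */
    public static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            Double weight = b.get(e.getKey());
            if (weight != null) {
                dot += e.getValue() * weight; // only shared terms contribute
            }
        }
        double normA = norm(a);
        double normB = norm(b);
        return (normA == 0.0 || normB == 0.0) ? 0.0 : dot / (normA * normB);
    }

    /** Euclidean length of a term vector. */
    public static double norm(Map<String, Double> vector) {
        double sum = 0.0;
        for (double weight : vector.values()) {
            sum += weight * weight;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        // Made-up term weights, purely for demonstration.
        Map<String, Double> doc1 = Map.of("lucene", 3.0, "index", 2.0, "the", 1.0);
        Map<String, Double> doc2 = Map.of("lucene", 2.0, "cluster", 4.0, "the", 1.0);
        // Prune each vector to its 2 strongest terms, then compare.
        System.out.println(cosine(topTerms(doc1, 2), topTerms(doc2, 2)));
    }
}
```

Dividing by both vector lengths makes the score independent of document
length. How the patch itself weights terms and normalizes length is not shown
in this message, so treat the weighting above as an assumption.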
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.