[ https://issues.apache.org/jira/browse/LUCENE-1025?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12533689 ]
Karl Wettin commented on LUCENE-1025:
-------------------------------------

There is now new example output here: http://ginandtonique.org/~kalle/LUCENE-1025/

I recommend out_5.5.txt, but which number best demonstrates the clusterer will change as the tokenization and similarity algorithm changes.

> Document clusterer
> ------------------
>
>                 Key: LUCENE-1025
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1025
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Analysis, Term Vectors
>            Reporter: Karl Wettin
>            Priority: Minor
>         Attachments: LUCENE-1025.txt
>
>
> A two-dimensional decision tree in conjunction with cosine coefficient similarity is the base of this document clusterer (a sketch of the cosine coefficient follows after this message). It uses Lucene for tokenization and length normalization.
>
> Example output of 3,500 clustered news articles, dated the first three days of January 2004 and taken from a number of sources, can be found here: <http://ginandtonique.org/~kalle/LUCENE-1025/out_4.0.txt>. One thing missing is automatic calculation of cluster boundaries. That is not impossible to implement, nor is it really needed. The 4.5 in the URL above is that distance.
>
> The example was calculated limited to the top 1000 terms per instance, divided with siblings and re-pruned all the way to the root. On my dual core it took about 100 ms to insert a new document into the tree, whether it contained 100 or 10,000 instances. 1 GB of RAM held about 10,000 news articles.
>
> The next step for this code is persistence of the tree, using BDB or perhaps even something similar to the Lucene segmented solution, possibly even using a Lucene Directory. The plan is to keep this clusterer synchronized with the index, allowing really speedy "more like this" features.
>
> Later on I'll introduce map/reduce for better training speed.
>
> This code is far from perfect, nor are the results as good as many other products'. Considering I didn't put in more than a handful of hours, it works quite well.
>
> By displaying neighboring clusters (as in the example) one will definitely get more related documents at a fairly low false-positive cost. Perhaps it would be interesting to analyse user behavior to find out if any of them could be merged. Perhaps some reinforcement learning?
>
> There are no ROC curves, precision/recall values, nor TP/FP rates, as I have no manually clustered corpus to compare with.
>
> I've been looking for an archive of the Lucene-users forum for demonstration purposes, but could not find one. Any ideas on where I can find that? It could for instance be neat to tweak this code to identify frequently asked questions and match them with answers in the Wiki, but perhaps an SVM, NB or similar implementation would be better suited for that.
>
> Don't hesitate to comment on this if you have an idea, request or question.
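
Below is a minimal, self-contained Java sketch of the cosine coefficient between two term-frequency vectors, the similarity measure named in the description above. The class name CosineCoefficient, the helper length(), and the Map<String, Integer> representation are illustrative assumptions; the attached patch builds its vectors from Lucene's tokenization and length normalization and may compute the measure differently.

import java.util.HashMap;
import java.util.Map;

/**
 * Sketch only, not the patch's implementation: cosine coefficient between
 * two documents represented as term-frequency maps.
 */
public class CosineCoefficient {

  /** Cosine similarity = dot(a, b) / (|a| * |b|). */
  public static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
    // Iterate over the smaller map when computing the dot product.
    Map<String, Integer> small = a.size() <= b.size() ? a : b;
    Map<String, Integer> large = small == a ? b : a;

    double dot = 0d;
    for (Map.Entry<String, Integer> e : small.entrySet()) {
      Integer other = large.get(e.getKey());
      if (other != null) {
        dot += (double) e.getValue() * other;
      }
    }
    double norm = length(a) * length(b);
    return norm == 0d ? 0d : dot / norm;
  }

  /** Euclidean length of a term-frequency vector. */
  private static double length(Map<String, Integer> v) {
    double sum = 0d;
    for (int f : v.values()) {
      sum += (double) f * f;
    }
    return Math.sqrt(sum);
  }

  public static void main(String[] args) {
    Map<String, Integer> doc1 = new HashMap<String, Integer>();
    doc1.put("lucene", 3);
    doc1.put("cluster", 2);
    Map<String, Integer> doc2 = new HashMap<String, Integer>();
    doc2.put("lucene", 1);
    doc2.put("index", 4);
    // Only "lucene" is shared, so the coefficient is low but non-zero.
    System.out.println(cosine(doc1, doc2));
  }
}

In the clusterer itself the term frequencies would come from Lucene's analysis chain rather than hand-built maps; the measure is the same either way.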