On May 23, 2008, at 2:15 PM, Karl Wettin wrote:
On May 17, 2008, at 1:39 PM, Grant Ingersoll wrote:
On May 12, 2008, at 11:24 AM, Karl Wettin wrote:
Did anyone do anything with this? If not I'll come up with something
in the beginning of June. I think it should be abstract enough to
handle other similar data sources (Apache mbox archives).
This would be cool.
In what way can we prepare so it makes as much sense as possible for
the many things we might want to show off? What class fields can we
extract from the headers besides author and thread identity? How do
we want to tokenize the text (grams of words and characters,
stemming, stopwords, etc.)? Do we want to separate quotation from
author text so we can apply different weights to quotation, etc.?
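As a sketch of the kind of header extraction asked about above, pulling author, subject, and thread identity (via Message-ID / In-Reply-To, per RFC 2822) out of raw headers; this is illustrative plain Java, not anything from the patches discussed, and it ignores header folding (continuation lines) that a real mbox parser would have to handle:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch: extract classification fields from raw mail headers.
// Thread identity can be derived from the In-Reply-To / References headers.
// Note: does not unfold RFC 2822 continuation lines.
public class HeaderFields {
    public static Map<String, String> extract(String rawHeaders) {
        Map<String, String> fields = new HashMap<>();
        for (String line : rawHeaders.split("\r?\n")) {
            int colon = line.indexOf(':');
            if (colon < 0) continue;
            String name = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            switch (name) {
                case "from":        fields.put("author", value);  break;
                case "subject":     fields.put("subject", value); break;
                case "message-id":  fields.put("id", value);      break;
                case "in-reply-to": fields.put("thread", value);  break;
            }
        }
        return fields;
    }
}
```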
Let's just start simple with words and then enhance.
It might be interesting to take a look at what sort of tokenization
other libs do, the Weka StringToWordVector for instance (best viewed
from their GUI). We should be able to do much better than that with
what's available in Lucene. But a default chain of token streams that
is easy to set up is not a bad idea.
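Such a default chain could amount to little more than letter tokenization, lowercasing, and stop-word filtering. In Lucene this would be a Tokenizer followed by TokenFilters; the plain-Java sketch below only illustrates the shape of the pipeline and is not Lucene code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative pipeline: tokenize -> lowercase -> drop stop words.
// In Lucene this would be a Tokenizer followed by TokenFilters.
public class SimpleChain {
    public static List<String> tokens(String text, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String raw : text.split("[^\\p{L}]+")) {
            if (raw.isEmpty()) continue;
            String t = raw.toLowerCase();        // normalization step
            if (stopWords.contains(t)) continue; // stop-word filter step
            out.add(t);
        }
        return out;
    }
}
```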
I also think we want some simple algorithmic stop word extraction.
There is a simple one in LUCENE-1025, under the incorrect name
HacGqfTermReducer.java.
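One simple algorithmic approach (a generic document-frequency sketch, not necessarily what LUCENE-1025 implements) is to treat any term that occurs in more than some fraction of the documents as a stop word:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch of frequency-based stop word extraction: a term appearing in
// more than maxDocRatio of all documents is treated as a stop word.
public class StopWordExtractor {
    public static Set<String> extract(List<Set<String>> docTerms, double maxDocRatio) {
        Map<String, Integer> df = new HashMap<>();
        for (Set<String> doc : docTerms)
            for (String term : doc)
                df.merge(term, 1, Integer::sum); // document frequency
        Set<String> stops = new HashSet<>();
        int threshold = (int) Math.ceil(maxDocRatio * docTerms.size());
        for (Map.Entry<String, Integer> e : df.entrySet())
            if (e.getValue() > threshold) stops.add(e.getKey());
        return stops;
    }
}
```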
It would be a simple thing to support different weights for subject
and body, or any other field we might extract in the future (quoted
body, etc.).
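For illustration, per-field weighting can be as simple as counting a term's subject occurrences more heavily than its body occurrences; in Lucene itself this role is played by field boosts, so the hand-rolled scorer below is just a sketch of the idea:

```java
// Sketch: weight a term's occurrences by field, e.g. subject counts
// more than body. In Lucene this role is played by field boosts.
public class FieldWeights {
    public static double score(String term, String subject, String body,
                               double subjectWeight, double bodyWeight) {
        return subjectWeight * count(term, subject) + bodyWeight * count(term, body);
    }

    private static int count(String term, String text) {
        int n = 0;
        for (String t : text.toLowerCase().split("[^\\p{L}]+"))
            if (t.equals(term)) n++;
        return n;
    }
}
```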
We also want to get rid of signatures, quoted text, and what not.
That should be handled by some pre-pre-processing layer, though, if
you ask me. LUCENE-725 can help out.
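Such a pre-processing layer could start with something as blunt as dropping ">"-quoted lines and everything after the conventional "-- " signature separator; a minimal sketch (real mail is messier, and some mailers strip the trailing space from the separator):

```java
// Sketch of a pre-processing step: drop quoted lines (starting with '>')
// and everything after the conventional "-- " signature separator.
public class MailCleaner {
    public static String stripQuotesAndSignature(String body) {
        StringBuilder sb = new StringBuilder();
        for (String line : body.split("\r?\n")) {
            if (line.equals("-- ")) break;       // signature separator
            if (line.startsWith(">")) continue;  // quoted text
            sb.append(line).append('\n');
        }
        return sb.toString().trim();
    }
}
```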
Should we perhaps make this thread an issue?
These are interesting. Perhaps you want to commit LUCENE-725? I was
wondering whether we should consider asking Lucene to put up an
Analyzer-only jar (i.e., a separate jar that combines the Analyzer/
TokenStream definitions with the contrib Analyzers package). Of
course, we may have uses for the rest of Lucene as well, so maybe not.