Hi there, I would like to add some thoughts about feature selection to this discussion. I'm working on the topic-clustering project at TU Berlin that has already been discussed on this mailing list (e.g. http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/200911.mbox/%[email protected]%3e).
Choosing the right feature-extraction and clustering algorithms is one part of the story, but what should be the input to these algorithms in the first place? In a thread about preprocessing, my colleague Marc presented our UIMA-based approach (http://mail-archives.apache.org/mod_mbox/lucene-mahout-user/200911.mbox/%[email protected]%3e). To sum it up, our pipeline implements the following preprocessing steps: stripping of HTML tags > POS tagging and noun-group chunking, both via wrappers for LingPipe annotators > stemming > stopword filtering.

So we could simply pass stemmed words without stopwords to the feature extractor, but there are more effective (and probably less data-intensive) possibilities. Think about a sentence like this one: (1) "The Me 262 is well known as the world's first fighter aircraft with a jet engine". If you do topic clustering, which words give a proper representation of this sentence's topic? A good guess seems to be to take the noun phrases, i.e. "The Me 262", "the world's first fighter aircraft", and "a jet engine". Our noun chunker can easily achieve this if we include number words (262) in the set of grammatical categories allowed inside a noun phrase.

But if we stop here, we miss a generalization: a text containing the chunk "fighter aircrafts" probably has the same topic, yet if we pass the chunks along as atomic features, we end up without a match, because that chunk is not string-identical to "the world's first fighter aircraft". To make the feature extractor/clusterer recognize the similarity, we do the following: stemming (which strips off the "s"), excluding determiners ("the") inside chunks, and building, from every chunk, the set of subsets that reflect its grammatical structure. For "the world's first fighter aircraft" we end up with the set {"world's first fighter aircraft", "first fighter aircraft", "fighter aircraft", "aircraft"}, thus detecting the similarity to the chunk "fighter aircrafts" (after stemming, that is).

One could argue: why take complete noun chunks in the first place, when they cannot easily be matched with other phrases? The answer is that noun groups can carry meanings that cannot be derived from their parts. For example, the chunk "bag of words" offers an excellent guess as to what an article is about (namely, text processing), but that is far from clear if you only look at the single words "bag", "of" and "words".

As for the words that are not nouns or parts of noun chunks, many of them can be left aside. A word like "good", for example, is not very specific when it comes to topic clustering. "good" is an adjective, "aircraft" is a noun, so a selection of topic-specific words can be made on the basis of grammatical categories; that's what we have the POS tagger for.

Any comments on this approach are of course welcome. In particular, I have a question about building n-grams (subsets) from noun chunks. Among the subsets of a noun chunk, we don't want to have something like "world's first"; that would surely spoil the clustering. Every subset should include the grammatical core of the chunk, in this example "aircraft". LingPipe's noun chunker is not able to identify the core, because it is based on a sequential parse of the POS tags. If you have a chunk like "wizard of warcraft", the core of the chunk is "wizard", sitting at the outer left of the chunk. To detect that, we would need a deep parser, but that seems to be much more costly. On an off-the-shelf dual-core computer with 4 gigs of memory, we can do the preprocessing of this e-mail within half a second.
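For what it's worth, here is a minimal sketch of the subset building described above, in plain Java rather than our actual UIMA/LingPipe code. It assumes a chunk arrives as parallel lists of tokens and POS tags, it treats the rightmost token as the grammatical core (a head-final assumption that holds for "fighter aircraft" but not for "wizard of warcraft"), and the class name, the "DT" check and the toy stemmer are all made up for this mail:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class ChunkSubsets {

    // Drop determiners, stem what is left, then emit every suffix of the
    // chunk; each suffix necessarily ends in the rightmost token, which we
    // treat as the grammatical core here.
    static List<String> subsets(List<String> tokens, List<String> posTags) {
        List<String> kept = new ArrayList<String>();
        for (int i = 0; i < tokens.size(); i++) {
            if (!"DT".equals(posTags.get(i))) {   // skip determiners like "the", "a"
                kept.add(stem(tokens.get(i)));
            }
        }
        List<String> result = new ArrayList<String>();
        for (int start = 0; start < kept.size(); start++) {
            StringBuilder sb = new StringBuilder();
            for (int i = start; i < kept.size(); i++) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(kept.get(i));
            }
            result.add(sb.toString());
        }
        return result;
    }

    // Toy stemmer, only for this example: strips a plural "s".
    static String stem(String token) {
        if (token.endsWith("s") && !token.endsWith("'s") && token.length() > 3) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    public static void main(String[] args) {
        // POS tags are made up for illustration.
        List<String> tokens = Arrays.asList("the", "world's", "first", "fighter", "aircraft");
        List<String> tags   = Arrays.asList("DT", "NN", "JJ", "NN", "NN");
        System.out.println(subsets(tokens, tags));
        // -> [world's first fighter aircraft, first fighter aircraft,
        //     fighter aircraft, aircraft]
    }
}

For a chunk like "wizard of warcraft" this produces "of warcraft" and "warcraft" but never "wizard" on its own, which is exactly the limitation I mean.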
I suspect that half-second figure would change dramatically if we switched to a deep parser, though. Or am I wrong?

Greetings, Felix
