I think you are making a very big (and very wrong) assumption here. The non-grammaticality of these chunks does not generally adversely affect topic identification and can actually help it quite a bit.
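One way to check this against data rather than by assumption is to generate both feature sets and run them through the same clustering. Below is a minimal sketch in Python. The function name `chunk_subsets`, the `head` parameter, and the three-token example chunk are all hypothetical, chosen only to match the "world's first" / "aircraft" example in the quoted message. It is not Mahout code.

```python
from itertools import combinations

def chunk_subsets(tokens, head=None):
    """Return every non-empty, order-preserving subset of a noun chunk.

    If `head` is given, keep only the subsets that contain it
    (the "grammatical core" restriction from the quoted message).
    """
    subsets = []
    for size in range(1, len(tokens) + 1):
        # combinations() preserves the original token order
        for combo in combinations(tokens, size):
            if head is None or head in combo:
                subsets.append(" ".join(combo))
    return subsets

# Assumed example chunk; the original message only mentions
# the subset "world's first" and the head "aircraft".
chunk = ["world's", "first", "aircraft"]

all_subsets = chunk_subsets(chunk)                    # includes "world's first"
head_only = chunk_subsets(chunk, head="aircraft")     # excludes it
```

Clustering on `all_subsets` and on `head_only` and comparing the results on held-out data would show whether the non-grammatical subsets actually hurt.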
It is important to avoid "everybody knows" facts in your development at this point. Even if everybody you talk to agrees that you don't even need to look at the data on this topic, you should still be suspicious of strong statements without data.

On Sat, Dec 19, 2009 at 8:16 AM, Felix Lange <[email protected]> wrote:

> In particular, I have a question about building n-grams (subsets) from
> noun-chunks. In the power-sets of noun-chunks, we don't want to have
> subsets like "world's first". That would surely spoil the clustering.
> Every subset should include the grammatical core of the chunk, in this
> example, "aircraft".

--
Ted Dunning, CTO
DeepDyve
