Hi Ted,

I agree, sentences don't need to be grammatical for our purposes. My intention was just to cut out noun-less phrases like "very good". I just think that, in general, nouns say more about a topic than adjectives, so I can leave those phrases aside and make the feature vector a bit smaller.

@Drew: Yes, we actually did some testing on unigrams, and the results weren't that bad.
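
A minimal sketch of the noun filter described above, assuming each chunk arrives as a list of (token, POS tag) pairs with Penn Treebank-style tags. The function name `noun_subsets` and the tag set are illustrative assumptions, not the code actually used in the project:

```python
from itertools import combinations

# Assumed Penn Treebank-style noun tags; adjust to whatever tagger is in use.
NOUN_TAGS = {"NN", "NNS", "NNP", "NNPS"}

def noun_subsets(chunk):
    """Return all order-preserving subsets of a tagged chunk that contain a noun.

    `chunk` is a list of (token, pos_tag) pairs. Subsets without any noun,
    such as "very good", are dropped to keep the feature vector smaller.
    """
    result = []
    for size in range(1, len(chunk) + 1):
        # combinations() keeps the original token order within each subset.
        for subset in combinations(chunk, size):
            if any(tag in NOUN_TAGS for _, tag in subset):
                result.append(" ".join(token for token, _ in subset))
    return result

chunk = [("very", "RB"), ("good", "JJ"), ("aircraft", "NN")]
print(noun_subsets(chunk))
```

Note that this only requires some noun in each subset; it does not try to identify the grammatical head of the chunk.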
Greetings,
Felix

2009/12/19 Ted Dunning <[email protected]>

> I think you are making a very big (and very wrong) assumption here.
>
> The non-grammaticality of these chunks does not generally adversely affect
> topic identification and can actually help it quite a bit.
>
> It is important to avoid "everybody knows" facts in your development at this
> point. Even if everybody you talk to agrees that you don't even need to
> look at the data on this topic, you should still be suspicious of strong
> statements without data.
>
> On Sat, Dec 19, 2009 at 8:16 AM, Felix Lange <[email protected]> wrote:
>
> > In particular, I have a question about building n-grams (subsets) from
> > noun-chunks. In the power-sets of noun-chunks, we don't want to have
> > subsets like "world's first". That would surely spoil the clustering.
> > Every subset should include the grammatical core of the chunk, in this
> > example, "aircraft".
>
> --
> Ted Dunning, CTO
> DeepDyve
