Re: Cluster text docs

Felix Lange Mon, 21 Dec 2009 00:33:18 -0800

Hi,
Ted, I agree: sentences don't need to be grammatical for our purposes. My
intention was just to cut out noun-less phrases like "very good". I think
that, in general, nouns say more about a topic than adjectives do, so I can
leave noun-less phrases aside and make the feature vector a bit smaller.
@Drew: Yes, we actually did some testing on unigrams, and the results
weren't that bad.
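
To make the filtering concrete, here is a minimal sketch of the idea, not the code we actually run. It assumes a noun chunk arrives as a list of tokens and that the position of the head noun is already known (for example from a chunker). The function name subsets_with_head and the sample chunk "world's first aircraft" are made up for illustration. It enumerates the power set of the chunk but keeps only the subsets that contain the head, so "world's first" never becomes a feature.

```python
from itertools import combinations


def subsets_with_head(tokens, head_index):
    """Return every order-preserving subset of `tokens` that contains the head.

    `head_index` is assumed to be supplied by an upstream chunker or parser.
    """
    # Indices of all tokens except the head; the head is added to every subset.
    others = [i for i in range(len(tokens)) if i != head_index]
    result = []
    for size in range(len(others) + 1):
        for picked in combinations(others, size):
            # Sort so the tokens keep their original order in the chunk.
            indices = sorted(picked + (head_index,))
            result.append(" ".join(tokens[i] for i in indices))
    return result


# Hypothetical chunk; the head noun "aircraft" sits at index 2.
chunk = ["world's", "first", "aircraft"]
features = subsets_with_head(chunk, head_index=2)
print(features)
```

A chunk of n tokens yields 2^(n-1) features this way, half the size of the full power set, which is where the smaller feature vector comes from.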


Greetings
Felix



2009/12/19 Ted Dunning <[email protected]>

> I think you are making a very big (and very wrong) assumption here.
>
> The non-grammaticality of these chunks does not generally adversely affect
> topic identification and can actually help it quite a bit.
>
> It is important to avoid "everybody knows" facts in your development at
> this point.  Even if everybody you talk to agrees that you don't even need
> to look at the data on this topic, you should still be suspicious of strong
> statements without data.
>
> On Sat, Dec 19, 2009 at 8:16 AM, Felix Lange <[email protected]>
> wrote:
>
> > In particular, I have a question about building n-grams (subsets) from
> > noun-chunks. In the power-sets of noun-chunks, we don't want to have
> > subsets like "world's first". That would surely spoil the clustering.
> > Every subset should include the grammatical core of the chunk, in this
> > example, "aircraft".
> >
>
>
>
> --
> Ted Dunning, CTO
> DeepDyve
>
