On Tue, Feb 16, 2010 at 1:38 PM, Ted Dunning <[email protected]> wrote:

> To get a substantial improvement over these measures, I would recommend
> adding new data to the mix. The new data I would look at first is some sort
> of user behavior history. Do you have anything like that?
I don't have any behavioral history, but this corpus contains documents that were generated over a span of decades, so perhaps it is valid to partition the documents by time somehow. Identifying variable LLR across documents seems pretty interesting too.

I was also wondering whether comparing the n-grams found in this corpus against a general corpus would be a worthwhile endeavor. Some quick-and-dirty work suggests that the overlap in n-grams between this domain-specific corpus and a general one is pretty low, though I have some follow-up work to do there before I can be certain. The general corpora I have in hand are Wikipedia and a large set of documents collected from the web. I have the sneaking suspicion that these may not be general enough compared to those used for other statistical work of this ilk (e.g., the corpus used in the IBM MT work).

Drew
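For anyone following along: the domain-vs-general comparison above can be scored with Dunning's log-likelihood ratio (G^2) on a 2x2 contingency table per n-gram. This is just a minimal sketch, not code from the thread; the counts in the usage example are invented for illustration.

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: occurrences of the n-gram in the domain corpus
    k12: occurrences of the n-gram in the general corpus
    k21: all other n-gram occurrences in the domain corpus
    k22: all other n-gram occurrences in the general corpus
    """
    def h(*counts):
        # "Denormalized entropy": sum k*ln(k) - N*ln(N); zero counts drop out.
        n = sum(counts)
        return sum(k * math.log(k) for k in counts if k > 0) - n * math.log(n)

    # G^2 = 2 * (H(cells) - H(row sums) - H(column sums))
    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)
                  - h(k11 + k21, k12 + k22))

# Hypothetical counts: an n-gram seen 120 times in a 50k-token domain
# corpus but only 3 times in a 5M-token general corpus.
score = llr_2x2(120, 3, 50_000 - 120, 5_000_000 - 3)
```

A high score flags n-grams whose frequency in the domain corpus is surprising given the general corpus, which is one way to quantify the "low overlap" observation; when the two corpora agree on an n-gram's relative frequency, the score goes to zero.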
