On Tue, Feb 16, 2010 at 1:38 PM, Ted Dunning <[email protected]> wrote:

> To get a substantial improvement over these measures, I would recommend
> adding new data to the mix. The new data I would look at first is some sort
> of user behavior history. Do you have anything like that?
I don't have any behavioral history, but this corpus contains documents that were generated over a span of decades, so perhaps it is valid to partition the documents by time somehow. Identifying variable LLR across documents seems pretty interesting too.

I was also wondering whether comparing the n-grams found in this corpus against a general corpus would be a worthwhile endeavor. Some quick-and-dirty work suggests that the overlap in n-grams between this domain-specific corpus and a general one is pretty low, though I have some follow-up work to do there before I can be certain. The general corpora I have in hand are Wikipedia and a large set of documents collected from the web. I have the sneaking suspicion that these may not be general enough compared to those used for other statistical work of this ilk (e.g., the corpus used in the IBM MT work).

Drew
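For anyone following along: the domain-vs-general comparison above can be scored with Dunning's log-likelihood ratio (G^2) on a 2x2 contingency table per n-gram. This is just a minimal sketch, not code from the thread; the counts in the usage example are invented for illustration.

```python
import math

def llr_2x2(k11, k12, k21, k22):
    """Dunning's log-likelihood ratio (G^2) for a 2x2 contingency table.

    k11: occurrences of the n-gram in the domain corpus
    k12: occurrences of the n-gram in the general corpus
    k21: all other n-gram occurrences in the domain corpus
    k22: all other n-gram occurrences in the general corpus
    """
    def h(*counts):
        # "Denormalized entropy": sum k*ln(k) - N*ln(N); zero counts drop out.
        n = sum(counts)
        return sum(k * math.log(k) for k in counts if k > 0) - n * math.log(n)

    # G^2 = 2 * (H(cells) - H(row sums) - H(column sums))
    return 2.0 * (h(k11, k12, k21, k22)
                  - h(k11 + k12, k21 + k22)
                  - h(k11 + k21, k12 + k22))

# Hypothetical counts: an n-gram seen 120 times in a 50k-token domain
# corpus but only 3 times in a 5M-token general corpus.
score = llr_2x2(120, 3, 50_000 - 120, 5_000_000 - 3)
```

A high score flags n-grams whose frequency in the domain corpus is surprising given the general corpus, which is one way to quantify the "low overlap" observation; when the two corpora agree on an n-gram's relative frequency, the score goes to zero.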
