Hey folks, I'm working on a paper on fast median computation, and https://issues.dlang.org/show_bug.cgi?id=16517 came to mind. I see that the Google ngram corpus lists occurrences of n-grams per year. Is data aggregated over all years available somewhere? I'd like to compute, e.g., "the word (1-gram) with the median frequency across all English books", so I don't need the per-year frequencies, only the totals.

Of course I can download the entire corpus and then do some processing, but that would take a long time.
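
For concreteness, the do-it-myself processing would look roughly like the sketch below. It assumes each line of a 1-gram file is tab-separated as "ngram, year, match_count, volume_count" (my reading of the published format, so please correct me if that's off). The function names and the tiny inline sample are made up for illustration, and the median is found with a plain library call rather than anything fast.

```python
import statistics
from collections import Counter

# Hypothetical sample in the assumed layout: ngram \t year \t match_count \t volume_count
SAMPLE = (
    "apple\t1990\t5\t2\n"
    "apple\t1991\t7\t3\n"
    "banana\t1990\t3\t1\n"
    "cherry\t1990\t20\t4\n"
    "cherry\t1991\t10\t2\n"
)

def total_counts(lines):
    """Sum match_count over all years for each n-gram."""
    totals = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        ngram, match_count = fields[0], fields[2]
        totals[ngram] += int(match_count)
    return totals

def median_frequency_word(totals):
    """Return (word, count) for the word whose total count is the median.

    median_low is used so the result is always an actual data value;
    ties on the count are broken alphabetically.
    """
    target = statistics.median_low(totals.values())
    word = min(w for w, c in totals.items() if c == target)
    return word, target

totals = total_counts(SAMPLE.splitlines())
print(median_frequency_word(totals))
```

On the real data the input would be the lines of the downloaded files instead of SAMPLE, which is exactly the slow part I'd like to avoid.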

Also, if you can think of any large corpus that would be pertinent for median computation, please let me know!


Thanks,

Andrei
