On Monday, 21 November 2016 at 17:39:40 UTC, Andrei Alexandrescu
wrote:
Hey folks, I'm working on a paper for fast median computation
and https://issues.dlang.org/show_bug.cgi?id=16517 came to
mind. I see the Google ngram corpus has occurrences of n-grams
per year. Is data aggregated for all years available somewhere?
I'd like to compute e.g. "the word (1-gram) with the median
frequency across all English books" so I don't need the
frequencies per year, only totals.
Of course I can download the entire corpus and then do some
processing, but that would take a long time.
Also, if you can think of any large corpus that would be
pertinent for median computation, please let me know!
Thanks,
Andrei
You might find the following worthwhile:
http://opendata.stackexchange.com/questions/6114/dataset-for-english-words-of-dictionary-for-a-nlp-project
I would just randomly generate a bunch of integers and use those,
but I don't know if you specifically need to work with strings.
Something along these lines (sketch below):
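A minimal D sketch of what I mean; the array size and value range
are arbitrary placeholders, and I'm using std.algorithm.sorting.topN
as a stand-in median selector (expected O(n)):

import std.algorithm.sorting : topN;
import std.random : uniform;
import std.stdio : writeln;

void main()
{
    // Odd length so the median is a single element.
    enum n = 1_000_001;

    // Fill with uniformly random integers; bounds are placeholders.
    auto data = new int[n];
    foreach (ref x; data)
        x = uniform(0, 1_000_000);

    // topN reorders data so that data[n / 2] holds the (n/2)-th
    // smallest element, i.e. the median for odd n.
    topN(data, n / 2);
    writeln("median: ", data[n / 2]);
}

You could then benchmark your algorithm against topN on the same
random arrays, varying the distribution if uniform data is too easy.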