Re: [OT] All your medians are belong to me
On Monday, 21 November 2016 at 18:39:26 UTC, Andrei Alexandrescu wrote:
> On 11/21/2016 01:18 PM, jmh530 wrote:
>> I would just generate a bunch of integers randomly and use that, but I don't know if you specifically need to work with strings.
>
> I have that, too, but was looking for some real data as well. It would be a nice addition. -- Andrei

I don't know exactly what kind of data you need, but the European Union's Language Technology Resources corpora are made available to the research community. There are several data sets, in different formats (documents, alignments, XML) and in all European languages, that can be used for experiments and real-world applications. The data is in the public domain and free to use.

The DGT-TM data set is compiled by me and updated yearly. It consists of around 12 billion characters, or 1.8 billion words, or 111 million segments, in 28 languages.

https://ec.europa.eu/jrc/en/language-technologies
Re: [OT] All your medians are belong to me
On 11/21/2016 01:18 PM, jmh530 wrote:
> I would just generate a bunch of integers randomly and use that, but I don't know if you specifically need to work with strings.

I have that, too, but was looking for some real data as well. It would be a nice addition. -- Andrei
Re: [OT] All your medians are belong to me
On Monday, 21 November 2016 at 17:39:40 UTC, Andrei Alexandrescu wrote:
> Hey folks, I'm working on a paper on fast median computation, and https://issues.dlang.org/show_bug.cgi?id=16517 came to mind. [...] Also, if you can think of any large corpus that would be pertinent to median computation, please let me know!

You might find the following worthwhile:

http://opendata.stackexchange.com/questions/6114/dataset-for-english-words-of-dictionary-for-a-nlp-project

I would just generate a bunch of integers randomly and use that, but I don't know if you specifically need to work with strings.
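Something like this minimal D sketch, say (the array size and value range here are arbitrary; std.algorithm's topN does the selection):

    import std.algorithm.sorting : topN;
    import std.random : uniform;
    import std.stdio : writeln;

    void main()
    {
        // Arbitrary stand-in data: one million pseudo-random "frequencies".
        auto a = new ulong[](1_000_000);
        foreach (ref x; a)
            x = uniform(0UL, 1_000_000_000UL);

        // topN partitions the array so that a[n] is the element a full
        // sort would put at index n; with n = length / 2 that is the median.
        immutable n = a.length / 2;
        topN(a, n);
        writeln("median: ", a[n]);
    }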
[OT] All your medians are belong to me
Hey folks,

I'm working on a paper on fast median computation, and https://issues.dlang.org/show_bug.cgi?id=16517 came to mind.

I see the Google ngram corpus has occurrences of n-grams per year. Is the data aggregated over all years available somewhere? I'd like to compute e.g. "the word (1-gram) with the median frequency across all English books", so I don't need the frequencies per year, only the totals. Of course I could download the entire corpus and then do some processing, but that would take a long time.

Also, if you can think of any large corpus that would be pertinent to median computation, please let me know!

Thanks,
Andrei
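For concreteness, a rough D sketch of what I want to compute, assuming a hypothetical input of tab-separated "word, total count" lines already aggregated over all years:

    import std.algorithm.sorting : topN;
    import std.array : split;
    import std.conv : to;
    import std.stdio;
    import std.typecons : Tuple, tuple;

    void main()
    {
        // Hypothetical input: one "word<TAB>totalCount" line per 1-gram,
        // with counts already summed over all years.
        Tuple!(string, ulong)[] words;
        foreach (line; stdin.byLine)
        {
            auto fields = line.split('\t');
            words ~= tuple(fields[0].idup, fields[1].to!ulong);
        }

        // Partition by count around the middle index; the entry that lands
        // there is the word whose total frequency is the median.
        immutable mid = words.length / 2;
        words.topN!((a, b) => a[1] < b[1])(mid);
        writeln(words[mid][0], "\t", words[mid][1]);
    }

Here topN is just a stand-in for a selection routine; the input format is made up for illustration.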