Very nice!
On Jul 3, 2012, at 7:28 PM, Behdad Esfahbod wrote:

> Hi,
>
> As promised, here is the word-list data extracted from various language
> Wikipedias, ready for public consumption.
>
> There are 63 languages included. Chinese and Japanese (zh and ja) are
> intentionally left out as they were too big / not so interesting. Other than
> that, English is particularly large, as expected, and the rest vary in size,
> from a few thousand to tens of millions of unique words.
>
> Word frequency data is included in separate files. The format is bare
> minimum. I.e. there is no format. One word per line, sorted by decreasing
> frequency. Bzip2ed.
>
> The canonical source of the data is here:
>
> http://code.google.com/p/harfbuzz-testing-wikipedia/downloads/list
>
> With mirrors, including one big bzip2 file, here:
>
> http://www.freedesktop.org/software/harfbuzz/testing/texts/wikipedia/
> http://fedorapeople.org/groups/harfbuzz-testing/texts/wikipedia/
>
> License of the data is CC-BY-SA as is Wikipedia. I will publish the code
> generating these at some point. Thanks Roozbeh for extracting these.
>
> Cheers,
> behdad
> _______________________________________________
> HarfBuzz mailing list
> [email protected]
> http://lists.freedesktop.org/mailman/listinfo/harfbuzz
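
For anyone who wants to peek at one of these lists: since the format is one word per line, most frequent first, bzip2-compressed, a minimal Python sketch for reading the top entries could look like the following. The function name read_top_words is made up for illustration, and UTF-8 encoding is an assumption; the announcement does not state the encoding or the exact file names.

```python
import bz2
from itertools import islice


def read_top_words(path, limit=10):
    """Return up to `limit` words from the start of a bzip2-compressed word list.

    Assumes one word per line, most frequent first, encoded as UTF-8
    (the encoding is an assumption, not stated in the announcement).
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as lines:
        # Strip line endings and skip any blank lines.
        words = (line.strip() for line in lines)
        non_blank = (word for word in words if word)
        # Materialize the result before the file is closed.
        return list(islice(non_blank, limit))
```

Calling read_top_words("some-language.txt.bz2", 20) (a hypothetical file name) would then give the 20 most frequent words without decompressing the whole file to disk.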
