After some discussion in the #dspace channel, here are word frequency lists for Māori, for use in configuring Māori language support in search.
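For anyone who wants to work with the lists below, here is a minimal sketch of pulling out the N most frequent words, for example as candidates to review when configuring search. It assumes the input is in the same "count word" format that uniq -c produces. The function name top_words and the cut-off of 2 are hypothetical, chosen only for illustration, and the sample lines are copied from the tail of the first list.

```shell
# top_words N: read "count word" lines (uniq -c format) on stdin and
# print the N most frequent words, most frequent first.
# (Hypothetical helper; the name and interface are assumptions.)
top_words() {
    sort -rn | awk -v n="$1" 'NR <= n { print $2 }'
}

# Sample: the last four entries of the first list in this message.
sample='306391 e
310758 ki
474233 i
833949 te'

# Show the two most frequent words in the sample.
printf '%s\n' "$sample" | top_words 2
```

On the sample this prints "te" and then "i". The sort is redone inside the function so the input does not have to be sorted already; how many words to take is a judgement call that depends on the corpus and the search configuration.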
The first set of words is derived ultimately from the Māori Niupepa Collection at http://www.nzdl.org/niupepa. These are mainly 19th-century newspapers in traditional orthography (= no macrons). The command line used to generate the list was:

    cat [0-9]*/doc.xml |
      sed 's/<Metadata [^>]*>[^>]*>//' |
      sed 's/<[^>]*>//g' |
      sed 's/<[^;]*;//g' |
      grep -vi '[qysdflbvxzc]' |
      tr -cs '[^a-zA-Z]' '\012' |
      tr '[A-Z]' '[a-z]' |
      sort | uniq -c | sort -n

    9534 ata
    9673 wha
    9734 kino
    9975 motu
    10025 kahore
    10064 katahi
    10083 tahi
    10086 marama
    10120 whai
    10492 ingoa
    10532 wahine
    10548 wa
    10610 kau
    10685 muri
    10740 heoi
    10748 mau
    10864 pa
    10941 kawanatanga
    11182 kaha
    11299 rangatira
    11308 whaka
    11518 ahua
    11560 taha
    11592 tamariki
    11619 rongo
    11717 hui
    12014 mana
    12016 mohio
    12244 ora
    12353 rua
    12697 take
    13280 puta
    13977 engari
    14079 taku
    14103 wahi
    14201 ona
    14417 ahau
    14789 raua
    15247 ture
    15407 kotahi
    15688 utu
    15813 reira
    16081 kite
    16249 noho
    16332 moni
    16586 tera
    16993 whakaaro
    17184 tino
    17299 iho
    17416 tika
    17721 enei
    18284 ara
    18680 tena
    19048 ranei
    19086 tana
    19758 hoa
    20478 koe
    21323 tikanga
    21592 tatou
    21719 aua
    21987 noa
    22317 tae
    22356 whare
    22374 tu
    22400 etahi
    22415 matou
    22916 kaore
    23346 ake
    23887 rawa
    25629 au
    25816 mate
    26329 pakeha
    27063 tau
    27326 ta
    27380 kore
    27961 koutou
    27977 tonu
    28943 kupu
    31213 tona
    31711 pai
    31881 runga
    31903 korero
    32158 roto
    34556 whenua
    34991 tetahi
    35540 katoa
    36485 no
    36609 nui
    36674 kai
    37706 haere
    38645 iwi
    38904 to
    39048 kei
    39090 ma
    42886 ra
    45527 mahi
    48027 hei
    48219 taua
    50332 na
    50620 ratou
    55736 maori
    56889 kua
    58158 hoki
    61591 ano
    65470 ia
    69170 tenei
    72397 tangata
    72615 mea
    72836 ai
    75123 nei
    77790 atu
    80224 mo
    82101 mai
    90419 me
    98716 kia
    117543 ana
    141715 ka
    147326 ko
    156732 he
    193002 a
    283013 nga
    302250 o
    306391 e
    310758 ki
    474233 i
    833949 te

The second set of words is derived from a private corpus (not distributable for copyright reasons).
This is modern text (20th and 21st century), primarily in modern orthography (= macrons are used) and primarily from government and official channels. The command line used to generate the list was:

    cat *.xml |
      sed 's/<[^ ]* xml:lang="en">[^>]*>//' |
      sed 's/<[^>]*>//g' |
      tr ' \(\)\{\}\[\];:",.0-9-' '\012' |
      grep -vi '[qysdflbvxzc]' |
      tr '[A-Z]' '[a-z]' |
      sort | uniq -c | sort -n

    3043 ota
    3096 mō
    3136 ara
    3204 kaunihera
    3206 kore
    3303 £
    3346 taha
    3387 tu
    3403 rohe
    3406 iho
    3432 noho
    3462 the
    3588 riihi
    3728 tae
    3787 whakahaere
    3841 nui
    3860 koe
    3887 aua
    3892 etahi
    4055 mau
    4057 tona
    4063 iwi
    4078 tika
    4168 utu
    4209 pukapuka
    4258 poraka
    4278 take
    4391 reira
    4424 wahi
    4533 tekau
    4623 tekiona
    4634 whai
    4636 tonu
    4702 haere
    4721 ā
    4770 tuku
    4935 no
    5099 takiwa
    5102 tono
    5331 ano
    5359 nama
    5383 ingoa
    5560 na
    5586 kupu
    5917 to
    6164 mana
    6295 ake
    6348 mea
    6712 katoa
    7111 mahi
    7113 moni
    7161 kooti
    7233 ratou
    7288 tau
    7704 tikanga
    7845 raro
    7945 kei
    7956 ma
    8562 ranei
    8733 kai
    9005 hoki
    9856 ra
    10239 hei
    10271 tetahi
    10695 ai
    11492 roto
    11612 tenei
    11654 tangata
    11664 runga
    11855 ture
    12186 mai
    12527 ia
    13468 kua
    14336 taua
    15843 nei
    16143 maori
    18231 mo
    18411 ngā
    18833 kia
    21203 atu
    22526 he
    22921 ana
    25268 whenua
    26545 me
    28052 ka
    32916 ko
    44994 a
    52515 nga
    60488 e
    69607 ki
    92490 o
    128901 i
    208905 te

--
Stuart Yeates
http://www.nzetc.org/ New Zealand Electronic Text Centre
http://researcharchive.vuw.ac.nz/ Institutional Repository

_______________________________________________
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech