[ https://issues.apache.org/jira/browse/LUCENE-3745?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13200339#comment-13200339 ]
Christian Moen commented on LUCENE-3745:
----------------------------------------

I'm attaching some lexical assets that are useful for building stopwords and stoptag lists.

The frequency lists are made from ~1.5 million segmented Japanese Wikipedia documents after some scrubbing and handling. I'd prefer a more balanced corpus, but I believe Wikipedia will be fine for this purpose.

The following files are attached in TSV format using UTF-8 encoding:

* {{top-pos.txt}} - Part-of-speech tag distribution
* {{top-100000.txt}} - Top 100,000 most frequent surface forms and their frequencies
* {{top-1000000-pos.txt}} - Top 1,000,000 most frequent surface form and part-of-speech tag combinations and their frequencies

There's also a tool, {{filter_stoptags.py}}, attached. It reads a set of stoptags and evaluates it on {{top-1000000-pos.txt}} to give us an idea of what passes through any given stoptag set. An example with my current stoptag set is given below.

{noformat}
filter_stoptags.py -s stoptags.txt top-1000000-pos.txt
stop: 、 freq: 14426806 pos: 記号-読点
stop: の freq: 14212851 pos: 助詞-連体化
stop: 。 freq: 10553747 pos: 記号-句点
stop: は freq: 8956177 pos: 助詞-係助詞
stop: に freq: 8757138 pos: 助詞-格助詞-一般
stop: を freq: 7723958 pos: 助詞-格助詞-一般
stop:   freq: 7417005 pos: 記号-空白
stop: た freq: 7366368 pos: 助動詞
stop: が freq: 5427730 pos: 助詞-格助詞-一般
stop: て freq: 4874861 pos: 助詞-接続助詞
pass: し freq: 4312613 pos: 動詞-自立
stop: で freq: 3702106 pos: 助詞-格助詞-一般
stop:   freq: 3485125 pos: 記号-空白
stop: ) freq: 3049861 pos: 記号-括弧閉
stop: ( freq: 3045461 pos: 記号-括弧開
pass: れ freq: 2722773 pos: 動詞-接尾
pass: さ freq: 2441965 pos: 動詞-自立
stop: で freq: 2403133 pos: 助動詞
stop: ・ freq: 2250725 pos: 記号-一般
stop: も freq: 1962142 pos: 助詞-係助詞
pass: する freq: 1959374 pos: 動詞-自立
pass: いる freq: 1937789 pos: 動詞-非自立
stop: と freq: 1927529 pos: 助詞-格助詞-引用
pass: 年 freq: 1796435 pos: 名詞-接尾-助数詞
stop: 「 freq: 1701848 pos: 記号-括弧開
stop: と freq: 1697926 pos: 助詞-格助詞-一般
stop: 」 freq: 1672052 pos: 記号-括弧閉
stop: から freq: 1414661 pos: 助詞-格助詞-一般
stop: ある freq: 1400235 pos: 助動詞
stop:   freq: 1319235 pos: 記号-空白
pass: こと freq: 1272503 pos: 名詞-非自立-一般
stop: な freq: 1254673 pos: 助動詞
stop: が freq: 1110771 pos: 助詞-接続助詞
pass: の freq: 1037815 pos: 名詞-非自立-一般
stop: として freq: 1002940 pos: 助詞-格助詞-連語
stop:   freq: 989166 pos: 記号-空白
pass: い freq: 923836 pos: 動詞-非自立
(...)
{noformat}

> Need stopwords and stoptags lists for default Japanese configuration
> --------------------------------------------------------------------
>
> Key: LUCENE-3745
> URL: https://issues.apache.org/jira/browse/LUCENE-3745
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Christian Moen
> Attachments: filter_stoptags.py, top-100000.txt, top-1000000-pos.txt, top-pos.txt
>
>
> Stopwords and stoptags lists for Japanese need to be developed, tested and
> integrated into Lucene.
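
For readers without the attachments, the stop/pass evaluation described in the comment can be approximated with a few lines of Python. This is only a sketch and not the attached {{filter_stoptags.py}}: it assumes each row of the frequency file is {{surface<TAB>pos<TAB>freq}} (the actual column order in {{top-1000000-pos.txt}} is not stated in the comment), that a stoptag matches only on the full part-of-speech string, and that the stoptag file may contain blank lines and {{#}} comments. The function names {{load_stoptags}}, {{parse_rows}} and {{classify}} are made up for illustration.

```python
# Sketch of a stoptag evaluation pass. NOT the attached filter_stoptags.py.
# Assumed (hypothetical) input layout: surface<TAB>pos<TAB>freq, UTF-8.

def load_stoptags(lines):
    """Collect stoptags, skipping blank lines and '#' comment lines."""
    tags = set()
    for line in lines:
        line = line.strip()
        if line and not line.startswith("#"):
            tags.add(line)
    return tags


def parse_rows(lines):
    """Yield (surface, pos, freq) tuples from tab-separated lines."""
    for line in lines:
        line = line.rstrip("\r\n")  # keep other whitespace: a surface may be a space
        if not line:
            continue
        surface, pos, freq = line.split("\t")
        yield surface, pos, int(freq)


def classify(rows, stoptags):
    """Yield (verdict, surface, freq, pos); exact tag match is an assumption."""
    for surface, pos, freq in rows:
        verdict = "stop" if pos in stoptags else "pass"
        yield verdict, surface, freq, pos


# Three rows taken from the example output in the comment.
SAMPLE = (
    "、\t記号-読点\t14426806\n"
    "の\t助詞-連体化\t14212851\n"
    "し\t動詞-自立\t4312613\n"
)

if __name__ == "__main__":
    stoptags = load_stoptags(["# illustrative stoptags", "記号-読点", "助詞-連体化"])
    for verdict, surface, freq, pos in classify(parse_rows(SAMPLE.splitlines()), stoptags):
        print(f"{verdict}: {surface} freq: {freq} pos: {pos}")
```

With real files, the same functions could be fed {{open(path, encoding="utf-8")}} handles in place of the in-memory sample. Because the rows are sorted by frequency, reading the "pass" lines from the top shows which high-frequency tokens a given stoptag set would still let through.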