krickert commented on code in PR #1056: URL: https://github.com/apache/opennlp/pull/1056#discussion_r3281179153
##########
opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/fi.txt:
##########
@@ -0,0 +1,96 @@
+# From https://snowballstem.org/algorithms/finnish/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+# - Encoding was converted to UTF-8.
+# - This notice was added.
+# - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+# forms of BE
+
+olla
+olen
+olet
+on
+olemme
+olette
+ovat
+ole
+
+oli
+olisi
+olisit
+olisin
+olisimme
+olisitte
+olisivat
+olit
+olin
+olimme
+olitte
+olivat
+ollut
+olleet
+
+en
+et
+ei
+emme
+ette
+eivät
+
+#Nom Gen Acc Part Iness Elat Illat Adess Ablat Allat Ess Trans
+minä minun minut minua minussa minusta minuun minulla minulta minulle

Review Comment:
   I checked the bundled files against the loader format (one line = one entry; whitespace on a line joins tokens into a single n-gram). fi.txt is the only language file with non-comment lines that contain multiple tokens (the Snowball paradigm rows around lines 44-64). Those lines are loaded as one long multi-word entry, so tokens like minä and sinä on those rows are not registered as individual stopwords.

   Russian keeps the same paradigm material in # comments, which works with the current parser. For Finnish we should either split those rows to one token per line (same as en.txt / de.txt) or comment them out like ru.txt.

   Worth adding a small test that a few common Finnish forms from those rows are recognized (not only ja, which is already on its own line).
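
   To make the failure mode in the review comment concrete, here is a minimal, self-contained sketch. It does not use OpenNLP's actual stopword loader: `StopwordFormatSketch`, `parseLinePerEntry`, and `splitRows` are hypothetical names, and the parsing rules are assumed from the comment's description only (`#` starts a comment line, every other non-blank line is one entry, and whitespace on a line joins its tokens into a single n-gram).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

public class StopwordFormatSketch {

    // Hypothetical stand-in for the loader format described in the review:
    // comment lines and blank lines are skipped, and each remaining line
    // becomes exactly one entry (its tokens joined into a single n-gram).
    static Set<String> parseLinePerEntry(List<String> lines) {
        Set<String> entries = new LinkedHashSet<>();
        for (String raw : lines) {
            String line = raw.strip();
            if (line.isEmpty() || line.startsWith("#")) {
                continue;
            }
            entries.add(String.join(" ", line.split("\\s+")));
        }
        return entries;
    }

    // One of the two proposed fixes: rewrite multi-token rows so that
    // every token sits on its own line. Comments and blanks pass through.
    static List<String> splitRows(List<String> lines) {
        List<String> out = new ArrayList<>();
        for (String raw : lines) {
            String line = raw.strip();
            if (line.isEmpty() || line.startsWith("#")) {
                out.add(raw);
            } else {
                out.addAll(Arrays.asList(line.split("\\s+")));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // A few lines taken from the fi.txt hunk quoted in this review.
        List<String> asSubmitted = List.of(
            "# forms of BE",
            "olla",
            "#Nom Gen Acc Part Iness Elat Illat Adess Ablat Allat Ess Trans",
            "minä minun minut minua minussa minusta minuun minulla minulta minulle");

        Set<String> before = parseLinePerEntry(asSubmitted);
        System.out.println(before.contains("olla")); // true
        System.out.println(before.contains("minä")); // false: buried in one long entry

        Set<String> after = parseLinePerEntry(splitRows(asSubmitted));
        System.out.println(after.contains("minä"));  // true
    }
}
```

   A regression test in the PR could follow the same shape as the last assertion, but it would need to go through the real loader and the bundled fi.txt resource rather than this stand-in.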
