Re: [PR] OPENNLP-660: Include list of stop words for various languages (opennlp)

via GitHub Thu, 21 May 2026 05:38:08 -0700


krickert commented on code in PR #1056:
URL: https://github.com/apache/opennlp/pull/1056#discussion_r3281179153



##########
opennlp-core/opennlp-runtime/src/main/resources/opennlp/tools/stopword/fi.txt:
##########
@@ -0,0 +1,96 @@
+# From https://snowballstem.org/algorithms/finnish/stop.txt
+# This file is distributed under the BSD License.
+# See https://snowballstem.org/license.html
+# Also see https://opensource.org/licenses/bsd-license.html
+#  - Encoding was converted to UTF-8.
+#  - This notice was added.
+#  - Comments were changed from `|` to `#` so that this list can be parsed by OpenNLP's stopword loader.
+#
+
+# forms of BE
+
+olla
+olen
+olet
+on
+olemme
+olette
+ovat
+ole
+
+oli
+olisi
+olisit
+olisin
+olisimme
+olisitte
+olisivat
+olit
+olin
+olimme
+olitte
+olivat
+ollut
+olleet
+
+en
+et
+ei
+emme
+ette
+eivät
+
+#Nom   Gen    Acc    Part   Iness   Elat    Illat  Adess   Ablat   Allat   Ess     Trans
+minä   minun  minut  minua  minussa minusta minuun minulla minulta minulle

Review Comment:
   I checked the bundled files against the loader format (one line = one entry; 
whitespace on a line joins tokens into a single n-gram).
   
   fi.txt is the only language file whose non-comment lines contain multiple 
tokens (the Snowball paradigm rows around lines 44-64). Each of those lines is 
loaded as a single multi-word entry, so tokens such as minä and sinä on those 
rows are never registered as individual stopwords.
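
To illustrate the failure mode, here is a minimal Python sketch of a loader that follows the format described above (one line = one entry, `#` starts a comment, whitespace joins tokens into one n-gram). The function `load_entries` and the abbreviated sample are hypothetical and written only for this illustration; this is not OpenNLP's actual loader code.

```python
def load_entries(text):
    """Hypothetical loader: each non-comment line becomes ONE entry.

    An entry is a tuple of the whitespace-separated tokens on that line,
    so a line with several tokens yields a single multi-word entry.
    """
    entries = set()
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        entries.add(tuple(line.split()))
    return entries


# Abbreviated sample in the shape of fi.txt (illustrative, not the full file).
sample = """\
# forms of BE
olla
ja
minä   minun  minut  minua
"""

entries = load_entries(sample)

# Single-token lines are registered as individual stopwords.
print(("olla",) in entries)
# The paradigm row is stored as one 4-token entry, so "minä" alone is missing.
print(("minä",) in entries)
print(("minä", "minun", "minut", "minua") in entries)
```

Under these assumptions, a lookup for the single token minä misses, even though the word appears in the file.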
   
   ru.txt keeps the same paradigm material in # comments, which works with the 
current parser. For Finnish we should either split those rows into one token 
per line (as in en.txt / de.txt) or comment them out as ru.txt does.
   
   It would also be worth adding a small test asserting that a few common 
Finnish forms from those rows are recognized (not only ja, which is already on 
its own line).
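
A rough Python sketch of the "one token per line" option and the kind of check suggested above. The helpers `split_rows` and `single_word_stopwords` are hypothetical names invented for this example, the sample rows are abbreviated, and the parsing rule is the one assumed in this comment rather than OpenNLP's real implementation; an actual test would go through the project's own loader.

```python
def split_rows(text):
    """Rewrite multi-token rows as one token per line; keep comments and blanks."""
    out = []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#"):
            out.append(line)  # comment or blank line: leave untouched
        else:
            out.extend(stripped.split())  # one output line per token
    return "\n".join(out) + "\n"


def single_word_stopwords(text):
    """Hypothetical lookup set: only lines holding exactly one token count."""
    words = set()
    for line in text.splitlines():
        tokens = line.strip().split()
        if tokens and not tokens[0].startswith("#") and len(tokens) == 1:
            words.add(tokens[0])
    return words


# Abbreviated paradigm rows in the shape of fi.txt (illustrative only).
sample = "#Nom   Gen    Acc\nminä   minun  minut\nsinä   sinun  sinut\nja\n"

before = single_word_stopwords(sample)
after = single_word_stopwords(split_rows(sample))

# Before the rewrite only "ja" is usable as a single-word stopword.
print(sorted(before))
# After the rewrite the forms from the paradigm rows are recognized too.
print("minä" in after and "sinä" in after)
```

The same assertions on minä and sinä could serve as a regression test if the rows are commented out instead, with the expectation flipped to "not recognized".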



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
