(opennlp-models) 01/01: OPENNLP-1642 Provide a README.md section for ud-train script - adds relevant infos on Model Training via ud-train to README.md - switches CREATE_RELEASE to 'false' by default, as non-PMC members won't be able to sign with a key from the KEYS file - updates default `UD_HOME` value to use ud treebanks release version 2.15 (released Nov 15 2024)

mawiesne Fri, 15 Nov 2024 14:56:24 -0800

This is an automated email from the ASF dual-hosted git repository.

mawiesne pushed a commit to branch OPENNLP-1642-Provide-a-README.md-section-for-ud-train-script
in repository https://gitbox.apache.org/repos/asf/opennlp-models.git


commit 1a51a8dbadb03dbac2078cbe327e054ef7060bb4
Author: Martin Wiesner <[email protected]>
AuthorDate: Fri Nov 15 23:50:32 2024 +0100

    OPENNLP-1642 Provide a README.md section for ud-train script
    - adds relevant infos on Model Training via ud-train to README.md
    - switches CREATE_RELEASE to 'false' by default, as non-PMC members won't be able to sign with a key from the KEYS file
    - updates default `UD_HOME` value to use ud treebanks release version 2.15 (released Nov 15 2024)
---
 README.md                                          | 86 +++++++++++++++++++++-
 .../src/main/resources/ud-train.sh                 |  4 +-
 2 files changed, 85 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 0120d8e..458ad88 100644
--- a/README.md
+++ b/README.md
@@ -81,10 +81,27 @@ It is compatible with OpenNLP `>= 1.8.3`. Model details are available [here](htt
 
 ## Getting Started
 
-You can import a model artifact directly via Maven, SBT or Gradle, for instance:
+The [Universal Dependencies](https://universaldependencies.org) (UD) community provides a framework for consistent annotation of grammar across different human languages.
+The project is developing cross-linguistically consistent treebank annotation for 150+ languages.
+
+### Referencing published Models
+
+You can import UD-based model artifacts directly via Maven, SBT or Gradle, for instance:
 
 #### Maven
 
+```
+<dependency>
+    <groupId>org.apache.opennlp</groupId>
+    <artifactId>opennlp-models-pos-de</artifactId>
+    <version>${opennlp.models.version}</version>
+</dependency>
+```
+
+for all **23** supported languages, listed on the Apache OpenNLP [Model page](https://opennlp.apache.org/models.html).
+
+The broader langdetect model can be referenced like this:   
+
 ```
 <dependency>
     <groupId>org.apache.opennlp</groupId>
@@ -107,9 +124,72 @@ compile group: "org.apache.opennlp", name: "opennlp-models-langdetect", version:
 
 For more details please check our [documentation](https://opennlp.apache.org/docs/)
 
-## Adding a new Model
 
-Ensure to add a new model to the `expected-models.txt` file located in `opennlp-models-test`.
+### Training Models
+
+All released _sentence detection_, _tokenization_, _lemmatizer_, and _POS tagging_ models were and can be trained via the `ud-train.sh` script.
+It is located in the _opennlp-models-training-ud_ directory in this repository.
+
+#### Preparing the environment
+
+Before training UD-based OpenNLP models, the execution environment needs the latest [OpenNLP release](https://opennlp.apache.org/download.html) and the latest set of [UD treebanks](https://universaldependencies.org/#download).
+Download the corresponding archive files and uncompress them both in the same directory in which the training script resides.
+Rename both folders according to the `OPENNLP_HOME` and `UD_HOME` variables. 
+
+> [!IMPORTANT]
+> Check and adjust the version string in both variables, that is, to the versions you have actually downloaded.
+
+#### Selecting model types
+
+Next, select what type of models should be trained. By default, the script defines:
+
+```
+TRAIN_TOKENIZER="true"
+TRAIN_POSTAGGER="true"
+TRAIN_SENTDETECT="true"
+TRAIN_LEMMATIZER="true"
+```
+
+Simply switch off a certain type, by setting the corresponding variable to false.
+
+#### Selecting languages
+
+By default, treebanks of 23 supported languages are included in the `MODELS` variable of the script.
+If only a smaller or different (sub-)set is required, this variable can simply be edited.
+The format must be followed: `<Language>|<2-digit-locale-code>|<UD treebank name>`, for example: `English|en|EWT` or `Swedish|sv|Talbanken`.
+
+> [!NOTE]
+> The full list of supported languages and related treebanks is available [here](https://universaldependencies.org/#current-ud-languages).
+> Yet, even listed on the UD page, training OpenNLP models might not succeed. If it succeeds, check the evaluation logs (_*.eval_) if the computed accuracy meets your expectations.
+                       
+#### Adjusting training parameters
+
+Once you're done with the preparations, check the `ud-train.conf` file. With this config file, you can adjust the number of threads used for certain training steps.
+Moreover, it is possible to adjust the number of iterations (default: 150) to achieve (slightly) better model performance.
+
+#### Executing 'ud-train.sh'
+
+Make sure to make the `ud-train.sh` script executable. 
+On Unix-oid environments this can simply be achieved by setting the execute bit: `chmod 744 ud-train.sh`.
+
+> [!TIP]
+> As model training(s) can be a long-running task, depending on CPU type and number of CPU cores,
+> the script should be started inside a [`screen`](https://www.man7.org/linux/man-pages/man1/screen.1.html) instance.
+
+Finally, execute the script via invoking `./ud-train.sh` and start brewing and enjoying some :coffee:.
+
+The script logs each training (and evaluation) step per selected language / treebank, thus allowing progress tracking.
+
+#### Evaluating trained Models
+
+After a training step succeeds, a corresponding evaluation step is executed. If you want to skip it, set `EVAL_AFTER_TRAINING` to `false`.
+In case the evaluation is run, the resulting performance (accuracy) is written to files ending with `.eval`.
+
+### Adding new Models
+
+When adding new models to the `pom.xml`, ensure to add new models to the `expected-models.txt` file located in `opennlp-models-test`.
+In addition, make sure a sha256 hash is computed on each binary artifact.
+The corresponding value must be set or updated correctly for each model type and language.
 
 ## Contributing
 
diff --git a/opennlp-models-training/opennlp-models-training-ud/src/main/resources/ud-train.sh b/opennlp-models-training/opennlp-models-training-ud/src/main/resources/ud-train.sh
index 867e250..bf4c625 100755
--- a/opennlp-models-training/opennlp-models-training-ud/src/main/resources/ud-train.sh
+++ b/opennlp-models-training/opennlp-models-training-ud/src/main/resources/ud-train.sh
@@ -39,7 +39,7 @@ OPENNLP_VERSION_NUMERIC="2.5.0"
 # The directory the resulting binary models are written to
 OUTPUT_MODELS="./ud-models-2.5.0"
 # The directory the ud treebanks are located in
-UD_HOME="./ud-treebanks-v2.14"
+UD_HOME="./ud-treebanks-v2.15"
 
 #################################################
 # Parameters for training, evaluation & release #
@@ -54,7 +54,7 @@ EVAL_AFTER_TRAINING="true"
 # If 'true, training of experimental languages will be attempted, otherwise only stable languages & treebanks are used
 EXPERIMENTAL_LANGUAGES="false"
 # If 'true', all release preparation steps are conducted, 'false' otherwise
-CREATE_RELEASE="true"
+CREATE_RELEASE="false"
 # The public key from the OPENNLP KEYS file in short form
 GPG_PUBLIC_KEY=""
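
Note on the `MODELS` entry format documented in the README hunk above (`<Language>|<2-digit-locale-code>|<UD treebank name>`): since a training run can take a long time, it may be worth checking edited entries before starting it. The following is only a minimal sketch and is not part of this commit. The helper `parse_model` is hypothetical, and the space-separated `MODELS` string is an assumption made for illustration; the real script may store its entries differently.

```shell
#!/bin/sh
# Hypothetical example list; NOT the default value from ud-train.sh.
MODELS="English|en|EWT Swedish|sv|Talbanken"

# parse_model (hypothetical helper): splits one entry on '|' and
# rejects entries in which any of the three fields is empty.
parse_model() {
  entry="$1"
  # IFS is changed only for this single 'read' invocation.
  IFS='|' read -r lang code treebank <<EOF
$entry
EOF
  if [ -z "$lang" ] || [ -z "$code" ] || [ -z "$treebank" ]; then
    echo "invalid entry: $entry" >&2
    return 1
  fi
  echo "$lang $code $treebank"
}

# Print the three fields of every entry, one entry per line.
for entry in $MODELS; do
  parse_model "$entry"
done
```

An entry such as `English|en` lacks the treebank name, so the helper reports it on stderr and returns a non-zero status.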