Hi all,
I have posted a first release candidate for the Apache OpenNLP 3.0.0-M4 release
and it is ready for testing.
The 3.x release line of Apache OpenNLP introduces no known breaking changes
while significantly modularizing the project to improve library usage and
future extensibility.
The core API remains stable and fully compatible with 2.x, so existing projects
can continue using the opennlp-tools artifact without modifications.
Key Highlights:
• New Features:
• Include list of stop words for various languages (OPENNLP-660)
• Add SymSpell-based spell correction component (OPENNLP-1832)
• Add BertTokenizer with BERT basic tokenization (OPENNLP-1837)
• Bug Fixes:
• This release ships four bug fixes for: OPENNLP-1826, OPENNLP-1836,
OPENNLP-1839, and OPENNLP-1840
• Improvements:
• Harden SvmDoccatModel.deserialize() with ObjectInputFilter and
resource limits (OPENNLP-1823)
• Tolerate unsupported XML parser security options (OPENNLP-1835)
• Fix NameFinderDL, which only worked with Person entities, and expand it
to all entity types (OPENNLP-1846)
• Several dependencies were updated; see the Jira release notes (URL
below) for the full listing
• Some minor tasks have been completed
• IMPORTANT Changes:
• The ONNX input encoding in SentenceVectorsDL was fixed, which changes
the produced sentence vectors. Any embeddings persisted with the old encoding
are not comparable to the new output and must be re-generated. (OPENNLP-1836 -
PR #1072)
• WordpieceTokenizer (public API, used by opennlp-dl) now splits
punctuation runs into single tokens, collapses partially-matched words to a
single [UNK], and throws from tokenizePos instead of returning null. These
changes alter the tokenization output for existing callers. (OPENNLP-1837 - PR #1073)
• NameFinderDL now decodes all BIO entity types (PER/ORG/LOC/…) instead
of only persons. Span.getType() now returns the entity label rather than the
covered text, which is a contract change for existing callers. (OPENNLP-1846 -
PR #1086)
• The opennlp-dl components are now thread-safe; as part of this,
loadVocab became public static (source- and binary-incompatible) and
AbstractDL's implicit no-arg constructor was removed. Both affect downstream
code that calls loadVocab or extends AbstractDL. (OPENNLP-1844 - PR #1084)
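For anyone checking how the WordpieceTokenizer change affects their token output, the behaviour described above (each punctuation character becomes its own token, and a word that cannot be fully matched collapses to a single [UNK]) can be illustrated with a small Python sketch. This is not the OpenNLP implementation; it assumes the usual BERT-style greedy longest-match scheme with a "##" continuation prefix, and the function names and the toy vocabulary are hypothetical.

```python
import string

UNK = "[UNK]"


def split_punctuation(text):
    """Split on whitespace and emit every punctuation character as its own token."""
    tokens, current = [], []
    for ch in text:
        if ch.isspace() or ch in string.punctuation:
            # Flush the word collected so far, if any.
            if current:
                tokens.append("".join(current))
                current = []
            # A run such as "!!" yields one token per character.
            if not ch.isspace():
                tokens.append(ch)
        else:
            current.append(ch)
    if current:
        tokens.append("".join(current))
    return tokens


def wordpiece(word, vocab):
    """Greedy longest-match-first; the whole word becomes [UNK] if any part fails."""
    pieces, start = [], 0
    while start < len(word):
        end, match = len(word), None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # assumed continuation marker
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # Partially matched words are not emitted piecewise.
            return [UNK]
        pieces.append(match)
        start = end
    return pieces


def tokenize(text, vocab):
    out = []
    for word in split_punctuation(text):
        out.extend(wordpiece(word, vocab))
    return out
```

With a toy vocabulary of {"un", "##aff", "##able", "play", "!"}, the input "unaffable!!" yields un, ##aff, ##able, !, ! while "playxyz" yields a single [UNK], even though "play" alone is in the vocabulary.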
Thank you to everyone who contributed to this release, including all of our
users and the people who submitted bug reports, contributed code or
documentation enhancements.
The release was made using the OpenNLP release process, documented on the
website:
https://opennlp.apache.org/release.html
Maven Repo:
https://repository.apache.org/content/repositories/orgapacheopennlp-1070
<repositories>
  <repository>
    <id>opennlp-3.0.0-M4-RC1</id>
    <name>Testing OpenNLP 3.0.0-M4 release candidate</name>
    <url>https://repository.apache.org/content/repositories/orgapacheopennlp-1070</url>
  </repository>
</repositories>
Binaries & Source:
https://dist.apache.org/repos/dist/dev/opennlp/opennlp-3.0.0-M4-rc1/
Tag:
https://github.com/apache/opennlp/releases/tag/opennlp-3.0.0-M4
Tag Hash: 1e05d1ef5a7c35b83015ebce87bb9a43c55e2226
Release notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=12311215&version=12356941
The results of the eval tests for the aforementioned tag can be found
here: https://ci-builds.apache.org/job/OpenNLP/job/eval-tests-releases/35/
Reminder: The up-to-date KEYS file for signature verification can be
found here: https://dist.apache.org/repos/dist/release/opennlp/KEYS
Checklist for reference:
[ ] Both source (tar.gz/zip) and binary artifacts (tar.gz/zip) are present,
along with .asc and .sha512 files for each.
[ ] PGP signatures are valid for the release artifacts using the KEYS file from
dist.apache.org
[ ] SHA512 checksums are correct and verified.
[ ] LICENSE and NOTICE files exist and are accurate.
[ ] No unexpected binary files in the source release.
[ ] All source files have appropriate ASF headers (excluding generated files
and legacy files).
[ ] Build completes successfully from source and the instructions to do so are
clear.
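For the SHA512 item in the checklist, a minimal Python sketch of the comparison is shown below. It assumes each artifact has a sibling file named like the artifact plus ".sha512" and that the hex digest is the first whitespace-separated token in that file; the helper names are hypothetical. It does not replace the PGP signature check, which still needs gpg and the KEYS file.

```python
import hashlib
from pathlib import Path


def sha512_of(path, chunk_size=1 << 20):
    """Return the hex SHA-512 digest of a file, reading it in chunks."""
    digest = hashlib.sha512()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(artifact, checksum_file=None):
    """Compare an artifact's digest with the one recorded in its .sha512 file."""
    artifact = Path(artifact)
    if checksum_file is None:
        # Assumption: the checksum sits next to the artifact as <name>.sha512
        checksum_file = artifact.with_name(artifact.name + ".sha512")
    fields = Path(checksum_file).read_text().split()
    if not fields:
        return False
    # Assumption: the digest is the first token; a trailing file name is ignored.
    return sha512_of(artifact) == fields[0].lower()
```

Calling checksum_matches() on each downloaded tar.gz and zip file returns True only when the recorded and computed digests agree.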
Please vote on releasing these packages as Apache OpenNLP 3.0.0-M4.
The vote is open for at least the next 72 hours.
Only votes from OpenNLP PMC are binding, but everyone is welcome to
check the release candidate and vote.
The vote passes if at least three binding +1 votes are cast.
Please VOTE
[+1] go ship it
[+0] meh, don't care
[-1] stop, there is a ${showstopper}
Thanks!
Martin | mawiesne