rzo1 commented on PR #568: URL: https://github.com/apache/opennlp/pull/568#issuecomment-1859746321
Hi all,

I added related interner / dedup implementations inspired by and based on the code from Aleksey that @mawiesne provided.

I added the following interner impls:

- `CHMStringInterner` (as provided by Aleksey), thread-safe
- `CHMStringDeduplicator` (as provided by Aleksey in his talk), thread-safe. It relaxes the canonical requirement of interning and is more or less a probabilistic deduplication.
- `HMStringInterner` (as provided by Aleksey), not thread-safe
- `JvmStringInterner`: relies on `String.intern()` and can be used to get the previous OpenNLP behaviour ;)
- `NoOpStringInterner`: doesn't actually intern/dedup Strings

The implementation of the static `StringInterners` is based on how Hadoop is doing it [1].

The default interner used in OpenNLP is now `CHMStringInterner`.

I added a system property, `opennlp.interner.class`, which can be used to specify the interner implementation used at runtime:

- If people want the old behaviour back: they can.
- If people do not want interning at all: they can.
- If people want probabilistic dedup: they can.

An updated benchmark of the different impls is currently running, as well as a full eval build with the default. I will update this PR with the JMH results in a few hours.

- [1] https://github.com/c9n/hadoop/blob/master/hadoop-common-project%2Fhadoop-common%2Fsrc%2Fmain%2Fjava%2Forg%2Fapache%2Fhadoop%2Futil%2FStringInterner.java
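For readers who have not looked at the diff, the following is a minimal sketch of the idea, not the code in this PR. Only the property name `opennlp.interner.class` comes from the comment above. The interface, the class names (`InternerSketch`, `ChmInterner`, `ChmDeduplicator`, `JvmInterner`, `NoOpInterner`), and the `create` / `fromSystemProperty` helpers are hypothetical, and the actual implementations may differ in detail.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch only; every name in this class is an assumption.
final class InternerSketch {

  /** Hypothetical minimal contract for an interner. */
  interface StringInterner {
    String intern(String sample);
  }

  /** Thread-safe and canonical: equal strings always map to one instance. */
  static final class ChmInterner implements StringInterner {
    private final ConcurrentMap<String, String> map = new ConcurrentHashMap<>();

    @Override
    public String intern(String sample) {
      String existing = map.putIfAbsent(sample, sample);
      return existing != null ? existing : sample;
    }
  }

  /**
   * Thread-safe but not canonical: the map is only consulted with the given
   * probability, so some duplicates survive in exchange for fewer lookups.
   */
  static final class ChmDeduplicator implements StringInterner {
    private final ConcurrentMap<String, String> map = new ConcurrentHashMap<>();
    private final double probability;

    ChmDeduplicator(double probability) {
      this.probability = probability;
    }

    @Override
    public String intern(String sample) {
      // nextDouble() is in [0, 1): 1.0 always dedups, 0.0 never does.
      if (ThreadLocalRandom.current().nextDouble() >= probability) {
        return sample;
      }
      String existing = map.putIfAbsent(sample, sample);
      return existing != null ? existing : sample;
    }
  }

  /** Delegates to the JVM string table via String.intern(). */
  static final class JvmInterner implements StringInterner {
    @Override
    public String intern(String sample) {
      return sample.intern();
    }
  }

  /** Returns the argument unchanged. */
  static final class NoOpInterner implements StringInterner {
    @Override
    public String intern(String sample) {
      return sample;
    }
  }

  /** Instantiates the named class, or the CHM default if no name is given. */
  static StringInterner create(String className) {
    if (className == null || className.isEmpty()) {
      return new ChmInterner();
    }
    try {
      return Class.forName(className)
          .asSubclass(StringInterner.class)
          .getDeclaredConstructor()
          .newInstance();
    } catch (ReflectiveOperationException | ClassCastException e) {
      throw new IllegalArgumentException("Cannot load interner: " + className, e);
    }
  }

  /** Reads the property named in the comment above. */
  static StringInterner fromSystemProperty() {
    return create(System.getProperty("opennlp.interner.class"));
  }
}
```

In this sketch, starting the JVM with `-Dopennlp.interner.class=InternerSketch$NoOpInterner` would make `fromSystemProperty()` return the no-op variant, while leaving the property unset falls back to the map-based default. The reflective lookup requires a no-argument constructor, so `ChmDeduplicator` as written here could not be selected this way.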