https://github.com/apache/lucene/pull/11955
On Sat, Nov 19, 2022 at 10:43 PM Robert Muir <rcm...@gmail.com> wrote:
>
> Hi,
>
> Is this 'synchronized' really needed?
>
> 1. Lucene tokenstreams are only used by a single thread. If you index
> with 10 threads, 10 tokenstreams are used.
> 2. These OpenNLP factories make a new *Op for each tokenstream that
> they create, so there's no thread hazard.
> 3. If I remove the 'synchronized' keyword everywhere in the opennlp module
> (NLPChunkerOp, NLPNERTaggerOp, NLPPOSTaggerOp, NLPSentenceDetectorOp,
> NLPTokenizerOp), all the tests still pass.
>
> On Sat, Nov 19, 2022 at 10:26 PM Luke Kot-Zaniewski (BLOOMBERG/ 919
> 3RD A) <lkotzanie...@bloomberg.net> wrote:
> >
> > Greetings,
> > I would greatly appreciate anyone sharing their experience doing
> > NLP/lemmatization, and I am also very curious to gauge the opinion of
> > the Lucene community regarding OpenNLP. I know there are a few other
> > libraries out there, some of which can't be included directly in the
> > Lucene project because of licensing issues. If anyone has any
> > suggestions or experiences, please do share them :-)
> >
> > As a side note, I'll add that I've been experimenting with OpenNLP's
> > PoS/lemmatization capabilities via Lucene's integration. During that
> > process I uncovered some issues which made me question whether OpenNLP
> > is the right tool for the job. The first issue was a "low-hanging" bug,
> > at least five years old, which would most likely have been addressed
> > sooner if this integration were more widely used ->
> > https://github.com/apache/lucene/issues/11771
> >
> > The second issue has more to do with the OpenNLP library itself. It is
> > not thread-safe in some very unexpected ways: looking at the library
> > internals reveals unsynchronized lazy initialization of shared
> > components. Unfortunately the Lucene integration sweeps this under the
> > rug by wrapping everything in a fairly large synchronized block; here
> > is an example:
> > https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36
> > This is problematic in itself because these functions run in very tight
> > loops and probably shouldn't block. Even if one did decide to do
> > blocking initialization, it could be done at a much lower level than it
> > is now. From what I gather, the functions that are synchronized at the
> > Lucene level could be made thread-safe in a much more performant way if
> > they were fixed in OpenNLP itself. But I am also starting to doubt
> > whether this is worth pursuing, since I don't know whether anyone would
> > find it useful, hence the original inquiry.
> >
> > I'll add that I have separately used the OpenNLP sentence break iterator
> > (which suffers from the same problem:
> > https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39
> > ) at production scale and saw very poor performance under certain
> > conditions, which I attribute to this unnecessary synchronization. I
> > suspect this may have affected others as well:
> > https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
> >
> > Many thanks,
> > Luke Kot-Zaniewski
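For anyone skimming the thread, here is a minimal sketch of the wrapper pattern being discussed. It is paraphrased, not copied from the Lucene source; the class and method names only loosely follow the NLPPOSTaggerOp linked above, and the comments summarize the argument made in the thread rather than anything guaranteed by the code itself.

    import opennlp.tools.postag.POSModel;
    import opennlp.tools.postag.POSTaggerME;

    // Illustrative stand-in for the linked NLPPOSTaggerOp (paraphrased).
    class POSTaggerOpSketch {
      // POSTaggerME itself is not thread-safe, but see the note on getPOSTags.
      private final POSTaggerME tagger;

      POSTaggerOpSketch(POSModel model) {
        this.tagger = new POSTaggerME(model);
      }

      // The Lucene wrapper marks this method 'synchronized'. Because the
      // factory creates a new *Op per tokenstream, and a tokenstream is only
      // used by a single thread, the lock guards state that is never shared
      // across threads -- which is the basis for removing it in the PR linked
      // at the top of this message.
      synchronized String[] getPOSTags(String[] words) {
        return tagger.tag(words);
      }
    }

Under that reading, the remaining thread-safety concern is the unsynchronized lazy initialization inside OpenNLP's own shared components that Luke describes, which a per-tokenstream lock at the Lucene level only papers over.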