Hi Guan, I think I've confused everyone a little bit, including myself. When I initially went down the rabbit hole of understanding the synchronization of these wrapping methods, I kept an eye out for all potential thread-safety issues within open-nlp, and I ended up finding issues unrelated to the synchronized methods at hand. Most notably, open-nlp does unsafe member initialization in a couple of places within shared factories such as POSTaggerFactory, which I described in more detail in the linked PR. These unsafe methods actually get called in parallel from lucene's FilterFactory::create. For now I've simply short-circuited these factories in my application, and I am still deciding what to do long term.
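As a rough illustration of what I mean by unsafe member initialization, here is a minimal Java sketch. The names Dictionary, UnsafeFactory, SafeFactory and LazyInitDemo are hypothetical stand-ins, not open-nlp's or lucene's actual classes, and the safe variant uses double-checked locking on a volatile field, which is only one of several possible fixes.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Hypothetical stand-in for an expensive shared resource (e.g. a tag dictionary).
final class Dictionary {
}

// Racy: check-then-act on a plain field. Two threads can both observe null
// and each build their own instance, and publication is not guaranteed safe.
final class UnsafeFactory {
    private Dictionary dictionary; // neither volatile nor guarded by a lock

    Dictionary getDictionary() {
        if (dictionary == null) {
            dictionary = new Dictionary();
        }
        return dictionary;
    }
}

// One possible fix: double-checked locking on a volatile field. The lock is
// only taken while the field is still unset, so steady-state calls do not block.
final class SafeFactory {
    private volatile Dictionary dictionary;

    Dictionary getDictionary() {
        Dictionary local = dictionary;
        if (local == null) {
            synchronized (this) {
                local = dictionary;
                if (local == null) {
                    local = new Dictionary();
                    dictionary = local;
                }
            }
        }
        return local;
    }
}

final class LazyInitDemo {
    // Calls source.get() from `threads` threads released at the same moment
    // and returns how many distinct Dictionary instances were handed out.
    static int distinctInstances(Supplier<Dictionary> source, int threads) {
        Set<Dictionary> seen = ConcurrentHashMap.newKeySet();
        CountDownLatch start = new CountDownLatch(1);
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < threads; i++) {
            pool.execute(() -> {
                try {
                    start.await();
                    seen.add(source.get());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }
        start.countDown();
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("workers did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.size();
    }
}
```

With this sketch, LazyInitDemo.distinctInstances(new SafeFactory()::getDictionary, 8) should always report a single instance, whereas the unsafe variant may occasionally hand out more than one under contention.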
Luke

On 2022/11/21 20:12:34 "Wang, Guan" wrote:
> Hi Luke,
>
> For what you've described as a "bug" for NLPPOSTaggerOp, I do agree with you that there could be a more elegant solution than simply synchronizing the entire method. That being said, IMHO, I don't see that there is a thread-safety issue. Lucene TokenFilters are not supposed to be shared among threads. They can be re-used among threads though.
>
> On the other hand, NLP tasks (stemming, for example) are slow. If you have to put NLP processing inside the analysis chain, you may have to give up certain NLP capabilities...
>
> My 2 cents,
>
> Guan
>
> -----Original Message-----
> From: Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <lk...@bloomberg.net>
> Sent: Saturday, November 19, 2022 10:27 PM
> To: java-user@lucene.apache.org
> Subject: Integrating NLP into Lucene Analysis Chain
>
> Greetings,
> I would greatly appreciate anyone sharing their experience doing NLP/lemmatization and am also very curious to gauge the opinion of the lucene community regarding open-nlp. I know there are a few other libraries out there, some of which can’t be directly included in the lucene project because of licensing issues. If anyone has any suggestions/experiences, please do share them :-) As a side note I’ll add that I’ve been experimenting with open-nlp’s PoS/lemmatization capabilities via lucene’s integration. During the process I uncovered some issues which made me question whether open-nlp is the right tool for the job. The first issue was a “low-hanging bug”, which would most likely have been addressed sooner if this solution were popular; this simple bug was at least 5 years old: https://github.com/apache/lucene/issues/11771
>
> The second issue has more to do with the open-nlp library itself. It is not thread-safe in some very unexpected ways. Looking at the library internals reveals unsynchronized lazy initialization of shared components. Unfortunately the lucene integration kind of sweeps this under the rug by wrapping everything in a pretty big synchronized block; here is an example: https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36
> This itself is problematic because these functions run in really tight loops and probably shouldn’t be blocking. Even if one did decide to do blocking initialization, it could still be done at a much lower level than it is currently. From what I gather, the functions that are synchronized at the lucene level could be made thread-safe in a much more performant way if they were fixed in open-nlp. But I am also starting to doubt whether this is worth pursuing, since I don't know whether anyone would find this useful, hence the original inquiry.
> I’ll add that I have separately used the open-nlp sentence break iterator (which suffers from the same problem: https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39 ) at production scale and discovered really bad performance under certain conditions, which I attribute to this unnecessary synchronization. I suspect this may have impacted others as well: https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
> Many thanks,
> Luke Kot-Zaniewski