Re: Integrating NLP into Lucene Analysis Chain

Robert Muir Sat, 19 Nov 2022 19:44:21 -0800

Hi,

Is this 'synchronized' really needed?


1. Lucene tokenstreams are only used by a single thread. If you index
with 10 threads, 10 tokenstreams are used.
2. These OpenNLP factories make a new *Op for each tokenstream that
they create, so there's no thread hazard.
3. If I remove the 'synchronized' keyword everywhere from the opennlp
module (NLPChunkerOp, NLPNERTaggerOp, NLPPOSTaggerOp,
NLPSentenceDetectorOp, NLPTokenizerOp), then all the tests pass.
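
To illustrate points 1 and 2, here is a minimal sketch of the confinement argument. It does not use Lucene or OpenNLP: UnsyncOp, newOp and runThreads are hypothetical stand-ins (not names from either library), and the sketch assumes each *Op instance really is reachable from only one tokenstream. It says nothing about state shared inside OpenNLP model objects themselves.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class PerStreamOpSketch {

    // Hypothetical stand-in for an NLP*Op wrapper: it keeps mutable scratch
    // state and has no 'synchronized' methods, so it is NOT thread-safe.
    static final class UnsyncOp {
        private final StringBuilder scratch = new StringBuilder();

        String[] tag(String[] words) {
            String[] out = new String[words.length];
            for (int i = 0; i < words.length; i++) {
                scratch.setLength(0);
                scratch.append(words[i]).append("/TAG");
                out[i] = scratch.toString();
            }
            return out;
        }
    }

    // Hypothetical stand-in for a factory that builds a fresh op per tokenstream.
    static UnsyncOp newOp() {
        return new UnsyncOp();
    }

    // Runs one op per worker (like one tokenstream per indexing thread) and
    // returns how many results came back corrupted.
    static int runThreads(int threads, int iterations) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<Integer>> results = new ArrayList<>();
            for (int t = 0; t < threads; t++) {
                results.add(pool.submit(() -> {
                    UnsyncOp op = newOp(); // confined to this task, never shared
                    int bad = 0;
                    for (int i = 0; i < iterations; i++) {
                        String[] tagged = op.tag(new String[] {"a" + i, "b" + i});
                        if (!tagged[0].equals("a" + i + "/TAG")
                                || !tagged[1].equals("b" + i + "/TAG")) {
                            bad++;
                        }
                    }
                    return bad;
                }));
            }
            int totalBad = 0;
            for (Future<Integer> f : results) {
                totalBad += f.get();
            }
            return totalBad;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println("corrupted results: " + runThreads(10, 10_000));
    }
}
```

Because every worker builds its own op, the unsynchronized scratch buffer is never touched by two threads, so locking inside the op would only add overhead. The argument breaks down if an op instance is ever shared across tokenstreams.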

On Sat, Nov 19, 2022 at 10:26 PM Luke Kot-Zaniewski (BLOOMBERG/ 919
3RD A) <lkotzanie...@bloomberg.net> wrote:
>
> Greetings,
> I would greatly appreciate anyone sharing their experience doing 
> NLP/lemmatization and am also very curious to gauge the opinion of the lucene 
> community regarding open-nlp. I know there are a few other libraries out 
> there, some of which can’t be directly included in the lucene project because 
> of licensing issues. If anyone has any suggestions/experiences, please do 
> share them :-)
> As a side note I’ll add that I’ve been experimenting with open-nlp’s 
> PoS/lemmatization capabilities via lucene’s integration. During the process I 
> uncovered some issues which made me question whether open-nlp is the right 
> tool for the job. The first issue was a “low-hanging bug”, which would have 
> most likely been addressed sooner if this solution were popular. This simple 
> bug was at least 5 years old: https://github.com/apache/lucene/issues/11771
>
> The second issue has more to do with the open-nlp library itself. It is not 
> thread-safe in some very unexpected ways. Looking at the library internals 
> reveals unsynchronized lazy initialization of shared components. 
> Unfortunately the lucene integration kind of sweeps this under the rug by 
> wrapping everything in a pretty big synchronized block. Here is an example: 
> https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36
> This itself is problematic because these functions run in really tight 
> loops and probably shouldn’t be blocking. Even if one did decide to do 
> blocking initialization, it can still be done at a much lower level than 
> currently. From what I gather, the functions that are synchronized at the 
> lucene-level could be made thread-safe in a much more performant way if they 
> were fixed in open-nlp. But I am also starting to doubt if this is worth 
> pursuing since I don't know whether anyone would find this useful, hence the 
> original inquiry.
> I’ll add that I have separately used the open-nlp sentence break iterator 
> (which suffers from the same problem, see 
> https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39
> ) at production scale and discovered really bad performance under certain 
> conditions, which I attribute to this unnecessary synchronization. I suspect 
> this may have impacted others as well: 
> https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
> Many thanks,
> Luke Kot-Zaniewski
>

