Greetings,
I would greatly appreciate anyone sharing their experience doing 
NLP/lemmatization and am also very curious to gauge the opinion of the lucene 
community regarding open-nlp. I know there are a few other libraries out there, 
some of which can’t be directly included in the lucene project because of 
licensing issues. If anyone has any suggestions/experiences, please do share 
them :-)
As a side note I’ll add that I’ve been experimenting with open-nlp’s 
PoS/lemmatization capabilities via lucene’s integration. During the process I 
uncovered some issues which made me question whether open-nlp is the right tool 
for the job. The first issue was a “low-hanging bug” that would most likely 
have been addressed sooner if this solution were popular; this simple bug was 
at least 5 years old: https://github.com/apache/lucene/issues/11771

The second issue has more to do with the open-nlp library itself. It is not 
thread-safe in some very unexpected ways. Looking at the library internals 
reveals unsynchronized lazy initialization of shared components. Unfortunately 
the lucene integration kind of sweeps this under the rug by wrapping everything 
in a pretty big synchronized block; here is an example: 
https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36
This in itself is problematic because these functions run in really tight loops 
and probably shouldn’t be blocking. Even if one did decide to do blocking 
initialization, it could still be done at a much lower level than it is now. From 
what I gather, the functions that are synchronized at the lucene-level could be 
made thread-safe in a much more performant way if they were fixed in open-nlp. 
But I am also starting to doubt if this is worth pursuing since I don't know 
whether anyone would find this useful, hence the original inquiry.
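To illustrate what I mean by doing the blocking at a lower level, here is a 
minimal Java sketch. All class and method names in it (ExpensiveTable, 
SketchTagger, LazyInitSketch) are hypothetical stand-ins I made up for this 
email; they are not open-nlp or lucene APIs, and this is not how either 
project is actually implemented. The sketch only assumes that the shared 
component is safe to read concurrently once it has been built. The idea is 
that the lock is taken only while the lazily created component is being 
built, so the hot path stays lock-free afterwards.

```java
import java.util.Locale;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for an expensive, lazily built shared component.
// Assumed to be immutable (and so safe to read concurrently) once constructed.
final class ExpensiveTable {
    static final AtomicInteger BUILDS = new AtomicInteger();

    ExpensiveTable() {
        BUILDS.incrementAndGet(); // count constructions so the demo can check them
    }

    String lookup(String token) {
        return token.toLowerCase(Locale.ROOT);
    }
}

// Hypothetical tagger-like class; the locking lives here, not in the caller.
final class SketchTagger {
    // volatile is required for double-checked locking to be safe.
    private volatile ExpensiveTable table;

    private ExpensiveTable table() {
        ExpensiveTable t = table;       // lock-free read on the hot path
        if (t == null) {
            synchronized (this) {       // contended only during first use
                t = table;
                if (t == null) {
                    t = new ExpensiveTable();
                    table = t;
                }
            }
        }
        return t;
    }

    String tag(String token) {
        return table().lookup(token);   // no lock once the table is built
    }
}

final class LazyInitSketch {
    // Hammers one shared tagger from several threads and reports how many
    // times the expensive component was constructed.
    static int buildsAfterConcurrentUse(int threads) throws InterruptedException {
        ExpensiveTable.BUILDS.set(0);
        SketchTagger tagger = new SketchTagger();
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        CountDownLatch start = new CountDownLatch(1);
        for (int i = 0; i < threads; i++) {
            pool.submit(() -> {
                try {
                    start.await();      // release all workers at the same time
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
                for (int j = 0; j < 1000; j++) {
                    tagger.tag("Token");
                }
            });
        }
        start.countDown();
        pool.shutdown();
        if (!pool.awaitTermination(30, TimeUnit.SECONDS)) {
            throw new IllegalStateException("workers did not finish in time");
        }
        return ExpensiveTable.BUILDS.get();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("builds = " + buildsAfterConcurrentUse(8));
    }
}
```

With this shape the callers need no synchronized wrapper of their own. 
Whether the same approach is applicable inside open-nlp depends on which of 
its components hold mutable per-call state, which I have not fully audited.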
I’ll add that I have separately used the open-nlp sentence break iterator 
(which suffers from the same problem: 
https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39
) at production scale and discovered really bad performance under certain 
conditions, which I attribute to this unnecessary synchronization. I suspect 
this may have impacted others as well: 
https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
Many thanks,
Luke Kot-Zaniewski
