Hi Benoit,

Thanks for the reply and the link! My application is English-focused, so I have the benefit of working with a language that has little inflection. This, along with a few other reasons, pushed me towards an index-heavy approach, which avoids the complexities of synonyms with different position lengths (i.e. where you would need SynonymGraphFilter) and also simplifies query composition. Having said that, I found it relatively simple to create a custom filter that packs equal-length synonym tokens into the same position.
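
To make the packing idea concrete, here is a minimal sketch in plain Java rather than against the Lucene API. It is not my actual filter: the names (StackedSynonymsSketch, Token, stack) and the lookup map are made up for illustration, and it assumes every synonym is a single token. It only borrows Lucene's convention that a position increment of 0 places a token at the same position as the previous one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Plain-Java model of the idea; no Lucene dependency. All names are hypothetical.
final class StackedSynonymsSketch {

    // Minimal stand-in for a token: its term text plus a position increment.
    // As in Lucene, an increment of 0 means "same position as the previous token".
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    // Emits every input token unchanged, then stacks each of its single-token
    // synonyms on the same position by giving them an increment of 0.
    static List<Token> stack(List<Token> input, Map<String, List<String>> synonyms) {
        List<Token> output = new ArrayList<>();
        for (Token token : input) {
            output.add(token);
            for (String synonym : synonyms.getOrDefault(token.term, List.of())) {
                output.add(new Token(synonym, 0));
            }
        }
        return output;
    }

    public static void main(String[] args) {
        List<Token> input = List.of(new Token("mice", 1), new Token("ran", 1));
        Map<String, List<String>> lemmas =
                Map.of("mice", List.of("mouse"), "ran", List.of("run"));
        // prints [mice(+1), mouse(+0), ran(+1), run(+0)]
        System.out.println(stack(input, lemmas));
    }
}
```

In a real TokenFilter the same effect would come from calling setPositionIncrement(0) on the PositionIncrementAttribute of each injected token. Because every stacked token spans exactly one position, no graph bookkeeping is needed.
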

Luke

On 2022/11/21 18:19:56 Benoit Mercier wrote:
> Hi Luke,
>
> Thank you for your work and information sharing. From my point of view
> lemmatization is just a use case of text token annotation. I have been
> working with Lucene since 2006 to index lexicographic and linguistic
> data, and I have always missed two things: (1) searchable token
> attributes and (2) a straightforward way to get all text tokens
> indexed at the same position (synonyms) directly from a span query
> (ideas and suggestions are welcome!). I think that the NLP community
> might be grateful if Lucene could offer a simple way to search on token
> annotations (attributes). The MTAS project achieves that
> (https://github.com/textexploration/mtas), based on Lucene, and supports
> the CQL Query Language
> (https://meertensinstituut.github.io/mtas/search_cql.html). MTAS is an
> inspiring project I came across recently and from which you might get
> inspiration too. But I am currently hesitating to use it because I have
> no guarantee that the authors will port their code to support new
> Lucene versions. I might come up with my own solution, but without (2)
> I can't see yet how I could achieve it simply without redoing the same
> thing that MTAS did!
>
> Thank you.
>
> Benoit
>
> On 2022-11-19 at 22:26, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) wrote:
> > Greetings,
> >
> > I would greatly appreciate anyone sharing their experience doing NLP/lemmatization, and I am also very curious to gauge the opinion of the Lucene community regarding OpenNLP. I know there are a few other libraries out there, some of which can’t be directly included in the Lucene project because of licensing issues. If anyone has any suggestions/experiences, please do share them :-)
> >
> > As a side note, I’ll add that I’ve been experimenting with OpenNLP’s PoS/lemmatization capabilities via Lucene’s integration. During the process I uncovered some issues which made me question whether OpenNLP is the right tool for the job. The first issue was a “low-hanging bug” that would most likely have been addressed sooner if this solution were popular; this simple bug was at least 5 years old: https://github.com/apache/lucene/issues/11771
> >
> > The second issue has more to do with the OpenNLP library itself. It is not thread-safe in some very unexpected ways. Looking at the library internals reveals unsynchronized lazy initialization of shared components. Unfortunately, the Lucene integration kind of sweeps this under the rug by wrapping everything in a pretty big synchronized block; here is an example: https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36 . This itself is problematic because these functions run in really tight loops and probably shouldn’t be blocking. Even if one did decide to do blocking initialization, it could still be done at a much lower level than it is currently. From what I gather, the functions that are synchronized at the Lucene level could be made thread-safe in a much more performant way if they were fixed in OpenNLP. But I am also starting to doubt whether this is worth pursuing, since I don't know whether anyone would find this useful, hence the original inquiry.
> >
> > I’ll add that I have separately used the OpenNLP sentence break iterator (which suffers from the same problem: https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39 ) at production scale and discovered really bad performance under certain conditions, which I attribute to this unnecessary synchronization. I suspect this may have impacted others as well: https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
> >
> > Many thanks,
> >
> > Luke Kot-Zaniewski