Hello, Benoit. I just came across https://lucene.apache.org/core/8_0_0/analyzers-common/org/apache/lucene/analysis/miscellaneous/TypeAsSynonymFilterFactory.html
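It lets whatever fills the TypeAttribute upstream (a POS or lemma filter, for example) re-emit the type as an extra token at the same position, so annotations become plain searchable terms. Roughly like this (just an untested sketch; the whitespace tokenizer and the "pos:" prefix are only placeholders):

  import org.apache.lucene.analysis.Analyzer;
  import org.apache.lucene.analysis.TokenStream;
  import org.apache.lucene.analysis.Tokenizer;
  import org.apache.lucene.analysis.core.WhitespaceTokenizer;
  import org.apache.lucene.analysis.miscellaneous.TypeAsSynonymFilter;

  Analyzer analyzer = new Analyzer() {
    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
      Tokenizer source = new WhitespaceTokenizer();
      // ... your filter that writes the annotation into TypeAttribute goes here ...
      TokenStream sink = new TypeAsSynonymFilter(source, "pos:");
      // each token's type is emitted again as a token at the same position
      // (position increment 0, i.e. as a synonym), prefixed with "pos:"
      return new TokenStreamComponents(source, sink);
    }
  };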
It sounds similar to what you are asking for, but it watches the TypeAttribute only.
Also, spans have been superseded by intervals:
https://lucene.apache.org/core/8_2_0/queries/org/apache/lucene/queries/intervals/IntervalQuery.html
It's better to build on them (tiny Intervals sketch at the bottom of this mail).

On Mon, Nov 21, 2022 at 9:20 PM Benoit Mercier <benoit.merc...@gmail.com> wrote:

> Hi Luke,
>
> Thank you for your work and information sharing. From my point of view,
> lemmatization is just one use case of text token annotation. I have been
> working with Lucene since 2006 to index lexicographic and linguistic data,
> and I always miss the fact that (1) token attributes are not searchable
> and (2) it is not straightforward to get all text tokens indexed at the
> same position (synonyms) directly from a span query (ideas and suggestions
> are welcome!). I think the NLP community might be grateful if Lucene could
> offer a simple way to search on token annotations (attributes). The MTAS
> project achieves that (https://github.com/textexploration/mtas), based on
> Lucene, and supports the CQL query language
> (https://meertensinstituut.github.io/mtas/search_cql.html). MTAS is an
> inspiring project I came across recently and from which you might get
> inspiration too. But I am currently hesitant to use it because I have no
> guarantee that the authors will port their code to new Lucene versions.
> I might come up with my own solution, but without (2) I can't yet see how
> I could achieve it simply without redoing what MTAS did!
>
> Thank you.
>
> Benoit
>
> On 2022-11-19 at 22:26, Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) wrote:
>
> Greetings,
>
> I would greatly appreciate anyone sharing their experience doing
> NLP/lemmatization and am also very curious to gauge the opinion of the
> Lucene community regarding OpenNLP. I know there are a few other libraries
> out there, some of which can't be directly included in the Lucene project
> because of licensing issues. If anyone has any suggestions/experiences,
> please do share them :-)
>
> As a side note, I'll add that I've been experimenting with OpenNLP's
> PoS/lemmatization capabilities via Lucene's integration. During the process
> I uncovered some issues which made me question whether OpenNLP is the right
> tool for the job. The first issue was a "low-hanging bug" which would most
> likely have been addressed sooner if this solution were popular; this
> simple bug was at least 5 years old ->
> https://github.com/apache/lucene/issues/11771
>
> The second issue has more to do with the OpenNLP library itself. It is not
> thread-safe in some very unexpected ways. Looking at the library internals
> reveals unsynchronized lazy initialization of shared components.
> Unfortunately, the Lucene integration sweeps this under the rug by wrapping
> everything in a pretty big synchronized block; here is an example:
> https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPPOSTaggerOp.java#L36
> This itself is problematic because these functions run in really tight
> loops and probably shouldn't be blocking. Even if one did decide to do
> blocking initialization, it could still be done at a much lower level than
> it is currently. From what I gather, the functions that are synchronized at
> the Lucene level could be made thread-safe in a much more performant way if
> they were fixed in OpenNLP.
> But I am also starting to doubt whether this is worth pursuing, since I
> don't know if anyone would find it useful, hence the original inquiry.
>
> I'll add that I have separately used the OpenNLP sentence break iterator
> (which suffers from the same problem:
> https://github.com/apache/lucene/blob/main/lucene/analysis/opennlp/src/java/org/apache/lucene/analysis/opennlp/tools/NLPSentenceDetectorOp.java#L39 )
> at production scale and discovered really bad performance under certain
> conditions, which I attribute to this unnecessary synchronization. I
> suspect this may have impacted others as well:
> https://stackoverflow.com/questions/42960569/indexing-taking-long-time-when-using-opennlp-lemmatizer-with-solr
>
> Many thanks,
>
> Luke Kot-Zaniewski

--
Sincerely yours
Mikhail Khludnev
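P.S. Regarding intervals, here is the kind of query I have in mind (the field name "text", the "pos:NN" annotation token and the term "lemma" are just examples, assuming annotations were injected at the same position as the words they tag):

  import org.apache.lucene.queries.intervals.IntervalQuery;
  import org.apache.lucene.queries.intervals.Intervals;
  import org.apache.lucene.search.Query;

  // match documents where an injected "pos:NN" annotation token occurs
  // within 3 positions of the term "lemma", in any order
  Query q = new IntervalQuery("text",
      Intervals.maxgaps(3,
          Intervals.unordered(
              Intervals.term("pos:NN"),
              Intervals.term("lemma"))));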