One more idea: it's possible to ask Solr for the essential tokenization via the /analysis/field API (here's a clue: https://stackoverflow.com/a/37785401), get the token stream back in a structured response, and pass it into the NLP pipeline for enrichment.
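A rough sketch of pulling the final tokens out of such a response. The request would be something like GET /solr/<collection>/analysis/field?analysis.fieldtype=text_general&analysis.fieldvalue=Hello (with a JSON response); the hand-written sample below is abridged and only assumes the general response shape, so verify it against a real response from your Solr version:

```python
import json

def final_tokens(analysis_response, field_type):
    """Return the fully analyzed token stream for a field type.

    The "index" entry alternates analyzer-stage class names with the
    token list each stage emitted; the last list is the final stream.
    """
    stages = analysis_response["analysis"]["field_types"][field_type]["index"]
    return stages[-1]  # token dicts: text, start/end offsets, position, ...

# Abridged, hand-written sample of a /analysis/field JSON response.
sample = {
    "analysis": {"field_types": {"text_general": {"index": [
        "org.apache.lucene.analysis.standard.StandardTokenizer",
        [{"text": "Hello", "start": 0, "end": 5, "position": 1}],
        "org.apache.lucene.analysis.core.LowerCaseFilter",
        [{"text": "hello", "start": 0, "end": 5, "position": 1}],
    ]}}}
}

for tok in final_tokens(sample, "text_general"):
    print(tok["text"], tok["start"], tok["end"])  # -> hello 0 5
```

The start/end offsets carried by each token are what would let the external NLP enrichment be tied back to positions in the original text.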
On Wed, Feb 22, 2023 at 5:26 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
<lkotzanie...@bloomberg.net> wrote:

> Hi Mikhail,
>
> Thanks for the quick reply and the suggestion. This is definitely good to
> know about. In my case, however, there are several such NLP/data
> extraction systems and I am not sure they all use the same tokenization,
> but I will give this another look. I can see how this is a more
> well-defined solution to the problem I presented. I realize that with
> offsets you would have to make assumptions when offset boundaries fall in
> the middle of a token, and in other such odd cases.
>
> Thanks again,
> Luke
>
> From: java-user@lucene.apache.org At: 02/22/23 02:38:30 UTC-5:00
> To: java-user@lucene.apache.org
> Subject: Re: Offset-Based Analysis
>
> Hello Luke.
>
> Using offsets seems really doubtful to me. What comes to my mind is the
> pre-analyzed field:
> https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type
> Thus, an external NLP service can provide ready-made tokens for
> straightforward indexing by Solr. That external NLP service will have
> full power to inject or suppress synonyms depending on the context, and
> to supply additional attributes in payloads (whether it's boldness,
> negative/positive stress, etc.) for retrieval of these payloads later.
>
> On Wed, Feb 22, 2023 at 6:43 AM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A)
> <lkotzanie...@bloomberg.net> wrote:
>
> > Hi All,
> >
> > I am trying to enrich a Lucene-powered search index with data from
> > various different NLP systems that are distributed throughout my
> > company. Ideally this internally-derived data could be tied back to
> > specific positions in the original text.
> > I’ve searched around, and this is the closest thing I’ve found to what
> > I am trying to do:
> > https://jorgelbg.me/2018/03/solr-contextual-synonyms-with-payloads/
> > This approach is neat, but it has a few drawbacks because of its
> > reliance on injected delimiters and its somewhat inflexible passing of
> > data from PayloadAttribute to CharTermAttribute.
> >
> > One thing that occurred to me is to use an offset-based approach, of
> > course assuming the input text is already properly encoded and
> > sanitized. I’m thinking about implementing a CharFilter that decodes
> > some special header, which itself passes along an offset-sorted list of
> > data for enrichment. This metadata could be referenced during analysis
> > via custom attributes and ideally could handle a variety of use cases
> > with the same offset-accounting logic. Some uses that come to mind are
> > stashing values in term/payload attributes, or even offset-based
> > tokenization for those wishing to tokenize outside of their search
> > engine.
> >
> > Does this approach even make any sense, or have any pitfalls I am
> > failing to see? Assuming it makes sense, does a similar solution
> > already exist? If it doesn’t exist yet, would it be something of
> > interest to the community? Any thoughts on this would be much
> > appreciated.
> >
> > Thanks,
> > Luke
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
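For completeness, a minimal sketch of packaging externally produced tokens into the JSON that the pre-analyzed field suggested in the thread expects. The key names (t/s/e/i for term, offsets, and position increment; p for a base64 payload) follow the Solr reference guide's PreAnalyzedField JSON format, but verify them against your Solr version before relying on this:

```python
import base64
import json

def preanalyzed_field_value(stored_text, tokens):
    """Serialize tokens into PreAnalyzedField JSON (format version "1").

    Each input token: {"term", "start", "end", optional "pos_inc",
    optional "payload" (bytes)}. Payloads are base64-encoded under "p".
    """
    out = []
    for tok in tokens:
        entry = {"t": tok["term"], "s": tok["start"], "e": tok["end"],
                 "i": tok.get("pos_inc", 1)}
        if tok.get("payload"):
            entry["p"] = base64.b64encode(tok["payload"]).decode("ascii")
        out.append(entry)
    return json.dumps({"v": "1", "str": stored_text, "tokens": out})

# Hypothetical enrichment: an external NLP system tags "hello" with a
# part-of-speech payload that can be retrieved again at query time.
value = preanalyzed_field_value(
    "Hello world",
    [{"term": "hello", "start": 0, "end": 5, "payload": b"POS:UH"},
     {"term": "world", "start": 6, "end": 11}],
)
```

The resulting string would be sent as the field's value in a normal update request; Solr indexes the supplied tokens as-is, which is what lets the external service control synonyms, payloads, and tokenization outright.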