Hi Mikhail,

Thanks for the quick reply and the suggestion. This is definitely good to know about. In my case, however, there are several such NLP/data-extraction systems, and I am not sure they all use the same tokenization, but I will give this another look. I can see how this is a more well-defined solution to the problem I presented. I realize that with offsets you would have to make assumptions when offset boundaries fall in the middle of a token, and handle other such odd cases.
Thanks again,
Luke

From: java-user@lucene.apache.org At: 02/22/23 02:38:30 UTC-5:00
To: java-user@lucene.apache.org
Subject: Re: Offset-Based Analysis

Hello Luke.
Using offsets seems really doubtful to me. What comes to mind is the pre-analyzed field:
https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type
Thus, an external NLP service can provide ready-made tokens for straightforward indexing by Solr. That external NLP service would have full power to inject or suppress synonyms depending on the context, and to supply additional attributes in payloads (whether it's boldness, negative/positive stress, etc.) for retrieval of those payloads later.

On Wed, Feb 22, 2023 at 6:43 AM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <lkotzanie...@bloomberg.net> wrote:

> Hi All,
>
> I am trying to enrich a Lucene-powered search index with data from several
> different NLP systems that are distributed throughout my company. Ideally
> this internally derived data could be tied back to specific positions in
> the original text. I've searched around, and this is the closest thing I've
> found to what I am trying to do:
> https://jorgelbg.me/2018/03/solr-contextual-synonyms-with-payloads/
> This approach is neat, but it has a few drawbacks because of its reliance
> on injected delimiters and its somewhat inflexible passing of data from
> PayloadAttribute to CharTermAttribute.
>
> One thing that occurred to me is to use an offset-based approach, of
> course assuming the input text is already properly encoded and sanitized.
> I'm thinking about implementing a CharFilter that decodes some special
> header, which passes along an offset-sorted list of data for enrichment.
> This metadata could be referenced during analysis via custom attributes
> and ideally could handle a variety of use cases with the same
> offset-accounting logic. Some uses that come to mind are stashing values
> in term/payload attributes, or even offset-based tokenization for those
> wishing to tokenize outside of their search engine.
>
> Does this approach even make sense, or does it have pitfalls I am failing
> to see? Assuming it makes sense, does a similar solution already exist?
> If it doesn't exist yet, would it be something of interest to the
> community? Any thoughts on this would be much appreciated.
>
> Thanks,
> Luke

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
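[Editor's note: for readers skimming the thread, the PreAnalyzedField JSON syntax Mikhail links to looks roughly like the sketch below. This is an illustration from memory, not taken from the thread; verify the exact field names against the linked Solr guide. Here `t` is the term text, `s`/`e` are start/end character offsets, `i` is the position increment, `p` is a base64-encoded payload, and `y` is the token type. The sentence and payload values are invented for the example.]

```json
{
  "v": "1",
  "str": "John liked the movie",
  "tokens": [
    {"t": "john",  "s": 0,  "e": 4,  "i": 1, "p": "cGVyc29u",     "y": "word"},
    {"t": "liked", "s": 5,  "e": 10, "i": 1, "p": "cG9zaXRpdmU=", "y": "word"},
    {"t": "movie", "s": 15, "e": 20, "i": 2, "y": "word"}
  ]
}
```

This is what makes the approach "well-defined": the external NLP system emits the tokens themselves (with offsets, increments, and payloads), so Solr does no re-analysis and no tokenization mismatch can occur.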
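[Editor's note: the "offset-accounting logic" Luke proposes can be sketched as below. This is a hypothetical illustration, not code from the thread: it keeps only the core matching step, free of Lucene dependencies. In a real Lucene `TokenFilter`, the token span would come from `OffsetAttribute` and the matched value would be written to `PayloadAttribute`; here the annotation list stands in for the offset-sorted enrichment data Luke's header-decoding `CharFilter` would supply.]

```java
import java.util.Iterator;
import java.util.List;

/**
 * Matches token spans against externally supplied annotations, both sorted
 * by start offset. Tokens must be queried in increasing-offset order, as a
 * TokenStream would produce them. A token straddling an annotation boundary
 * (one of the "odd cases" discussed above) is treated as unannotated.
 */
final class AnnotationCursor {
    record Annotation(int start, int end, String payload) {}

    private final Iterator<Annotation> it;
    private Annotation current;

    AnnotationCursor(List<Annotation> sortedByStart) {
        this.it = sortedByStart.iterator();
        this.current = it.hasNext() ? it.next() : null;
    }

    /** Returns the payload of the annotation enclosing [start, end), or null. */
    String payloadFor(int start, int end) {
        // Skip annotations that end before this token begins.
        while (current != null && current.end() <= start) {
            current = it.hasNext() ? it.next() : null;
        }
        if (current != null && start >= current.start() && end <= current.end()) {
            return current.payload();
        }
        return null; // token is unannotated or straddles a boundary
    }
}
```

The same cursor could back several of the use cases Luke lists: the returned value could populate a payload, rewrite the term, or drive offset-based token boundaries.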