Hi Mikhail,

Thanks for the quick reply and the suggestion. This is definitely good to know about. In my case, however, there are several such NLP/data-extraction systems, and I am not sure they all use the same tokenization, but I will give this another look. I can see how this is a more well-defined solution to the problem I presented. I realize that with offsets you would have to make assumptions when offset boundaries fall in the middle of a token, and handle other such odd cases.
Thanks again,
Luke

From: java-user@lucene.apache.org At: 02/22/23 02:38:30 UTC-5:00
To: java-user@lucene.apache.org
Subject: Re: Offset-Based Analysis

Hello Luke.
Using offsets seems really doubtful to me. What comes to mind is the pre-analyzed field:
https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type
Thus, an external NLP service can provide ready-made tokens for straightforward indexing by Solr. That external NLP service would have full power to inject or suppress synonyms depending on the context, and to supply additional attributes in payloads (whether it's boldness, negative/positive stress, etc.) for retrieval of those payloads later.

On Wed, Feb 22, 2023 at 6:43 AM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <lkotzanie...@bloomberg.net> wrote:

> Hi All,
>
> I am trying to enrich a Lucene-powered search index with data from several
> different NLP systems that are distributed throughout my company. Ideally
> this internally derived data could be tied back to specific positions in
> the original text. I've searched around, and this is the closest thing I've
> found to what I am trying to do:
> https://jorgelbg.me/2018/03/solr-contextual-synonyms-with-payloads/
> This approach is neat, but it has a few drawbacks because of its reliance
> on injected delimiters and its somewhat inflexible passing of data from
> PayloadAttribute to CharTermAttribute.
>
> One thing that occurred to me is to use an offset-based approach, of
> course assuming the input text is already properly encoded and sanitized.
> I'm thinking about implementing a CharFilter that decodes some special
> header, which passes along an offset-sorted list of data for enrichment.
> This metadata could be referenced during analysis via custom attributes
> and ideally could handle a variety of use cases with the same
> offset-accounting logic. Some uses that come to mind are stashing values
> in term/payload attributes, or even offset-based tokenization for those
> wishing to tokenize outside of their search engine.
>
> Does this approach even make sense, or does it have pitfalls I am failing
> to see? Assuming it makes sense, does a similar solution already exist?
> If it doesn't exist yet, would it be something of interest to the
> community? Any thoughts on this would be much appreciated.
>
> Thanks,
> Luke

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
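[Editor's note: for readers skimming the thread, the PreAnalyzedField JSON syntax Mikhail links to looks roughly like the sketch below. This is an illustration from memory, not taken from the thread; verify the exact field names against the linked Solr guide. Here `t` is the term text, `s`/`e` are start/end character offsets, `i` is the position increment, `p` is a base64-encoded payload, and `y` is the token type. The sentence and payload values are invented for the example.]

```json
{
  "v": "1",
  "str": "John liked the movie",
  "tokens": [
    {"t": "john",  "s": 0,  "e": 4,  "i": 1, "p": "cGVyc29u",     "y": "word"},
    {"t": "liked", "s": 5,  "e": 10, "i": 1, "p": "cG9zaXRpdmU=", "y": "word"},
    {"t": "movie", "s": 15, "e": 20, "i": 2, "y": "word"}
  ]
}
```

This is what makes the approach "well-defined": the external NLP system emits the tokens themselves (with offsets, increments, and payloads), so Solr does no re-analysis and no tokenization mismatch can occur.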
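[Editor's note: the "offset-accounting logic" Luke proposes can be sketched as below. This is a hypothetical illustration, not code from the thread: it keeps only the core matching step, free of Lucene dependencies. In a real Lucene `TokenFilter`, the token span would come from `OffsetAttribute` and the matched value would be written to `PayloadAttribute`; here the annotation list stands in for the offset-sorted enrichment data Luke's header-decoding `CharFilter` would supply.]

```java
import java.util.Iterator;
import java.util.List;

/**
 * Matches token spans against externally supplied annotations, both sorted
 * by start offset. Tokens must be queried in increasing-offset order, as a
 * TokenStream would produce them. A token straddling an annotation boundary
 * (one of the "odd cases" discussed above) is treated as unannotated.
 */
final class AnnotationCursor {
    record Annotation(int start, int end, String payload) {}

    private final Iterator<Annotation> it;
    private Annotation current;

    AnnotationCursor(List<Annotation> sortedByStart) {
        this.it = sortedByStart.iterator();
        this.current = it.hasNext() ? it.next() : null;
    }

    /** Returns the payload of the annotation enclosing [start, end), or null. */
    String payloadFor(int start, int end) {
        // Skip annotations that end before this token begins.
        while (current != null && current.end() <= start) {
            current = it.hasNext() ? it.next() : null;
        }
        if (current != null && start >= current.start() && end <= current.end()) {
            return current.payload();
        }
        return null; // token is unannotated or straddles a boundary
    }
}
```

The same cursor could back several of the use cases Luke lists: the returned value could populate a payload, rewrite the term, or drive offset-based token boundaries.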