Hi All,

I am trying to enrich a Lucene-powered search index with data from various NLP systems distributed throughout my company. Ideally this internally derived data could be tied back to specific positions in the original text. I've searched around, and the closest thing I've found to what I'm trying to do is https://jorgelbg.me/2018/03/solr-contextual-synonyms-with-payloads/ . This approach is neat, but it has a few drawbacks because of its reliance on injected delimiters and its somewhat inflexible passing of data from PayloadAttribute to CharTermAttribute.
One thing that occurred to me is an offset-based approach, assuming of course that the input text is already properly encoded and sanitized. I'm thinking of implementing a CharFilter that decodes a special header, which carries an offset-sorted list of data for enrichment. This metadata could then be referenced during analysis via custom attributes, and ideally the same offset-accounting logic could serve a variety of use cases. Some uses that come to mind: stashing values in term/payload attributes, or even offset-based tokenization for those who want to tokenize outside of their search engine.

Does this approach make sense, or does it have pitfalls I'm failing to see? Assuming it makes sense, does a similar solution already exist? If it doesn't exist yet, would it be something of interest to the community? Any thoughts on this would be much appreciated.

Thanks,
Luke
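To make the idea concrete, here is a minimal sketch of just the header-decoding piece. Everything here is hypothetical: the `#enrich:` wire format, the class names, and the Span holder are things I'm inventing for illustration, not an existing API. A real implementation would wrap this in a CharFilter subclass and implement correctOffset() so that offsets seen downstream map back to positions in the original (header-stripped) text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: decode an enrichment header prepended to the input.
// Assumed format (made up for illustration):
//   #enrich:<start>-<end>=<label>,<start>-<end>=<label>,...\n<original text>
public class EnrichmentHeaderParser {

    /** One offset-addressed piece of enrichment metadata. */
    public static final class Span {
        public final int start;   // inclusive char offset in the original text
        public final int end;     // exclusive char offset
        public final String label;
        public Span(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }
    }

    public static final String PREFIX = "#enrich:";

    /** Returns the text with the header line removed, untouched if absent. */
    public static String stripHeader(String input) {
        if (!input.startsWith(PREFIX)) return input;
        int nl = input.indexOf('\n');
        return nl < 0 ? "" : input.substring(nl + 1);
    }

    /** Decodes "start-end=LABEL" entries, sorted by start offset. */
    public static List<Span> parseSpans(String input) {
        List<Span> spans = new ArrayList<>();
        if (!input.startsWith(PREFIX)) return spans;
        int nl = input.indexOf('\n');
        String header =
            input.substring(PREFIX.length(), nl < 0 ? input.length() : nl);
        for (String entry : header.split(",")) {
            String[] kv = entry.split("=", 2);
            String[] range = kv[0].split("-", 2);
            spans.add(new Span(Integer.parseInt(range[0]),
                               Integer.parseInt(range[1]),
                               kv[1]));
        }
        // Sorted so the analysis chain can consume metadata in offset order.
        spans.sort(Comparator.comparingInt(s -> s.start));
        return spans;
    }
}
```

So for an input like "#enrich:0-5=PERSON,15-19=ORG\nAlice works at Acme", stripHeader() hands the analyzer the bare text while parseSpans() yields the offset-sorted metadata that a custom attribute could expose during tokenization.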