Hi All,

I am trying to enrich a Lucene-powered search index with data from various NLP systems distributed throughout my company. Ideally this internally derived data could be tied back to specific positions in the original text. I've searched around, and the closest thing I've found to what I'm trying to do is https://jorgelbg.me/2018/03/solr-contextual-synonyms-with-payloads/ . This approach is neat, but it has a few drawbacks because of its reliance on injected delimiters and its somewhat inflexible passing of data from PayloadAttribute to CharTermAttribute.
One thing that occurred to me is an offset-based approach, assuming of course that the input text is already properly encoded and sanitized. I'm thinking of implementing a CharFilter that decodes a special header, which carries an offset-sorted list of data for enrichment. This metadata could then be referenced during analysis via custom attributes, and ideally the same offset-accounting logic could serve a variety of use cases. Some uses that come to mind: stashing values in term/payload attributes, or even offset-based tokenization for those who want to tokenize outside of their search engine.

Does this approach make sense, or does it have pitfalls I'm failing to see? Assuming it makes sense, does a similar solution already exist? If it doesn't exist yet, would it be something of interest to the community? Any thoughts on this would be much appreciated.

Thanks,
Luke
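To make the idea concrete, here is a minimal sketch of just the header-decoding piece. Everything here is hypothetical: the `#enrich:` wire format, the class names, and the Span holder are things I'm inventing for illustration, not an existing API. A real implementation would wrap this in a CharFilter subclass and implement correctOffset() so that offsets seen downstream map back to positions in the original (header-stripped) text.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: decode an enrichment header prepended to the input.
// Assumed format (made up for illustration):
//   #enrich:<start>-<end>=<label>,<start>-<end>=<label>,...\n<original text>
public class EnrichmentHeaderParser {

    /** One offset-addressed piece of enrichment metadata. */
    public static final class Span {
        public final int start;   // inclusive char offset in the original text
        public final int end;     // exclusive char offset
        public final String label;
        public Span(int start, int end, String label) {
            this.start = start;
            this.end = end;
            this.label = label;
        }
    }

    public static final String PREFIX = "#enrich:";

    /** Returns the text with the header line removed, untouched if absent. */
    public static String stripHeader(String input) {
        if (!input.startsWith(PREFIX)) return input;
        int nl = input.indexOf('\n');
        return nl < 0 ? "" : input.substring(nl + 1);
    }

    /** Decodes "start-end=LABEL" entries, sorted by start offset. */
    public static List<Span> parseSpans(String input) {
        List<Span> spans = new ArrayList<>();
        if (!input.startsWith(PREFIX)) return spans;
        int nl = input.indexOf('\n');
        String header =
            input.substring(PREFIX.length(), nl < 0 ? input.length() : nl);
        for (String entry : header.split(",")) {
            String[] kv = entry.split("=", 2);
            String[] range = kv[0].split("-", 2);
            spans.add(new Span(Integer.parseInt(range[0]),
                               Integer.parseInt(range[1]),
                               kv[1]));
        }
        // Sorted so the analysis chain can consume metadata in offset order.
        spans.sort(Comparator.comparingInt(s -> s.start));
        return spans;
    }
}
```

So for an input like "#enrich:0-5=PERSON,15-19=ORG\nAlice works at Acme", stripHeader() hands the analyzer the bare text while parseSpans() yields the offset-sorted metadata that a custom attribute could expose during tokenization.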