One more idea: it's possible to ask Solr for the essential tokenization via
the /analysis/field API (here's a clue: https://stackoverflow.com/a/37785401),
get the token stream back in a structured response, and pass it into the NLP
pipeline for enrichment.
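A minimal client-side sketch of that request, assuming a local Solr with a hypothetical core name ("mycore") and field type ("text_general"); analysis.fieldtype and analysis.fieldvalue are the parameters the /analysis/field handler expects, but the host, core, and field type here are placeholders:

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class AnalysisRequest {
    // Build a request URL for Solr's /analysis/field handler.
    // The structured JSON response contains the token stream produced by
    // the named field type's analysis chain.
    static String buildUrl(String coreBase, String fieldType, String text) {
        return coreBase + "/analysis/field?analysis.fieldtype="
                + URLEncoder.encode(fieldType, StandardCharsets.UTF_8)
                + "&analysis.fieldvalue="
                + URLEncoder.encode(text, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        System.out.println(buildUrl("http://localhost:8983/solr/mycore",
                "text_general", "Hello NLP world"));
    }
}
```

From there the response can be parsed and each token (with its offsets) handed to the downstream NLP systems.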

On Wed, Feb 22, 2023 at 5:26 PM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <
lkotzanie...@bloomberg.net> wrote:

> Hi Mikhail,
>
> Thanks for the quick reply and the suggestion. This is definitely good to
> know about. In my case, however, there are several such NLP/data-extraction
> systems, and I am not sure they all use the same tokenization, but I will
> give this another look. I can see how this is a more well-defined solution
> to the problem I presented. I realize that with offsets you would have to
> make assumptions when offset boundaries fall in the middle of a token, and
> in other such odd cases.
>
> Thanks again,
> Luke
>
> From: java-user@lucene.apache.org At: 02/22/23 02:38:30 UTC-5:00 To:
> java-user@lucene.apache.org
> Subject: Re: Offset-Based Analysis
>
> Hello Luke.
>
> Using offsets seems really doubtful to me. What comes to mind is the
> pre-analyzed field:
>
> https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type
> .
> Thus, an external NLP service can provide ready-made tokens for
> straightforward indexing by Solr. That external NLP service would have full
> power to inject or suppress synonyms depending on the context, and to
> supply additional attributes in payloads (whether it's boldness,
> negative/positive stress, etc.) for retrieval of those payloads later.
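As a rough illustration of what such an external service might emit, here is a sketch that serializes tokens into the PreAnalyzedField JSON variant (v1). The token key names ("t" = term, "s"/"e" = start/end offsets, "i" = position increment) follow the format described in the linked docs, but verify them against your Solr version; real code would also need proper JSON string escaping, which is omitted here:

```java
import java.util.List;

public class PreAnalyzedJson {
    // One analyzed token: term text, character offsets, position increment.
    record Token(String term, int start, int end, int posInc) {}

    // Serialize a stored value plus ready-made tokens into the
    // PreAnalyzedField v1 JSON shape. NOTE: no JSON escaping is done,
    // so this sketch only handles terms without quotes or backslashes.
    static String toJson(String stored, List<Token> tokens) {
        StringBuilder sb = new StringBuilder("{\"v\":\"1\",\"str\":\"")
                .append(stored).append("\",\"tokens\":[");
        for (int i = 0; i < tokens.size(); i++) {
            Token t = tokens.get(i);
            if (i > 0) sb.append(',');
            sb.append(String.format("{\"t\":\"%s\",\"s\":%d,\"e\":%d,\"i\":%d}",
                    t.term(), t.start(), t.end(), t.posInc()));
        }
        return sb.append("]}").toString();
    }
}
```

Solr then indexes these tokens as-is, with no analysis chain of its own to disagree with the NLP side.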
>
> On Wed, Feb 22, 2023 at 6:43 AM Luke Kot-Zaniewski (BLOOMBERG/ 919 3RD A) <
> lkotzanie...@bloomberg.net> wrote:
>
> > Hi All,
> >
> > I am trying to enrich a Lucene-powered search index with data from
> > several different NLP systems that are distributed throughout my company.
> > Ideally this internally derived data could be tied back to specific
> > positions in the original text. I’ve searched around, and the closest
> > thing I’ve found to what I am trying to do is
> > https://jorgelbg.me/2018/03/solr-contextual-synonyms-with-payloads/ .
> > This approach is neat, but it has a few drawbacks because of its reliance
> > on injected delimiters and its somewhat inflexible passing of data from
> > PayloadAttribute to CharTermAttribute.
> >
> > One thing that occurred to me is to use an offset-based approach, of
> > course assuming the input text is already properly encoded and sanitized.
> > I’m thinking about implementing a CharFilter that decodes a special
> > header, which passes along an offset-sorted list of data for enrichment.
> > This metadata could be referenced during analysis via custom attributes
> > and could ideally handle a variety of use cases with the same
> > offset-accounting logic. Some uses that come to mind are stashing values
> > in term/payload attributes, or even offset-based tokenization for those
> > wishing to tokenize outside of their search engine.
> >
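A toy sketch of the offset-accounting idea above: given tokens with character offsets and an offset-sorted list of external annotations, attach each annotation to every token whose span it overlaps. An annotation boundary that falls mid-token is treated as covering the whole token, one possible assumption for the odd cases mentioned earlier in the thread. All names here are hypothetical, not an existing Lucene or Solr API:

```java
import java.util.*;

public class OffsetEnricher {
    record Token(String term, int start, int end) {}
    record Annotation(int start, int end, String label) {}

    // Map each token's term to the labels of all annotations overlapping
    // its [start, end) span. Annotations must be sorted by start offset.
    static Map<String, List<String>> enrich(List<Token> tokens,
                                            List<Annotation> anns) {
        Map<String, List<String>> out = new LinkedHashMap<>();
        for (Token tok : tokens) {
            List<String> labels = new ArrayList<>();
            for (Annotation a : anns) {
                if (a.start() >= tok.end()) break;   // sorted: rest start later
                if (a.end() > tok.start()) {
                    labels.add(a.label());           // overlap, even partial
                }
            }
            out.put(tok.term(), labels);
        }
        return out;
    }
}
```

In a real analysis chain this lookup would live in a TokenFilter consulting a custom attribute rather than a standalone method, but the overlap logic would be the same.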
> > Does this approach even make sense, or does it have pitfalls I am
> > failing to see? Assuming it makes sense, does a similar solution already
> > exist? If it doesn’t exist yet, would it be of interest to the
> > community?
> > Any thoughts on this would be much appreciated.
> >
> > Thanks,
> > Luke
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!

-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
