RE: Can an analyzer access other field's data during index time?

Wang, Guan Tue, 25 Apr 2023 06:08:47 -0700

Hi Mikhail,

Again, thank you so much for getting back to me!


Here is the scenario:

Given a document with an added header line:

HEADER_LINE
ORIGINAL_TEXT_LINE
ORIGINAL_TEXT_LINE...

And a field in managed-schema for the document:

<field name="RPT_TEXT" ... indexed="true" stored="true" ... />

I'd like to extract the information in the HEADER_LINE to guide the indexing 
for this document. When the document is stored in the field RPT_TEXT, I'd like 
to remove the HEADER_LINE so only the original document text will be saved.

With a custom tokenizer or the ConditionalTokenFilter you mentioned, 
extracting the HEADER_LINE should be straightforward. The remaining puzzle is 
how to strip the HEADER_LINE when the text is stored.
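
To make the intended split concrete, here is a minimal plain-Java sketch of 
separating the HEADER_LINE from the original text. It does not touch any 
Lucene API; the class name HeaderSplit and its members are hypothetical and 
only illustrate the behavior I am after, not how it would be wired into a 
tokenizer or field.

```java
// Hypothetical helper (not part of Lucene): separates the added header line
// from the original document text.
final class HeaderSplit {
    final String header; // first line, without its line terminator
    final String body;   // everything after the first line break

    private HeaderSplit(String header, String body) {
        this.header = header;
        this.body = body;
    }

    static HeaderSplit of(String raw) {
        int nl = raw.indexOf('\n');
        if (nl < 0) {
            // No line break at all: treat the input as header only.
            return new HeaderSplit(raw, "");
        }
        String header = raw.substring(0, nl);
        if (header.endsWith("\r")) {
            // Tolerate Windows-style line endings.
            header = header.substring(0, header.length() - 1);
        }
        return new HeaderSplit(header, raw.substring(nl + 1));
    }
}
```

In this sketch, header would drive the indexing decisions and body is what 
should end up as the stored value of RPT_TEXT.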

I took a look at the IndexingChain class. It turns out I will have to write a 
custom Field class and override its stringValue() method.

In a nutshell, I will need two parts to make this work:

1. a custom tokenizer/filter;
2. a custom field;

Let me know if there is any caveat...

And thank you so much for guiding me through!

Guan

-----Original Message-----
From: Mikhail Khludnev <m...@apache.org>
Sent: Tuesday, April 25, 2023 4:40 AM
To: java-user@lucene.apache.org
Subject: Re: Can an analyzer access other field's data during index time?

External Email - Use Caution

Guan,
I hardly grasp the particular obstacle. But I don't think that the task is out 
of reach overall. Can you share a test case formally describing the desired 
behavior?

On Tue, Apr 25, 2023 at 12:29 AM Wang, Guan <wan...@med.umich.edu> wrote:

> Hi Mikhail,
>
> Thank you for introducing abstract class ConditionalTokenFilter to me!
> I took a quick look: it's a wrapper around the upstream TokenStream that
> applies its filter conditionally.
>
> So, if I have a document like:
>
> HEADER
> TEXT
> TEXT
>
> Implementing ConditionalTokenFilter could tokenize only lines 2 and 3.
> However, all 3 lines would still be stored in the field if indexed=true
> and stored=true...
>
> I wonder if I could only store line 2 and 3 in the field in such a
> scenario?
>
> Many thanks,
>
> Guan
>
> -----Original Message-----
> From: Mikhail Khludnev <m...@apache.org>
> Sent: Monday, April 24, 2023 4:56 PM
> To: java-user@lucene.apache.org
> Subject: Re: Can an analyzer access other field's data during index time?
>
> Well.. maybe something like
>
> https://lucene.apache.org/core/8_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html
> ?
>
> On Mon, Apr 24, 2023 at 11:40 PM Wang, Guan <wan...@med.umich.edu> wrote:
>
> > Hi Mikhail,
> >
> > Thank you for the definitive answer!
> >
> > I could "solve" this by adding a header in the document with proper
> > information to guide the indexing process. Header will be parsed
> > then ignored by the tokenizer. However, the header along with the
> > actual text will be stored together in that field...
> >
> > I wonder (again...) if it's possible I may control which part of the
> > text shall be stored during the index process? In other words, is it
> > possible to strip the header when storing the text into the field?
> >
> > Best regards,
> >
> > Guan
> >
> > -----Original Message-----
> > From: Mikhail Khludnev <m...@apache.org>
> > Sent: Monday, April 24, 2023 4:20 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Can an analyzer access other field's data during index time?
> >
> > Hello Guan.
> > It reminds me of the https://youtu.be/EkkzSLstSAE?t=1531 timecode.
> > I'm afraid it's quite far from the existing codebase where the Field
> > has no reference to enclosing Document. sigh.
> >
> >
> > On Mon, Apr 24, 2023 at 6:00 PM Wang, Guan <wan...@med.umich.edu> wrote:
> >
> > > Hi,
> > >
> > > > I understand that Lucene analyzers work on a per-field basis. But I
> > > > wonder if it's even possible for an analyzer on field A to access
> > > > data in field B at any stage of the indexing process, say in a
> > > > CharFilter, Tokenizer or TokenFilter?
> > >
> > > I'd like to control the behavior of the indexing process for field
> > > A based upon the value in field B.
> > >
> > > Mighty Lucene community, please let me know if this is doable...
> > >
> > > Many thanks,
> > >
> > > Guan
> > > **********************************************************
> > > Electronic Mail is not secure, may not be read every day, and
> > > should not be used for urgent or sensitive issues
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!
>


--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
