Re: Can an analyzer access other field's data during index time?

Mikhail Khludnev Wed, 26 Apr 2023 13:15:26 -0700

Hello,
It sounds like you are talking about Solr (though this is the Lucene core
mailing list).
Manipulating what gets stored is definitely not the analyzer's job.
Overriding org.apache.solr.schema.FieldType#createFields can be used to
yield indexed and stored (Lucene) fields with different content.
If your logic is that involved, you may also consider moving the analysis
out of Solr entirely and sending pre-analyzed field values:
https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type
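
Either route (a createFields override or a pre-analyzed field) first has to
separate the header line from the text that should be stored. A minimal sketch
of that step in plain Java, with no Solr or Lucene dependency. The class and
method names are hypothetical, and it assumes the header is exactly the first
line of the incoming value:

```java
// Hypothetical helper for illustration; not part of Solr or Lucene.
final class HeaderSplitter {

    /** Holds the two parts of an incoming document value. */
    static final class Parts {
        final String header; // first line, used only to steer indexing
        final String body;   // everything after the first line break

        Parts(String header, String body) {
            this.header = header;
            this.body = body;
        }
    }

    /** Splits at the first line break; assumes the header is exactly line 1. */
    static Parts split(String raw) {
        int nl = raw.indexOf('\n');
        if (nl < 0) {
            // Only one line: treat it as a header with an empty body.
            return new Parts(raw, "");
        }
        String header = raw.substring(0, nl);
        // Tolerate Windows line endings (\r\n).
        if (header.endsWith("\r")) {
            header = header.substring(0, header.length() - 1);
        }
        return new Parts(header, raw.substring(nl + 1));
    }

    public static void main(String[] args) {
        Parts p = split("HEADER_LINE\nORIGINAL_TEXT_LINE\nORIGINAL_TEXT_LINE");
        System.out.println("header: " + p.header);
        System.out.println("stored: " + p.body);
    }
}
```

In a createFields override, body would be the value handed to the stored field,
while header would only influence how the indexed field is built. The exact
override signature depends on the Solr version, so it is not shown here.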



On Tue, Apr 25, 2023 at 4:08 PM Wang, Guan <[email protected]> wrote:

> Hi Mikhail,
>
> Again, thank you so much for getting back to me!
>
> Here is the scenario:
>
> Given a document with an added header line:
>
> HEADER_LINE
> ORIGINAL_TEXT_LINE
> ORIGINAL_TEXT_LINE...
>
> And a field in managed-schema for the document:
>
> <field name="RPT_TEXT" ... indexed="true" stored="true" ... />
>
> I'd like to extract the information in the HEADER_LINE to guide the
> indexing for this document. When the document is stored in the field
> RPT_TEXT, I'd like to remove the HEADER_LINE so only the original document
> text will be saved.
>
> With a custom tokenizer or the ConditionalTokenFilter as you mentioned,
> extracting the HEADER_LINE should be straightforward. The remaining puzzle
> is to strip the HEADER_LINE during saving.
>
> I took a look at IndexingChain class. It turns out I will have to do a
> custom Field class and override stringValue() method.
>
> In a nutshell, I will need two parts to make this work:
>
> 1. a custom tokenizer/filter;
> 2. a custom field;
>
> Let me know if there is any caveat...
>
> And thank you so much for guiding me through!
>
> Guan
>
> -----Original Message-----
> From: Mikhail Khludnev <[email protected]>
> Sent: Tuesday, April 25, 2023 4:40 AM
> To: [email protected]
> Subject: Re: Can an analyzer access other field's data during index time?
>
> Guan,
> I hardly grasp the particular obstacle. But I don't think that the task is
> out of reach overall. Can you share a test case formally describing the
> desired behavior?
>
> On Tue, Apr 25, 2023 at 12:29 AM Wang, Guan <[email protected]> wrote:
>
> > Hi Mikhail,
> >
> > Thank you for introducing abstract class ConditionalTokenFilter to me!
> > Took a quick look, it's a wrapper of the upstream TokenStream with
> > conditional rendition.
> >
> > So, if I have a document like:
> >
> > HEADER
> > TEXT
> > TEXT
> >
> > With a ConditionalTokenFilter I could tokenize only lines 2 and 3.
> > However, all 3 lines would still be stored in the field if
> > indexed=true and stored=true...
> >
> > I wonder if I could store only lines 2 and 3 in the field in such a
> > scenario?
> >
> > Many thanks,
> >
> > Guan
> >
> > -----Original Message-----
> > From: Mikhail Khludnev <[email protected]>
> > Sent: Monday, April 24, 2023 4:56 PM
> > To: [email protected]
> > Subject: Re: Can an analyzer access other field's data during index time?
> >
> > Well.. maybe something like
> >
> > https://lucene.apache.org/core/8_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html
> > ?
> >
> > On Mon, Apr 24, 2023 at 11:40 PM Wang, Guan <[email protected]> wrote:
> >
> > > Hi Mikhail,
> > >
> > > Thank you for the definitive answer!
> > >
> > > I could "solve" this by adding a header in the document with proper
> > > information to guide the indexing process. Header will be parsed
> > > then ignored by the tokenizer. However, the header along with the
> > > actual text will be stored together in that field...
> > >
> > > I wonder (again...) if it's possible I may control which part of the
> > > text shall be stored during the index process? In other words, is it
> > > possible to strip the header when storing the text into the field?
> > >
> > > Best regards,
> > >
> > > Guan
> > >
> > > -----Original Message-----
> > > From: Mikhail Khludnev <[email protected]>
> > > Sent: Monday, April 24, 2023 4:20 PM
> > > To: [email protected]
> > > Subject: Re: Can an analyzer access other field's data during index time?
> > >
> > > Hello Guan.
> > > It reminds me of https://youtu.be/EkkzSLstSAE?t=1531 (see the timecode).
> > > I'm afraid it's quite far from the existing codebase, where the Field
> > > has no reference to the enclosing Document. Sigh.
> > >
> > >
> > > On Mon, Apr 24, 2023 at 6:00 PM Wang, Guan <[email protected]> wrote:
> > >
> > > > Hi,
> > > >
> > > > I understand that Lucene analyzers work on a per-field basis. But I
> > > > wonder if it's even possible for an analyzer on field A to access
> > > > data in field B at any stage of the indexing process, say in a
> > > > CharFilter, Tokenizer or TokenFilter?
> > > >
> > > > I'd like to control the behavior of the indexing process for field
> > > > A based upon the value in field B.
> > > >
> > > > Mighty Lucene community, please let me know if this is doable...
> > > >
> > > > Many thanks,
> > > >
> > > > Guan
> > > > **********************************************************
> > > > Electronic Mail is not secure, may not be read every day, and
> > > > should not be used for urgent or sensitive issues
> > > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > https://t.me/MUST_SEARCH
> > > A caveat: Cyrillic!
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!
>

-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
