Hello,
It sounds like you are talking about Solr (though this is the Lucene core
mailing list).
If you want to manipulate what gets stored, that is certainly not the
analyzer's job.
Overriding org.apache.solr.schema.FieldType#createFields lets you yield
indexed and stored (Lucene) fields with different content.
If your logic is that extensive, you may also consider extracting the
analysis logic completely, via the PreAnalyzedField type:
https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type
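
To illustrate the stored-versus-indexed split, here is a minimal sketch of
the header-stripping step in plain Java, with no Solr or Lucene dependency.
HeaderSplitter and Parts are hypothetical names, and the sketch assumes the
header is exactly the first line of the incoming value. Inside a
createFields override, the body would presumably go into the stored field,
while the header would decide how the indexed field is built.

```java
import java.util.Objects;

/**
 * Hypothetical helper: splits a raw value of the form
 * "HEADER_LINE\nORIGINAL_TEXT..." into header and body.
 * Assumption: the header is always the first line of the value.
 */
final class HeaderSplitter {

    /** Immutable pair holding the header line and the remaining text. */
    static final class Parts {
        final String header;
        final String body;

        Parts(String header, String body) {
            this.header = header;
            this.body = body;
        }
    }

    private HeaderSplitter() {}

    static Parts split(String raw) {
        Objects.requireNonNull(raw, "raw");
        int eol = raw.indexOf('\n');
        if (eol < 0) {
            // No line break: treat the whole value as a header with an empty body.
            return new Parts(raw, "");
        }
        String header = raw.substring(0, eol);
        // Tolerate Windows line endings by dropping a trailing '\r'.
        if (header.endsWith("\r")) {
            header = header.substring(0, header.length() - 1);
        }
        // Everything after the first line break is the original document text.
        return new Parts(header, raw.substring(eol + 1));
    }
}
```

Only Parts.body would be handed to the stored field, so the header never
reaches the stored value; Parts.header stays available to guide indexing.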


On Tue, Apr 25, 2023 at 4:08 PM Wang, Guan <wan...@med.umich.edu> wrote:

> Hi Mikhail,
>
> Again, thank you so much for getting back to me!
>
> Here is the scenario:
>
> Given a document with an added header line:
>
> HEADER_LINE
> ORIGINAL_TEXT_LINE
> ORIGINAL_TEXT_LINE...
>
> And a field in managed-schema for the document:
>
> <field name="RPT_TEXT" ... indexed="true" stored="true" ... />
>
> I'd like to extract the information in the HEADER_LINE to guide the
> indexing for this document. When the document is stored in the field
> RPT_TEXT, I'd like to remove the HEADER_LINE so only the original document
> text will be saved.
>
> With a custom tokenizer or the ConditionalTokenFilter as you mentioned,
> extracting the HEADER_LINE should be straightforward. The remaining puzzle
> is to strip the HEADER_LINE during saving.
>
> I took a look at the IndexingChain class. It turns out I will have to write
> a custom Field class and override the stringValue() method.
>
> In a nutshell, I will need two parts to make this work:
>
> 1. a custom tokenizer/filter;
> 2. a custom field;
>
> Let me know if there is any caveat...
>
> And thank you so much for guiding me through!
>
> Guan
>
> -----Original Message-----
> From: Mikhail Khludnev <m...@apache.org>
> Sent: Tuesday, April 25, 2023 4:40 AM
> To: java-user@lucene.apache.org
> Subject: Re: Can an analyzer access other field's data during index time?
>
> External Email - Use Caution
>
> Guan,
> I can hardly grasp what the particular obstacle is, but I don't think the
> task is out of reach overall. Can you share a test case formally describing
> the desired behavior?
>
> On Tue, Apr 25, 2023 at 12:29 AM Wang, Guan <wan...@med.umich.edu> wrote:
>
> > Hi Mikhail,
> >
> > Thank you for introducing abstract class ConditionalTokenFilter to me!
> > I took a quick look: it's a wrapper around the upstream TokenStream with
> > conditional rendition.
> >
> > So, if I have a document like:
> >
> > HEADER
> > TEXT
> > TEXT
> >
> > By implementing ConditionalTokenFilter, I could tokenize only lines 2 and 3.
> > However, all 3 lines would still be stored in the field if indexed=true
> > and stored=true...
> >
> > I wonder if I could store only lines 2 and 3 in the field in such a
> > scenario?
> >
> > Many thanks,
> >
> > Guan
> >
> > -----Original Message-----
> > From: Mikhail Khludnev <m...@apache.org>
> > Sent: Monday, April 24, 2023 4:56 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Can an analyzer access other field's data during index time?
> >
> > Well.. maybe something like
> >
> > https://lucene.apache.org/core/8_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html
> > ?
> >
> > On Mon, Apr 24, 2023 at 11:40 PM Wang, Guan <wan...@med.umich.edu> wrote:
> >
> > > Hi Mikhail,
> > >
> > > Thank you for the definitive answer!
> > >
> > > I could "solve" this by adding a header in the document with proper
> > > information to guide the indexing process. Header will be parsed
> > > then ignored by the tokenizer. However, the header along with the
> > > actual text will be stored together in that field...
> > >
> > > I wonder (again...) whether I can control which part of the text is
> > > stored during the indexing process. In other words, is it possible
> > > to strip the header when storing the text into the field?
> > >
> > > Best regards,
> > >
> > > Guan
> > >
> > > -----Original Message-----
> > > From: Mikhail Khludnev <m...@apache.org>
> > > Sent: Monday, April 24, 2023 4:20 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Can an analyzer access other field's data during index time?
> > >
> > > Hello Guan.
> > > It reminds me of this moment: https://youtu.be/EkkzSLstSAE?t=1531
> > > I'm afraid that's quite far from the existing codebase, where a Field
> > > has no reference to its enclosing Document. Sigh.
> > >
> > >
> > > On Mon, Apr 24, 2023 at 6:00 PM Wang, Guan <wan...@med.umich.edu> wrote:
> > >
> > > > Hi,
> > > >
> > > > I understand that Lucene analyzers work on a per-field basis. But I
> > > > wonder whether an analyzer on field A can access data in field B at
> > > > any stage of the indexing process, say in a CharFilter, Tokenizer
> > > > or TokenFilter?
> > > >
> > > > I'd like to control the behavior of the indexing process for field
> > > > A based upon the value in field B.
> > > >
> > > > Mighty Lucene community, please let me know if this is doable...
> > > >
> > > > Many thanks,
> > > >
> > > > Guan
> > > > **********************************************************
> > > > Electronic Mail is not secure, may not be read every day, and
> > > > should not be used for urgent or sensitive issues
> > > >
> > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > https://t.me/MUST_SEARCH
> > > A caveat: Cyrillic!
> > >
> >
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
> >
>
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!
>
>

-- 
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!
