Hello,

It sounds like you are talking about Solr (though this is the Lucene core mailing list). If you want to manipulate what gets stored, that is certainly not the analyzer's job: the analyzer only produces the token stream for the index, while the stored value is taken from the field itself. Overriding org.apache.solr.schema.FieldType#createFields lets you produce indexed and stored (Lucene) fields with different content. If your logic is that involved, you may also consider moving the analysis out of Solr entirely and feeding it pre-analyzed content:
https://solr.apache.org/guide/solr/latest/indexing-guide/external-files-processes.html#the-preanalyzedfield-type

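To illustrate the splitting step only, here is a minimal sketch in plain Java with no Solr or Lucene dependency. HeaderSplit is a hypothetical helper name, and it assumes the header is always the first line of the incoming value. Inside a createFields override, the body would feed the stored field, and the header could decide how the indexed field is built.

```java
// Hypothetical helper for illustration; not part of Lucene or Solr.
final class HeaderSplit {
    final String header; // first line, meant only to steer indexing
    final String body;   // remaining text, the only part meant to be stored

    private HeaderSplit(String header, String body) {
        this.header = header;
        this.body = body;
    }

    /**
     * Splits the raw value at the first line break.
     * Assumption: a value without any line break is all header, empty body.
     */
    static HeaderSplit of(String raw) {
        int nl = raw.indexOf('\n');
        if (nl < 0) {
            return new HeaderSplit(raw, "");
        }
        String header = raw.substring(0, nl);
        // Tolerate a Windows line ending on the header line.
        if (header.endsWith("\r")) {
            header = header.substring(0, header.length() - 1);
        }
        return new HeaderSplit(header, raw.substring(nl + 1));
    }

    public static void main(String[] args) {
        HeaderSplit parts =
            HeaderSplit.of("HEADER_LINE\nORIGINAL_TEXT_LINE\nORIGINAL_TEXT_LINE");
        System.out.println("header: " + parts.header);
        System.out.println("stored: " + parts.body);
    }
}
```

In plain Lucene terms this would roughly correspond to adding a TextField with Field.Store.NO for the indexed side next to a StoredField of the same name that carries only the body; how that maps onto your Solr field type is something you would have to work out in your createFields override.
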
On Tue, Apr 25, 2023 at 4:08 PM Wang, Guan <wan...@med.umich.edu> wrote:

> Hi Mikhail,
>
> Again, thank you so much for getting back to me!
>
> Here is the scenario:
>
> Given a document with an added header line:
>
> HEADER_LINE
> ORIGINAL_TEXT_LINE
> ORIGINAL_TEXT_LINE...
>
> And a field in managed-schema for the document:
>
> <field name="RPT_TEXT" ... indexed="true" stored="true" ... />
>
> I'd like to extract the information in the HEADER_LINE to guide the
> indexing for this document. When the document is stored in the field
> RPT_TEXT, I'd like to remove the HEADER_LINE so only the original document
> text will be saved.
>
> With a custom tokenizer or the ConditionalTokenFilter as you mentioned,
> extracting the HEADER_LINE should be straightforward. The remaining puzzle
> is to strip the HEADER_LINE during saving.
>
> I took a look at the IndexingChain class. It turns out I will have to write
> a custom Field class and override the stringValue() method.
>
> In a nutshell, I will need two parts to make this work:
>
> 1. a custom tokenizer/filter;
> 2. a custom field;
>
> Let me know if there is any caveat...
>
> And thank you so much for guiding me through!
>
> Guan
>
> -----Original Message-----
> From: Mikhail Khludnev <m...@apache.org>
> Sent: Tuesday, April 25, 2023 4:40 AM
> To: java-user@lucene.apache.org
> Subject: Re: Can an analyzer access other field's data during index time?
>
> External Email - Use Caution
>
> Guan,
> I hardly grasp the particular obstacle. But I don't think that the task is
> out of reach overall. Can you share a test case formally describing the
> desired behavior?
>
> On Tue, Apr 25, 2023 at 12:29 AM Wang, Guan <wan...@med.umich.edu> wrote:
>
> > Hi Mikhail,
> >
> > Thank you for introducing the abstract class ConditionalTokenFilter to me!
> > Took a quick look, it's a wrapper of the upstream TokenStream with
> > conditional rendition.
> >
> > So, if I have a document like:
> >
> > HEADER
> > TEXT
> > TEXT
> >
> > Implementing ConditionalTokenFilter could only tokenize lines 2 and 3.
> > However, all 3 lines would still be stored in the field if indexed=true
> > and stored=true...
> >
> > I wonder if I could only store lines 2 and 3 in the field in such a
> > scenario?
> >
> > Many thanks,
> >
> > Guan
> >
> > -----Original Message-----
> > From: Mikhail Khludnev <m...@apache.org>
> > Sent: Monday, April 24, 2023 4:56 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Can an analyzer access other field's data during index time?
> >
> > Well.. maybe something like
> > https://lucene.apache.org/core/8_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html
> > ?
> >
> > On Mon, Apr 24, 2023 at 11:40 PM Wang, Guan <wan...@med.umich.edu> wrote:
> >
> > > Hi Mikhail,
> > >
> > > Thank you for the definitive answer!
> > >
> > > I could "solve" this by adding a header in the document with proper
> > > information to guide the indexing process. The header will be parsed,
> > > then ignored by the tokenizer. However, the header along with the
> > > actual text will be stored together in that field...
> > >
> > > I wonder (again...) if it's possible to control which part of the
> > > text shall be stored during the index process? In other words, is it
> > > possible to strip the header when storing the text into the field?
> > >
> > > Best regards,
> > >
> > > Guan
> > >
> > > -----Original Message-----
> > > From: Mikhail Khludnev <m...@apache.org>
> > > Sent: Monday, April 24, 2023 4:20 PM
> > > To: java-user@lucene.apache.org
> > > Subject: Re: Can an analyzer access other field's data during index time?
> > >
> > > Hello Guan.
> > > It reminds me https://youtu.be/EkkzSLstSAE?t=1531 timecode.
> > > I'm afraid it's quite far from the existing codebase where the Field
> > > has no reference to enclosing Document. sigh.
> > >
> > > On Mon, Apr 24, 2023 at 6:00 PM Wang, Guan <wan...@med.umich.edu> wrote:
> > >
> > > > Hi,
> > > >
> > > > I understand Lucene analyzer is per field basis. But I wonder if
> > > > it's even possible for an analyzer on field A to be able to access
> > > > data in field B during the index process on any stage, saying
> > > > CharFilter, Tokenizer or TokenFilter?
> > > >
> > > > I'd like to control the behavior of the indexing process for field
> > > > A based upon the value in field B.
> > > >
> > > > Mighty Lucene community, please let me know if this is doable...
> > > >
> > > > Many thanks,
> > > >
> > > > Guan
> > > > **********************************************************
> > > > Electronic Mail is not secure, may not be read every day, and
> > > > should not be used for urgent or sensitive issues
> > > >
> > >
> > > --
> > > Sincerely yours
> > > Mikhail Khludnev
> > > https://t.me/MUST_SEARCH
> > > A caveat: Cyrillic!
> > >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!