Hi Mikhail,

Again, thank you so much for getting back to me!
Here is the scenario:

Given a document with an added header line:

HEADER_LINE
ORIGINAL_TEXT_LINE
ORIGINAL_TEXT_LINE
...

And a field in managed-schema for the document:

<field name="RPT_TEXT" ... indexed="true" stored="true" ... />

I'd like to extract the information in the HEADER_LINE to guide the indexing for this document. When the document is stored in the field RPT_TEXT, I'd like to remove the HEADER_LINE so that only the original document text is saved.

With a custom tokenizer or the ConditionalTokenFilter you mentioned, extracting the HEADER_LINE should be straightforward. The remaining puzzle is stripping the HEADER_LINE from the stored value.

I took a look at the IndexingChain class. It turns out I will have to write a custom Field class and override its stringValue() method.

In a nutshell, I will need two parts to make this work:

1. a custom tokenizer/filter;
2. a custom field.

Let me know if there are any caveats... And thank you so much for guiding me through!

Guan

-----Original Message-----
From: Mikhail Khludnev <m...@apache.org>
Sent: Tuesday, April 25, 2023 4:40 AM
To: java-user@lucene.apache.org
Subject: Re: Can an analyzer access other field's data during index time?

External Email - Use Caution

Guan,
I hardly grasp the particular obstacle. But I don't think that the task is out of reach overall. Can you share a test case formally describing the desired behavior?

On Tue, Apr 25, 2023 at 12:29 AM Wang, Guan <wan...@med.umich.edu> wrote:

> Hi Mikhail,
>
> Thank you for introducing abstract class ConditionalTokenFilter to me!
> Took a quick look, it's a wrapper of the upstream TokenStream with
> conditional rendition.
>
> So, if I have a document like:
>
> HEADER
> TEXT
> TEXT
>
> Implementing ConditionalTokenFilter could only tokenize lines 2 and 3.
> However, all 3 lines would still be stored in the field if index=true
> and stored=true...
>
> I wonder if I could only store lines 2 and 3 in the field in such a
> scenario?
>
> Many thanks,
>
> Guan
>
> -----Original Message-----
> From: Mikhail Khludnev <m...@apache.org>
> Sent: Monday, April 24, 2023 4:56 PM
> To: java-user@lucene.apache.org
> Subject: Re: Can an analyzer access other field's data during index time?
>
> Well.. maybe something like
> https://lucene.apache.org/core/8_5_1/analyzers-common/org/apache/lucene/analysis/miscellaneous/ConditionalTokenFilter.html
> ?
>
> On Mon, Apr 24, 2023 at 11:40 PM Wang, Guan <wan...@med.umich.edu> wrote:
>
> > Hi Mikhail,
> >
> > Thank you for the definitive answer!
> >
> > I could "solve" this by adding a header in the document with proper
> > information to guide the indexing process. Header will be parsed
> > then ignored by the tokenizer. However, the header along with the
> > actual text will be stored together in that field...
> >
> > I wonder (again...) if it's possible I may control which part of the
> > text shall be stored during the index process? In other words, is it
> > possible to strip the header when storing the text into the field?
> >
> > Best regards,
> >
> > Guan
> >
> > -----Original Message-----
> > From: Mikhail Khludnev <m...@apache.org>
> > Sent: Monday, April 24, 2023 4:20 PM
> > To: java-user@lucene.apache.org
> > Subject: Re: Can an analyzer access other field's data during index time?
> >
> > Hello Guan.
> > It reminds me https://youtu.be/EkkzSLstSAE?t=1531 timecode.
> > I'm afraid it's quite far from the existing codebase where the Field
> > has no reference to enclosing Document. sigh.
> >
> > On Mon, Apr 24, 2023 at 6:00 PM Wang, Guan <wan...@med.umich.edu> wrote:
> >
> > > Hi,
> > >
> > > I understand Lucene analyzer is per field basis. But I wonder if
> > > it's even possible for an analyzer on field A to be able to access
> > > data in field B during the index process on any stage, saying
> > > CharFilter, Tokenizer or TokenFilter?
> > >
> > > I'd like to control the behavior of the indexing process for field
> > > A based upon the value in field B.
> > >
> > > Mighty Lucene community, please let me know if this is doable...
> > >
> > > Many thanks,
> > >
> > > Guan
> > > **********************************************************
> > > Electronic Mail is not secure, may not be read every day, and
> > > should not be used for urgent or sensitive issues
> >
> > --
> > Sincerely yours
> > Mikhail Khludnev
> > https://t.me/MUST_SEARCH
> > A caveat: Cyrillic!
>
> --
> Sincerely yours
> Mikhail Khludnev
> https://t.me/MUST_SEARCH
> A caveat: Cyrillic!

--
Sincerely yours
Mikhail Khludnev
https://t.me/MUST_SEARCH
A caveat: Cyrillic!

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org
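
Both parts of the plan in the top message (reading the HEADER_LINE to guide indexing, and keeping it out of the stored RPT_TEXT value) rest on the same step: separating the first line from the rest of the text. Below is a minimal sketch of that step in plain Java with no Lucene dependency. The class HeaderSplit and its methods are hypothetical names introduced here for illustration, not part of Lucene or Solr, and the sketch assumes the header is exactly the first line of the input.

```java
import java.util.Objects;

/** Hypothetical helper (not a Lucene class): separates HEADER_LINE from the original text. */
final class HeaderSplit {
    private final String header;
    private final String body;

    private HeaderSplit(String header, String body) {
        this.header = header;
        this.body = body;
    }

    /** Splits at the first line break; assumes the header is exactly the first line. */
    static HeaderSplit of(String raw) {
        Objects.requireNonNull(raw, "raw");
        int nl = raw.indexOf('\n');
        if (nl < 0) {
            // Assumption: a single-line input is all header and has no body.
            return new HeaderSplit(raw, "");
        }
        String header = raw.substring(0, nl);
        // Tolerate Windows line endings on the header line.
        if (header.endsWith("\r")) {
            header = header.substring(0, header.length() - 1);
        }
        return new HeaderSplit(header, raw.substring(nl + 1));
    }

    /** The first line, intended to guide how the text is indexed. */
    String header() {
        return header;
    }

    /** Everything after the first line break: the original document text. */
    String body() {
        return body;
    }

    public static void main(String[] args) {
        HeaderSplit doc = HeaderSplit.of("HEADER_LINE\nORIGINAL_TEXT_LINE\nORIGINAL_TEXT_LINE");
        System.out.println("header: " + doc.header());
        System.out.println("body:\n" + doc.body());
    }
}
```

In this sketch, body() is the text one would hand over as the stored value of RPT_TEXT, and header() is what would drive the choice of analysis. How that is wired into a custom Field, a tokenizer, or the code that builds the document is left open here.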