Hi Lukas,

1. My problem was not the parser. I am able to extract the required text fragments from the HTML document and index them. But when Lucene returns a Hit, I am not sure how to correlate it back to the different portions of the HTML document. Assuming that I use JTidy and have a DOM, how will I know whether the matched keywords are from a hyperlink or a header node?

2. I didn't know that a Phrase query would work once I split the content into multiple fields. I will try this out. However, your last suggestion, to have one "content" field that contains everything and other fields for hyperlink and header tags, should work too. In any case, I need to retokenize/reparse the original document to figure out if matched terms belong to a specific tag.

thanks
Ramesh

----- Original Message -----
From: "Lukas Zapletal" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, January 22, 2003 2:35 AM
Subject: Re: Correlating matched terms with Document

> Hello
>
> >I have a strange requirement. I am indexing a single HTML Document and
> >searching it immediately for one or more keywords (Boolean/Phrase query).
> >When the keywords are found in the document, I would like to
> >know if the matched keywords are from hyperlink text, a paragraph or one of
> ><h1>, <h2> etc tags.
> >
> I suggest to use JTidy when parsing document. It supports DOM XML model.
> You can easily extract all hyperlinks or headlines. The great thing is
> that also BAD documents are cleaned and repaired before DOM parsing..
>
> >a) I cannot add multiple fields as I need to do "Phrase" query.
> >
> Why not to split it to fields? It should work...
>
> >b) During the tokenization, I know exactly if a particular token is from a
> >specific tag. Can this be stored in
> >the index as some user-defined flags or something like that and later
> >retrieve it. Looking at the API, it doesn't seem to be possible.
> >I see that I can associate token type (such as "word", "eol" ) with the
> >analyzer token, but this is not stored in the index.
> >
> Change parser. See JTidy above.
>
> >c) One option seems to be to re-tokenize the document after search - like
> >some of the highlight summary examples are doing. Then
> >I can match the document tokens with the terms.
> >
>
> I suggest to make this fields:
>
> content: all content with links and headlines
> headlines: only headlines
> hrefs: only hrefs etc
>
> When you search a phrase, it will match at least content. When there
> will be some hits from headlines, you know this is headline. Am I right?
>
> --
> Lukas Zapletal [[EMAIL PROTECTED]]
> http://www.tanecni-olomouc.cz/lzap
> No viruses in this mail. AVASThome
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>
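
For what it's worth, here is a rough sketch of the "content / headlines / hrefs" idea from the quoted message. It deliberately uses neither Lucene nor JTidy: the JDK's own DOM parser stands in for the DOM that JTidy would produce (so the input has to be well-formed XHTML), and a naive lowercase token comparison stands in for an analyzer and a PhraseQuery. The class name TagFieldMatcher and all of its methods are made up for illustration, and "hrefs" is assumed to mean the hyperlink text rather than the URL.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper; not part of Lucene or JTidy.
class TagFieldMatcher {

    // Text of every element with the given tag name, one entry per element.
    static List<String> textOf(Document dom, String tag) {
        List<String> out = new ArrayList<>();
        NodeList nodes = dom.getElementsByTagName(tag);
        for (int i = 0; i < nodes.getLength(); i++) {
            out.add(nodes.item(i).getTextContent());
        }
        return out;
    }

    // Lowercase and split on non-alphanumerics: a crude stand-in for an analyzer.
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (!t.isEmpty()) out.add(t);
        }
        return out;
    }

    // True if the phrase tokens occur consecutively inside one single value.
    static boolean containsPhrase(List<String> values, String phrase) {
        List<String> wanted = tokens(phrase);
        if (wanted.isEmpty()) return false;
        for (String value : values) {
            if (Collections.indexOfSubList(tokens(value), wanted) >= 0) return true;
        }
        return false;
    }

    // Build the three "fields": everything, headline text, hyperlink text.
    static Map<String, List<String>> fields(String xhtml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XHTML", e);
        }
        Map<String, List<String>> fields = new LinkedHashMap<>();
        // Note: adjacent elements with no whitespace between them run together here.
        fields.put("content", List.of(dom.getDocumentElement().getTextContent()));
        List<String> headlines = new ArrayList<>();
        for (String h : new String[] {"h1", "h2", "h3"}) {
            headlines.addAll(textOf(dom, h));
        }
        fields.put("headlines", headlines);
        fields.put("hrefs", textOf(dom, "a"));
        return fields;
    }

    // Names of the fields in which the phrase occurs, in insertion order.
    static List<String> matchingFields(String xhtml, String phrase) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : fields(xhtml).entrySet()) {
            if (containsPhrase(e.getValue(), phrase)) hits.add(e.getKey());
        }
        return hits;
    }
}
```

With a page that has `<h1>Lucene index basics</h1>` followed by a paragraph containing a link whose text is "phrase query FAQ", matchingFields(page, "phrase query") comes back as [content, hrefs], and a phrase that only occurs in plain paragraph text comes back as [content]. In a real index the same question would be asked by running the phrase query against each field in turn; this only tells you which kind of tag matched, not which occurrence, so the re-tokenizing step is still needed for exact positions.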
