Hi Lukas,

1. My problem was not the parser. I am able to extract the required text
fragments from the HTML document and index them. But when Lucene returns
a Hit, I am not sure how to correlate it back to the different portions
of the HTML document. Assuming that I use JTidy and have a DOM, how will
I know whether the matched keywords came from a hyperlink or a header
node?

2. I didn't know that a phrase query would still work once I split the
content into multiple fields. I will try this out. However, your last
suggestion, to have one "content" field that contains everything plus
separate fields for hyperlink and header text, should work too.
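
To make sure I understand the field layout, here is a rough sketch of it
in plain Java. It is not Lucene code: the JDK's XML parser stands in for
JTidy (so the input must already be well-formed XHTML with lowercase
tags), a map of strings stands in for the indexed fields, and a
lowercase substring test stands in for a real phrase query. The class
and method names are made up for illustration.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper: builds "content", "headlines" and "hrefs" texts
// and reports which of them contain a phrase.
class FieldSplitSketch {

    // Assumption: input is well-formed XHTML (e.g. already cleaned by a tidy step).
    static Map<String, String> buildFields(String xhtml) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
        Map<String, String> fields = new LinkedHashMap<>();
        // "content" holds all text, including link and headline text.
        fields.put("content", normalize(dom.getDocumentElement().getTextContent()));
        fields.put("headlines", collect(dom, "h1", "h2", "h3", "h4", "h5", "h6"));
        fields.put("hrefs", collect(dom, "a"));
        return fields;
    }

    // Joins the text of all elements with the given tag names. The " | "
    // separator keeps a phrase from matching across two elements.
    static String collect(Document dom, String... tags) {
        StringBuilder sb = new StringBuilder();
        for (String tag : tags) {
            NodeList nodes = dom.getElementsByTagName(tag);
            for (int i = 0; i < nodes.getLength(); i++) {
                sb.append(normalize(nodes.item(i).getTextContent())).append(" | ");
            }
        }
        return sb.toString();
    }

    // Crude stand-in for a phrase query: case-insensitive substring match.
    static List<String> matchingFields(Map<String, String> fields, String phrase) {
        String needle = normalize(phrase);
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            if (e.getValue().contains(needle)) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    static String normalize(String s) {
        return s.trim().replaceAll("\\s+", " ").toLowerCase();
    }
}
```

With this layout a phrase always hits "content" if it occurs at all, and
an additional hit in "headlines" or "hrefs" tells me where it came from.
In Lucene I would presumably run the same query once per field instead
of the substring test.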

In any case, I need to retokenize/reparse the original document to
figure out whether the matched terms belong to a specific tag.
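
Roughly what I have in mind for the reparse step is sketched below. It
is only an illustration under some assumptions: the JDK's XML parser
stands in for JTidy (so the markup must be well-formed, with lowercase
tags), it handles a single term rather than a phrase that spans several
nodes, and the class and method names are invented.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper: after a hit, walk the DOM again and report the
// enclosing tag of every text node that contains the matched term.
class TagLocatorSketch {

    // Tags I care about; anything else is reported as "other".
    static final Set<String> TAGS = new HashSet<>(
            Arrays.asList("a", "h1", "h2", "h3", "h4", "h5", "h6", "p"));

    static List<String> tagsContaining(String xhtml, String term) {
        Document dom;
        try {
            dom = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("not well-formed XHTML", e);
        }
        List<String> found = new ArrayList<>();
        walk(dom.getDocumentElement(), term.toLowerCase(), found);
        return found;
    }

    // Depth-first walk in document order; only text nodes are tested.
    static void walk(Node node, String term, List<String> found) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            if (node.getNodeValue().toLowerCase().contains(term)) {
                found.add(enclosingTag(node));
            }
            return;
        }
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            walk(c, term, found);
        }
    }

    // Nearest ancestor that is one of the interesting tags.
    static String enclosingTag(Node textNode) {
        for (Node n = textNode.getParentNode(); n != null; n = n.getParentNode()) {
            if (n.getNodeType() == Node.ELEMENT_NODE
                    && TAGS.contains(n.getNodeName().toLowerCase())) {
                return n.getNodeName().toLowerCase();
            }
        }
        return "other";
    }
}
```

The terms to look for would come from the query that produced the hit,
so this runs only on documents Lucene has already matched.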

thanks
Ramesh

----- Original Message -----
From: "Lukas Zapletal" <[EMAIL PROTECTED]>
To: "Lucene Users List" <[EMAIL PROTECTED]>
Sent: Wednesday, January 22, 2003 2:35 AM
Subject: Re: Correlating matched terms with Document


> Hello
>
> >I have a strange requirement. I am indexing a single HTML Document and
> >searching it immediately for one or more keywords (Boolean/Phrase query).
> >When the keywords are found in the document, I would like to
> >know if the matched keywords are from hyperlink text, a paragraph or one
> >of <h1>, <h2> etc tags.
> >
> I suggest using JTidy when parsing the document. It supports the DOM
> XML model, so you can easily extract all hyperlinks or headlines. The
> great thing is that even BAD documents are cleaned and repaired before
> DOM parsing.
>
> >a) I cannot add multiple fields as I need to do "Phrase" query.
> >
> Why not split it into fields? It should work...
>
> >b) During the tokenization, I know exactly if a particular token is from
> >a specific tag. Can this be stored in
> >the index as some user-defined flags or something like that and later
> >retrieve it. Looking at the API, it doesn't seem to be possible.
> >I see that I can associate token type (such as "word", "eol" ) with the
> >analyzer token, but this is not stored in the index.
> >
> Change parser. See JTidy above.
>
> >c) One option seems to be to re-tokenize the document after search - like
> >some of the highlight summary examples are doing.  Then
> >I can match the document tokens with the terms.
> >
> >
> I suggest creating these fields:
>
> content: all content with links and headlines
> headlines: only headlines
> hrefs: only hrefs etc
>
> When you search for a phrase, it will match at least content. When
> there are also hits from headlines, you know it is a headline. Am I
> right?
>
> --
> Lukas Zapletal      [[EMAIL PROTECTED]]
> http://www.tanecni-olomouc.cz/lzap
>
> --
> To unsubscribe, e-mail: <mailto:[EMAIL PROTECTED]>
> For additional commands, e-mail: <mailto:[EMAIL PROTECTED]>