Hello

I have a strange requirement. I am indexing a single HTML document and
searching it immediately for one or more keywords (Boolean/phrase query).
When the keywords are found in the document, I would like to know whether
the matched keywords come from hyperlink text, a paragraph, or one of the
<h1>, <h2>, etc. tags.
I suggest using JTidy to parse the document. It exposes the document as a DOM tree, so you can easily extract all hyperlinks or headlines. Another benefit is that malformed documents are cleaned and repaired before DOM parsing.
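
To illustrate the DOM approach, here is a minimal sketch that collects the text of every element with a given tag name. It uses only the JDK's XML parser, so it needs well-formed XHTML; for real-world HTML the Document would presumably come from JTidy instead (I believe via Tidy.parseDOM, but check the JTidy docs). The class and method names (TagTextExtractor, parse, textOf) are made up for this example.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper: collects the text found inside elements of one tag.
class TagTextExtractor {

    // Parses well-formed XHTML with the JDK parser. Assumption: for messy
    // HTML, a tidying parser such as JTidy would supply the Document instead.
    static Document parse(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(
                            xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    // Returns the trimmed text of every <tagName> element, in document order.
    static List<String> textOf(Document doc, String tagName) {
        List<String> result = new ArrayList<>();
        NodeList nodes = doc.getElementsByTagName(tagName);
        for (int i = 0; i < nodes.getLength(); i++) {
            StringBuilder text = new StringBuilder();
            collectText(nodes.item(i), text);
            result.add(text.toString().trim());
        }
        return result;
    }

    // Walks the subtree and appends text nodes. This sticks to basic DOM
    // calls, since not every DOM implementation supports getTextContent().
    private static void collectText(Node node, StringBuilder out) {
        if (node.getNodeType() == Node.TEXT_NODE) {
            out.append(node.getNodeValue());
        }
        NodeList children = node.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            collectText(children.item(i), out);
        }
    }
}
```

Calling textOf(doc, "a") gives the hyperlink texts and textOf(doc, "h1") the top-level headlines; each list can then be indexed separately.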

a) I cannot add multiple fields, as I need to run a "phrase" query.

Why not split it into fields? It should work...

b) During tokenization, I know exactly whether a particular token comes
from a specific tag. Can this be stored in the index as user-defined flags
or something similar and retrieved later? Looking at the API, it doesn't
seem to be possible. I see that I can associate a token type (such as
"word" or "eol") with the analyzer token, but this is not stored in the
index.

Change the parser. See JTidy above.

c) One option seems to be to re-tokenize the document after the search, as
some of the highlight summary examples do. Then I can match the document
tokens against the terms.

I suggest creating these fields:

content: all content with links and headlines
headlines: only headlines
hrefs: only hrefs etc

When you search for a phrase, it will match at least the content field. If there are also hits in headlines, you know the match comes from a headline. Am I right?
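
The idea can be sketched in plain Java without any Lucene calls: run the same phrase against each field and report which fields contain it. Here a naive token-sequence match stands in for a real phrase query, and the tokenizer is only a rough stand-in for an analyzer. The class and method names (FieldPhraseMatcher, matchingFields) are made up for this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: finds the fields in which a phrase occurs.
class FieldPhraseMatcher {

    // Lowercases and splits on runs of non-letter, non-digit characters.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // True if the phrase tokens appear consecutively, in order, in the text.
    static boolean containsPhrase(String text, String phrase) {
        List<String> needle = tokenize(phrase);
        if (needle.isEmpty()) {
            return false;
        }
        return Collections.indexOfSubList(tokenize(text), needle) >= 0;
    }

    // Returns the names of all fields containing the phrase, in map order.
    static List<String> matchingFields(Map<String, String> fields, String phrase) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (containsPhrase(field.getValue(), phrase)) {
                hits.add(field.getKey());
            }
        }
        return hits;
    }
}
```

With a content field holding the full text and a headlines field holding only the headlines, a phrase that appears in a headline is reported for both fields, while a phrase from a paragraph is reported for content only.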

--
Lukas Zapletal [[EMAIL PROTECTED]]
http://www.tanecni-olomouc.cz/lzap