It depends on what you want to do. If you want to highlight a simple
XML doc, the approach would be to extract all of the text nodes, run
them through the highlighter, and then write the highlighted text back
into the document. That is mostly simple DOM manipulation. The same
approach should work with any format, but modifying the text may get
harder. If you can pull the text out appropriately, though, it seems
you could put it back in, or modify it in place as you might with the
DOM.
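As a rough illustration of the DOM approach, here is a minimal sketch in
Python using the standard library's xml.dom.minidom. It is not Lucene's
highlighter: a plain case-insensitive regex stands in for it, and the
function name highlight_terms and the <hit> wrapper element are made up
for this example.

```python
import re
from xml.dom import minidom


def highlight_terms(xml_text, terms, tag="hit"):
    """Wrap each whole-word occurrence of a search term in <tag>...</tag>.

    Hypothetical helper: a regex stands in for a real highlighter.
    """
    doc = minidom.parseString(xml_text)
    if not terms:
        return doc.documentElement.toxml()
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(t) for t in terms) + r")\b",
        re.IGNORECASE,
    )

    # Collect the text nodes first so the tree is not modified mid-walk.
    text_nodes = []
    stack = [doc.documentElement]
    while stack:
        node = stack.pop()
        for child in node.childNodes:
            if child.nodeType == child.TEXT_NODE:
                text_nodes.append(child)
            elif child.nodeType == child.ELEMENT_NODE:
                stack.append(child)

    for node in text_nodes:
        text = node.data
        if not pattern.search(text):
            continue
        parent = node.parentNode
        pos = 0
        for m in pattern.finditer(text):
            # Unmatched text before the hit stays a plain text node.
            if m.start() > pos:
                parent.insertBefore(doc.createTextNode(text[pos:m.start()]), node)
            hit = doc.createElement(tag)
            hit.appendChild(doc.createTextNode(m.group(0)))
            parent.insertBefore(hit, node)
            pos = m.end()
        # Trailing text after the last hit.
        if pos < len(text):
            parent.insertBefore(doc.createTextNode(text[pos:]), node)
        parent.removeChild(node)

    return doc.documentElement.toxml()


if __name__ == "__main__":
    print(highlight_terms("<doc><p>Lucene indexes text.</p></doc>", ["lucene", "text"]))
```

A term that straddles two elements (for example, one half inside <b>)
would not be found this way, since each text node is searched on its own.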
- Mark
Oystein Reigem wrote:
Hi,
I want to implement fulltext search on a collection of documents. I
am trying to figure out which system is the better choice: eXist,
Lucene, or some combination of the two. I have some knowledge of
eXist, but don't know much about Lucene.
I'd like to display the result of a search as a list of
excerpts/snippets with highlighted search words. When the user clicks
an item in the result list to bring up the document in full, I'd like
to have search words highlighted in the full document as well.
The document collection is very diverse. There are pure text documents
and well-formed XML and HTML documents, but unfortunately also HTML
documents that are not quite well-formed, Word documents and PDFs.
Many of the formats go beyond what eXist and Lucene can handle, and I
realise some conversion, or text extraction, is necessary. As far as I
know Lucene can only index and search pure text (and fields), so the
documents must be run through appropriate filters extracting the text
(and field values). Afterwards fulltext search is possible.
But what about highlighting? I know it is possible to get highlighting
in the pure text version, but what about the original document, when
the original document is something other than pure text, e.g., a simple
XML document? Is it at all possible to get the search words tagged in
the XML document?
I assume not, but ask anyway. :-)
Cheers,
- Øystein -