Mark Miller wrote:

Depends on the work you want to do. If you want to highlight a simple XML doc the approach would be to extract all of the text elements and run them through the highlighter and then correctly update them. That would be mostly simple DOM manipulation.
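
A minimal sketch of the DOM approach described above, using only the JDK's built-in XML classes. The class name `DomHighlighter`, the `<hit>` wrapper element, and the plain case-insensitive substring match are assumptions made for illustration; in a real setup the matching would come from the search engine's own analyzer and highlighter rather than from a substring scan.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class DomHighlighter {

    /** Wraps every case-insensitive occurrence of term in a (hypothetical) <hit> element. */
    public static String highlight(String xml, String term) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            // Collect the text nodes first so the tree is not modified while walking it.
            List<Text> textNodes = new ArrayList<>();
            collectText(doc.getDocumentElement(), textNodes);
            for (Text t : textNodes) {
                wrapMatches(doc, t, term);
            }

            Transformer tf = TransformerFactory.newInstance().newTransformer();
            tf.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            tf.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not process document", e);
        }
    }

    private static void collectText(Node node, List<Text> out) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                out.add((Text) c);
            } else {
                collectText(c, out);
            }
        }
    }

    private static void wrapMatches(Document doc, Text text, String term) {
        if (term.isEmpty()) {
            return;
        }
        Text rest = text;
        int idx;
        while ((idx = indexOfIgnoreCase(rest.getData(), term)) >= 0) {
            Text match = rest.splitText(idx);             // match = term + everything after it
            Text after = match.splitText(term.length());  // match = term only, after = remainder
            Element hit = doc.createElement("hit");
            match.getParentNode().replaceChild(hit, match);
            hit.appendChild(match);
            rest = after;                                  // keep searching in the remainder
        }
    }

    private static int indexOfIgnoreCase(String s, String term) {
        for (int i = 0; i + term.length() <= s.length(); i++) {
            if (s.regionMatches(true, i, term, 0, term.length())) {
                return i;
            }
        }
        return -1;
    }
}
```

For example, `DomHighlighter.highlight("<doc><p>Lucene indexes text. Try lucene.</p></doc>", "lucene")` should return a document in which both occurrences are wrapped as `<hit>Lucene</hit>` and `<hit>lucene</hit>`. Note that this only finds words that sit inside a single text node.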

OK.

I guess there will be some details that need special attention. One case that springs to mind is the occurrence of words that in the original document are broken up by encoding, like "en<hyphen/>coding" or "<em>mid</em>range".
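
One possible way to handle such split words, sketched here under stated assumptions: concatenate all text nodes into one flat string, remember where each node starts in it, search the flat string, and then map the match back to the nodes it touches. The class name `SplitWordFinder` and the `"parentTag:from-to"` result format are hypothetical. The sketch also assumes that every element is inline; block-level boundaries (such as adjacent paragraphs) would need a separator so that unrelated words are not glued together.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

public class SplitWordFinder {

    /** A text node plus the offset at which its content starts in the flattened text. */
    private static final class Segment {
        final Text node;
        final int start;
        Segment(Text node, int start) {
            this.node = node;
            this.start = start;
        }
    }

    /**
     * Finds the first occurrence of word in the flattened text and returns one entry
     * "parentTag:from-to" per text node the match touches (offsets are node-local).
     */
    public static List<String> locate(String xml, String word) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse document", e);
        }

        StringBuilder flat = new StringBuilder();
        List<Segment> segments = new ArrayList<>();
        flatten(doc.getDocumentElement(), flat, segments);

        List<String> pieces = new ArrayList<>();
        int hit = flat.indexOf(word);
        if (hit < 0 || word.isEmpty()) {
            return pieces;
        }
        int end = hit + word.length();
        for (Segment s : segments) {
            // Intersect the match range with this node's range in the flat text.
            int from = Math.max(hit, s.start);
            int to = Math.min(end, s.start + s.node.getLength());
            if (from < to) {
                pieces.add(s.node.getParentNode().getNodeName()
                        + ":" + (from - s.start) + "-" + (to - s.start));
            }
        }
        return pieces;
    }

    private static void flatten(Node node, StringBuilder flat, List<Segment> segments) {
        for (Node c = node.getFirstChild(); c != null; c = c.getNextSibling()) {
            if (c.getNodeType() == Node.TEXT_NODE) {
                segments.add(new Segment((Text) c, flat.length()));
                flat.append(((Text) c).getData());
            } else {
                flatten(c, flat, segments);
            }
        }
    }
}
```

With the input `<p>the <em>mid</em>range model</p>` and the word `midrange`, `locate` returns `[em:0-3, p:0-5]`: the match covers all of the text inside `<em>` and the first five characters of the text node that follows it. Each piece can then be wrapped separately, which keeps the markup well-formed.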

The same approach should work with any format but the difficulty in modifying the text may increase. If you can pull the text out appropriately it would seem you could put it back in though, or modify it in place as you might with the DOM.
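
The "pull the text out and put it back in" idea can be sketched without a DOM by keeping, for each extracted character, its offset in the original. The class below is a deliberately naive illustration, not an existing Lucene or POI facility: `OffsetTrackingExtractor` and the `<b>` markers are made-up names, and the tag stripper ignores entities, comments, CDATA, and `<` or `>` inside attribute values.

```java
import java.util.ArrayList;
import java.util.List;

public class OffsetTrackingExtractor {
    private final String original;
    private final StringBuilder text = new StringBuilder();
    // origin.get(i) is the offset in the original of the i-th extracted character.
    private final List<Integer> origin = new ArrayList<>();

    public OffsetTrackingExtractor(String original) {
        this.original = original;
        boolean inTag = false;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '<') {
                inTag = true;
            } else if (c == '>' && inTag) {
                inTag = false;
            } else if (!inTag) {
                text.append(c);
                origin.add(i);
            }
        }
    }

    /** The extracted plain text, i.e. what would be handed to the indexer. */
    public String text() {
        return text.toString();
    }

    /** Marks the first occurrence of term in the original, one <b>...</b> pair per contiguous run. */
    public String highlight(String term) {
        int hit = text.indexOf(term);
        if (term.isEmpty() || hit < 0) {
            return original;
        }
        StringBuilder out = new StringBuilder(original);
        // Walk the match backwards so that earlier offsets stay valid while inserting.
        int i = hit + term.length() - 1;
        while (i >= hit) {
            int runEnd = origin.get(i);
            int j = i;
            while (j > hit && origin.get(j - 1) == origin.get(j) - 1) {
                j--; // still inside the same uninterrupted stretch of original text
            }
            int runStart = origin.get(j);
            out.insert(runEnd + 1, "</b>");
            out.insert(runStart, "<b>");
            i = j - 1;
        }
        return out.toString();
    }
}
```

For `<p>full <i>text</i> search</p>`, `text()` returns `full text search` and `highlight("text")` returns `<p>full <i><b>text</b></i> search</p>`. For `en<hyphen/>coding`, `highlight("encoding")` returns `<b>en</b><hyphen/><b>coding</b>`, because the match is split into two runs around the tag. The same offset-map idea should carry over to other formats, provided the extractor can report where each character came from.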

Do you know if tools (classes) for "appropriate" extraction from "my" file formats already exist in Lucene? I.e., something that not only extracts the text, but also keeps track of its position in the original?

I saw POI <http://jakarta.apache.org/poi/> mentioned in a posting on this list. Perhaps a solution for Word documents can be based on POI.

- Øystein -


- Mark

Oystein Reigem wrote:

Hi,

I want to implement fulltext search on a collection of documents. I am trying to figure out which system is the better choice: eXist, Lucene, or some combination of the two. I have some knowledge of eXist, but don't know much about Lucene.

I'd like to display the result of a search as a list of excerpts/snippets with highlighted search words. When the user clicks an item in the result list to bring up the document in full, I'd like to have search words highlighted in the full document as well.

The document collection is very diverse. There are pure text documents and well-formed XML and HTML documents, but unfortunately also HTML documents that are not quite well-formed, Word documents and PDFs. Many of the formats go beyond what eXist and Lucene can handle, and I realise some conversion, or text extraction, is necessary. As far as I know Lucene can only index and search pure text (and fields), so the documents must be run through appropriate filters extracting the text (and field values). Afterwards fulltext search is possible.

But what about highlighting? I know it is possible to get highlighting in the pure text version, but what about the original document, when the original is something other than pure text, e.g., a simple XML document? Is it at all possible to get the search words tagged in the XML document?

I assume not, but ask anyway. :-)

Cheers,

- Øystein -






--
Øystein Reigem, The department of culture, language and information technology (Aksis),
Allegt 27, N-5007 Bergen, Norway. Tel: +47 55 58 32 42. Fax: +47 55 58 94 70.
E-mail: <[EMAIL PROTECTED]>. Home tel: +47 56 14 06 11. Mobile: +47 97 16 96 64.
Home e-mail: <[EMAIL PROTECTED]>. Aksis home page: <www.aksis.uib.no>.
