This is part of the solution, and I will look into it. I agree that handling the MS Word format makes it more difficult.

The other part of the requirement (and maybe I am missing something) is this: when a match is found in a document, the hit is displayed, but as I understand it only the first hit for that document is shown. Is there a way (e.g. an option flag) to display all hits in a matched document, so the user can see every occurrence? They would then see (say) 5 summaries for that document, each with the search phrase highlighted.
The display of the search results would then go on to the next document.
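
To make the request concrete, here is a rough sketch of the behaviour I have in mind: one summary per occurrence rather than one per document. It is plain Java and does not use any Nutch or Lucene API; the class name AllHitSnippets, the method snippets, the [[ ]] markers and the context width are all made up for illustration, and it assumes the document text is already available as a String.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: collects one snippet per occurrence.
final class AllHitSnippets {

    /**
     * Returns one snippet for every case-insensitive occurrence of term in text.
     * Each snippet holds up to `context` characters on either side of the match,
     * and the match itself is wrapped in [[ ]] as a stand-in for highlighting.
     */
    static List<String> snippets(String text, String term, int context) {
        List<String> result = new ArrayList<>();
        if (term.isEmpty()) {
            return result; // an empty term would match at every position
        }
        // Pattern.quote treats the term literally, so "a.b" does not act as a regex.
        Pattern pattern = Pattern.compile(
                Pattern.quote(term),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            int start = Math.max(0, matcher.start() - context);
            int stop = Math.min(text.length(), matcher.end() + context);
            result.add(text.substring(start, matcher.start())
                    + "[[" + matcher.group() + "]]"
                    + text.substring(matcher.end(), stop));
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "The policy applies. See Policy section 2 for policy details.";
        for (String snippet : snippets(text, "policy", 4)) {
            System.out.println(snippet);
        }
    }
}
```

A real version would probably cut snippets at word or sentence boundaries and merge overlapping ones, but the loop over every match is the part I am asking about.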

Thanks

John

Stefan Groschupf wrote:

Hi,
I'm not sure what you are looking for, but it sounds like keyword highlighting as it is available in the Google cache pages. Right? Since Nutch does not store the content itself in the index but in the segment, you can reuse the cache page. The problem might be that you have Word rather than HTML, so you have to load the parsed text from the segment and then code the highlighting into a viewer of the parsed text.
Does this give you an idea of what you need to do?
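
As a minimal sketch of the viewer step only: the code below assumes the parsed text has already been loaded from the segment into a String (that loading step is not shown, and no Nutch API is used here). The class name ParsedTextHighlighter and its methods are hypothetical. It escapes the plain text for HTML and wraps every occurrence of the term in <b> tags.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical viewer helper, not part of Nutch.
final class ParsedTextHighlighter {

    /** Escapes the characters that are significant in HTML text and attributes. */
    static String escapeHtml(String s) {
        StringBuilder out = new StringBuilder(s.length());
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                case '"': out.append("&quot;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }

    /**
     * Renders the parsed plain text as a <pre> block in which every
     * case-insensitive occurrence of term is wrapped in <b>...</b>.
     */
    static String toHighlightedHtml(String parsedText, String term) {
        if (term.isEmpty()) {
            return "<pre>" + escapeHtml(parsedText) + "</pre>";
        }
        Pattern pattern = Pattern.compile(
                Pattern.quote(term),
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE);
        Matcher matcher = pattern.matcher(parsedText);
        StringBuilder html = new StringBuilder("<pre>");
        int last = 0;
        while (matcher.find()) {
            // Escape the text between matches separately so the <b> tags stay intact.
            html.append(escapeHtml(parsedText.substring(last, matcher.start())));
            html.append("<b>").append(escapeHtml(matcher.group())).append("</b>");
            last = matcher.end();
        }
        html.append(escapeHtml(parsedText.substring(last)));
        return html.append("</pre>").toString();
    }

    public static void main(String[] args) {
        System.out.println(toHighlightedHtml("Policy A & policy <B>", "policy"));
    }
}
```

Escaping before inserting the tags matters because the parsed text of a Word document can contain characters such as < or & that would otherwise break the page.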

Stefan

On 16.12.2005 at 03:58, John Reidy wrote:

Hi all.
I have a requirement to build an intranet style search engine for a small (<500) set of large Word and PDF documents. What is needed is for all hits (together with the context) of the search phrase in the documents to be returned.

As an example, if the search term is "policy" and the "operations manual" is searched, there might be several hits in different sections of the document that match "policy", and all of them would be displayed for the user.

This may be a question better answered on the Lucene lists; however, at this stage I am looking at the Nutch code and hoping there is a fairly high-level solution.

Regards

John Reidy.





_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
