This is part of the solution and I will look into that. I agree,
handling MS Word format makes it more difficult.
The other part of the requirement (and maybe I am missing something) is
this: when a document matches, the hit is displayed, but as I understand
it only the first hit for that document is shown. Is there a way (e.g. an
option flag) to display all hits in a matched document, so the user can
see every occurrence? They would then see (say) 5 summaries, each with
the search phrase highlighted.
The display of the search results would then go on to the next document.
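To make the request concrete, here is a minimal sketch of the behaviour I
have in mind, written in Python for brevity rather than against the Nutch
code. The function name all_snippets, the context width and the <b> markers
are all my own assumptions, and the text is assumed to be already extracted
from the Word or PDF file. It returns one summary per occurrence instead of
stopping at the first hit:

```python
import re


def all_snippets(text, phrase, context=40, open_tag="<b>", close_tag="</b>"):
    """Return one highlighted snippet per occurrence of phrase in text.

    Hypothetical helper, not part of Nutch or Lucene. `context` is the
    number of characters kept on each side of a hit.
    """
    # Escape the phrase so it is matched literally, ignoring case.
    pattern = re.compile(re.escape(phrase), re.IGNORECASE)
    snippets = []
    for match in pattern.finditer(text):
        start = max(0, match.start() - context)
        end = min(len(text), match.end() + context)
        before = text[start:match.start()]
        after = text[match.end():end]
        snippets.append(
            ("..." if start > 0 else "")        # mark a truncated left edge
            + before + open_tag + match.group(0) + close_tag + after
            + ("..." if end < len(text) else "")  # mark a truncated right edge
        )
    return snippets


if __name__ == "__main__":
    manual = "The policy applies. See policy B."
    for snippet in all_snippets(manual, "policy", context=4):
        print(snippet)
```

The results page would print every snippet for one document and only then
move on to the next document.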
Thanks
John
Stefan Groschupf wrote:
Hi,
I'm not sure what you are looking for, but it sounds like the keyword
highlighting that is available in the Google cache pages. Right?
Since Nutch does not store the content itself in the index but in the
segment, you can reuse the cache page.
The problem might be that you have Word rather than HTML, so you have to
load the parsed text from the segment and then code the highlighting
into a viewer of the parsed text.
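As a rough illustration of the highlighting step only, here is a small
Python sketch. It does not use any Nutch API: the parsed text is assumed to
have been loaded from the segment already, and the name
highlight_parsed_text and the choice of <b> tags are hypothetical. It
escapes the text for an HTML viewer and wraps every occurrence of the
search terms:

```python
import html
import re


def highlight_parsed_text(parsed_text, terms):
    """Return HTML with every occurrence of any term wrapped in <b>.

    Hypothetical helper: `parsed_text` is plain text assumed to be already
    loaded from the segment, `terms` is a list of search terms.
    """
    terms = [t for t in terms if t]
    if not terms:
        return html.escape(parsed_text)
    # Put longer terms first so a phrase is preferred over a word inside it.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile("|".join(re.escape(t) for t in ordered), re.IGNORECASE)
    pieces = []
    last = 0
    for m in pattern.finditer(parsed_text):
        # Escape the text between hits, then the hit itself inside the tags.
        pieces.append(html.escape(parsed_text[last:m.start()]))
        pieces.append("<b>" + html.escape(m.group(0)) + "</b>")
        last = m.end()
    pieces.append(html.escape(parsed_text[last:]))
    return "".join(pieces)


if __name__ == "__main__":
    print(highlight_parsed_text("Policy <draft> & policy", ["policy"]))
```

Escaping each piece separately keeps characters such as < and & from the
Word text from being interpreted as markup by the viewer.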
Does this give you an idea of what you need to do?
Stefan
On 16.12.2005 at 03:58, John Reidy wrote:
Hi all.
I have a requirement to build an intranet style search engine for a
small (<500) set of large Word and PDF documents.
What is needed is for all hits (together with the context) of the
search phrase in the documents to be returned.
As an example, if the search term is "policy" and the "operations
manual" is searched, there might be several hits in different
sections of the document that match "policy", and all of them would
be displayed for the user.
This may be a question better answered on the Lucene lists; however,
at this stage I am looking at the Nutch code and I am hoping there
is a fairly high level solution.
Regards
John Reidy.
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general