Hi all,

I'm trying to use Nutch for an intranet search. After much reading of the FAQs, wikis, and these lists, I have it working very well for JSP pages, with pretty decent quality results. I am, however, experiencing problems searching for PDF documents.

First, I note in the logs that a large number of PDF documents have been fetched, yet only two have been indexed, and indeed only these two appear in search results. The content limit is set high enough to allow these documents to be indexed, so I can't think why this should be.

Secondly, for those documents that ARE indexed, rather bizarrely, the document titles in the search results have a '.xls' extension. I can even search for all PDF documents just by using the query 'xls'. Note that this suffix is most definitely NOT in the actual titles of those files. I also chanced upon a site that seems to use Nutch (no affiliation, I just googled) and found the same problem:

http://www.bfm.bm/nutch?query=xls&Submit=Go

I don't see any output from the "more.jsp" include either. I'm not certain, as I've never seen it working, but I imagine it's meant to add a "[PDF]" chunk to the title.

Can someone explain why I'm having these problems?

Thanks very much,
George
