Hi all,

I'm trying to use Nutch for an intranet search. After much reading of the
FAQs, wikis, and these lists I have it working very well for JSP pages, with
pretty decent quality results. I am, however, experiencing problems searching
for PDF documents.

First, I note in the logs that a large number of PDF documents have been
fetched, and yet only two have been indexed, and indeed only these two
appear in search results. The content limit is set high enough to allow
these documents to be indexed, so I can't think why this should be.

Secondly, for those documents that ARE indexed, rather bizarrely, the
document titles in the search results have a '.xls' extension. I can even
search for all PDF documents just by using the query 'xls'. Note that this
suffix is most definitely NOT in the actual title of those files. I also
chanced upon a site that seems to use Nutch (no affiliation, I just googled)
and found the same problem...

http://www.bfm.bm/nutch?query=xls&Submit=Go

I don't see any output from the "more.jsp" include either. I'm not certain,
as I've never seen it working, but I imagine it's meant to add a "[PDF]"
chunk to the title.

Can someone explain why I'm having these problems?

Thanks very much,
George
-- 
View this message in context: 
http://www.nabble.com/PDF-problems%2C-inc.-documents-returned-with-XLS-extension-tf4671286.html#a13344771
Sent from the Nutch - User mailing list archive at Nabble.com.
