Sami Siren-2 wrote:
>
> George Weller wrote:
>> Hi all,
>>
>> First I note in the logs that a large number of PDF documents have been
>> fetched, and yet only two have been indexed, and indeed only these two
>> appear in search results. The content limit is set high enough to allow
>> these documents to be indexed, so I can't think why this should be.
>
> Are there any related errors on log?
>
>> Secondly for those documents that ARE indexed, rather bizarrely, the
>> document titles in the search results have a '.xls' extension. I can even
>> search for all PDF document just by using the query 'xls'. Note that this
>> suffix is most definitely NOT in the actual title of those files. I also
>> chanced upon a site that seems to use Nutch (no affiliation- I just googled)
>> and found the same problem...
>
> In the examples from your site the title is extracted from the pdf
> metadata so it just uses the title stored within the pdf doc.
>
> --
> Sami Siren
>

Thanks for the reply.
Yes, you're absolutely right! I did a sample crawl on our production server and noticed that it also returns some PDFs with ".doc" in the title. I can now see that this is due to whatever software was used to convert the XLS or DOC documents to PDF in the first place, which evidently wrote the original file name into the PDF's title metadata.

I couldn't spot any other errors in the log, but I think I managed to solve the other problem too. I had the content limit set to around 1.6 MB IIRC, which, after a quick survey of common documents, I concluded would be enough to allow indexing of the main docs that people would search for (most of which were a couple of hundred kilobytes). It seems that it wasn't enough. I have now set it to be unlimited (i.e. -1), and I'm getting proper results.

Now I just need to find out what "more.jsp" does, and how to get it going... Back to the wiki, I think!

Thanks again,
George

--
View this message in context: http://www.nabble.com/PDF-problems%2C-inc.-documents-returned-with-XLS-extension-tf4671286.html#a13381606
Sent from the Nutch - User mailing list archive at Nabble.com.
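
For anyone repeating the "quick survey" of document sizes mentioned above, here is a minimal sketch in Python, assuming the documents are available in a local directory. The function names and the directory layout are hypothetical and not part of Nutch. The thread does not name the property that was changed; for HTTP fetches it is presumably `http.content.limit` in nutch-site.xml, where a negative value disables truncation.

```python
from __future__ import annotations

from pathlib import Path

UNLIMITED = -1  # a negative limit means "never truncate", as in the -1 setting above


def exceeds_limit(size_bytes: int, limit_bytes: int) -> bool:
    """Return True if a document of this size would be cut off at the limit."""
    if limit_bytes < 0:
        return False  # unlimited: nothing is truncated
    return size_bytes > limit_bytes


def survey(root: str, limit_bytes: int, suffix: str = ".pdf") -> list[tuple[str, int]]:
    """List (path, size) for files under root that exceed the limit, largest first."""
    oversized = []
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix.lower() == suffix:
            size = path.stat().st_size
            if exceeds_limit(size, limit_bytes):
                oversized.append((str(path), size))
    return sorted(oversized, key=lambda item: item[1], reverse=True)
```

Calling `survey("docs", 1_600_000)` on a hypothetical `docs` directory would list the PDFs larger than roughly 1.6 MB. Any file that shows up is a candidate for being fetched but not indexed, since a truncated PDF will usually fail to parse.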
