George Weller wrote:
> Hi all,
>
> First I note in the logs that a large number of PDF documents have been
> fetched, and yet only two have been indexed, and indeed only these two
> appear in search results. The content limit is set high enough to allow
> these documents to be indexed, so I can't think why this should be.

Are there any related errors in the log?

> Secondly for those documents that ARE indexed, rather bizarrely, the
> document titles in the search results have a '.xls' extension. I can even
> search for all PDF document just by using the query 'xls'. Note that this
> suffix is most definitely NOT in the actual title of those files. I also
> chanced upon a site that seems to use Nutch (no affiliation- I just googled)
> and found the same problem...

In the examples from your site, the title is extracted from the PDF
metadata, so the search result simply shows the title stored within the
PDF document itself.

--
Sami Siren
