Sami Siren-2 wrote:
> 
> George Weller wrote:
>> Hi all,
>> 
>> First I note in the logs that a large number of PDF documents have been
>> fetched, and yet only two have been indexed, and indeed only these two
>> appear in search results. The content limit is set high enough to allow
>> these documents to be indexed, so I can't think why this should be.
> 
> Are there any related errors in the log?
> 
>> Secondly for those documents that ARE indexed, rather bizarrely, the
>> document titles in the search results have a '.xls' extension. I can even
>> search for all PDF documents just by using the query 'xls'. Note that this
>> suffix is most definitely NOT in the actual title of those files. I also
>> chanced upon a site that seems to use Nutch (no affiliation, I just
>> googled) and found the same problem...
> 
> In the examples from your site the title is extracted from the pdf
> metadata so it just uses the title stored within the pdf doc.
> 
> -- 
>  Sami Siren
> 
> 
Thanks for the reply.

Yes, you're absolutely right! I did a sample crawl on our production server
and noticed that it also returns some PDFs with ".doc" in the title. I can
now see that this is due to whatever software was used to convert the XLS or
DOC documents to PDF in the first place: presumably it stores the source
file name as the title in the PDF metadata.

I couldn't spot any other errors in the log, but I think I managed to solve
the other problem too. I had the content limit set to around 1.6MB IIRC,
which, after a quick survey of common documents, I concluded would be enough
to allow indexing of the main docs that people would search for (most of
which are a couple of hundred kilobytes), but it seems that it wasn't enough.
I have now set it to unlimited (i.e. -1), and I'm getting proper results.
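
For anyone who hits the same thing, the change looks roughly like the
fragment below in conf/nutch-site.xml. This is only a sketch: it assumes the
PDFs are fetched over HTTP, in which case the relevant property should be
http.content.limit (for the file protocol the counterpart would be
file.content.limit). Check nutch-default.xml for the exact property names in
your version.

```xml
<!-- conf/nutch-site.xml: values here override nutch-default.xml -->
<configuration>
  <property>
    <!-- Assumption: documents are fetched over HTTP -->
    <name>http.content.limit</name>
    <!-- A negative value means fetched content is not truncated -->
    <value>-1</value>
  </property>
</configuration>
```

The documents need to be re-fetched after the change, since content that was
already truncated during an earlier crawl stays truncated.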

Now I just need to find out what "more.jsp" does, and how to get it going...
Back to the wiki I think!

Thanks again,
George
-- 
View this message in context: 
http://www.nabble.com/PDF-problems%2C-inc.-documents-returned-with-XLS-extension-tf4671286.html#a13381606
Sent from the Nutch - User mailing list archive at Nabble.com.
