Hello Alexander,
In my nutch-site.xml, both file.content.limit and http.content.limit are set
to -1 (meaning no limit).
This works fine for MS Word documents.
Do you think another parameter has to be set for PDF files?
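
For reference, here is a minimal sketch of what those two overrides would look like, assuming the usual Hadoop-style <property> layout of nutch-site.xml. It is not a copy of the real file, which may contain other properties, and the comments are illustrative only:

```xml
<?xml version="1.0"?>
<!-- Sketch only: assumed shape of the two overrides described above. -->
<configuration>
  <!-- Content limit for documents fetched through the file protocol; -1 means no truncation. -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Content limit for documents fetched over HTTP; -1 means no truncation. -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```

The failing URL quoted below is an http:// one, so http.content.limit would be the setting that applies to it.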



Alexander Aristov wrote:
> 
> I suspect that Nutch has not downloaded the full PDF. There is a setting in
> the Nutch config file to truncate large files. It's efficient for HTML but
> might cause such errors for other formats.
> 
> Check this setting and adjust accordingly.
> 
> Alexander
> 
> 2008/10/29 olivier_coface <[EMAIL PROTECTED]>
> 
>>
>> I got the following error when crawling PDF files (it happened on 2 PDF
>> files):
>>
>> http://lyra:85/ExternalDocumentation/BusinessComponentApproach_Chapter2.pdf
>> :
>> failed(2,0): Can't be handled as pdf document. java.io.EOFException:
>> Unexpected end of ZLIB input stream
>>
>> Any idea?
>> --
>> View this message in context:
>> http://www.nabble.com/Unexpected-end-of-ZLIB-input-stream-when-parsing-pdf-files-tp20223893p20223893.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>>
>>
> 
> 
> -- 
> Best Regards
> Alexander Aristov
> 
> 
