Hello Alexander,

In my nutch-site.xml, both file.content.limit and http.content.limit are set to -1 (meaning no limit). This works fine for MS Word documents. Do you think another parameter has to be set for PDF files?
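
For reference, the two settings described above would look roughly like this in nutch-site.xml. This is a sketch of the relevant fragment only; the rest of the file is omitted, and -1 is the "no limit" value mentioned above.

```xml
<configuration>
  <!-- Maximum size of content fetched via the file: protocol; -1 = no truncation -->
  <property>
    <name>file.content.limit</name>
    <value>-1</value>
  </property>
  <!-- Maximum size of content fetched via HTTP; -1 = no truncation -->
  <property>
    <name>http.content.limit</name>
    <value>-1</value>
  </property>
</configuration>
```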

Alexander Aristov wrote:
>
> I suspect that Nutch has not downloaded full pdf. There is a setting in the
> nutch config file to truncate large files. It's efficient for html but might
> cause such errors for other formats.
>
> Check this setting and adjust accordingly.
>
> Alexander
>
> 2008/10/29 olivier_coface <[EMAIL PROTECTED]>
>
>> I had the following error when crawling on pdf files (it happened on 2 pdf
>> files):
>>
>> http://lyra:85/ExternalDocumentation/BusinessComponentApproach_Chapter2.pdf :
>> failed(2,0): Can't be handled as pdf document. java.io.EOFException:
>> Unexpected end of ZLIB input stream
>>
>> Any idea?
>> --
>> View this message in context:
>> http://www.nabble.com/Unexpected-end-of-ZLIB-input-stream-when-parsing-pdf-files-tp20223893p20223893.html
>> Sent from the Nutch - User mailing list archive at Nabble.com.
>
> --
> Best Regards
> Alexander Aristov

--
View this message in context: http://www.nabble.com/Unexpected-end-of-ZLIB-input-stream-when-parsing-pdf-files-tp20223893p20224816.html
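
The truncation hypothesis quoted above can be checked outside Nutch. The sketch below is not Nutch or PDF parser code; the class name `TruncatedZlibDemo` and its helper methods are made up for illustration. It compresses some bytes into a zlib stream (the format used by compressed PDF streams), cuts the result in half, and tries to inflate it with the standard `java.util.zip` classes. Inflating the cut-off copy fails with a `java.io.EOFException`, which on OpenJDK carries the same "Unexpected end of ZLIB input stream" message as the crawl log.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.util.Arrays;
import java.util.zip.DeflaterOutputStream;
import java.util.zip.InflaterInputStream;

public class TruncatedZlibDemo {

    // Compress bytes into a zlib stream using the default Deflater settings.
    public static byte[] compress(byte[] data) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DeflaterOutputStream out = new DeflaterOutputStream(buffer)) {
            out.write(data);
        }
        return buffer.toByteArray();
    }

    // Inflate a zlib stream completely; throws if the input ends too early.
    public static byte[] decompress(byte[] compressed) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (InputStream in =
                 new InflaterInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[4096];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        }
        return buffer.toByteArray();
    }

    // True if inflating the given bytes fails with an EOFException.
    public static boolean failsWithEof(byte[] compressed) {
        try {
            decompress(compressed);
            return false;
        } catch (EOFException e) {
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the content of a compressed PDF stream.
        byte[] original = new byte[5000];
        Arrays.fill(original, (byte) 'x');

        byte[] complete = compress(original);
        // Simulate a fetch that stopped early, e.g. because of a content limit.
        byte[] truncated = Arrays.copyOf(complete, complete.length / 2);

        System.out.println("complete stream fails:  " + failsWithEof(complete));
        System.out.println("truncated stream fails: " + failsWithEof(truncated));
    }
}
```

If the two failing PDF files parse correctly when opened directly but fail after being fetched by the crawler, comparing the fetched size with the size on the server would be one way to confirm or rule out truncation.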
