There is a check in these two parsers that compares the content length
reported by the HTTP server in the "Content-Length" header with the length
of the raw byte array of content. The code is in both files and looks like:

if (contentLength != null
        && raw.length != Integer.parseInt(contentLength)){
        throw new ParseException("Content truncated at " + raw.length
                + " bytes. Parser can't handle incomplete pdf file.");
}

Whilst this is OK in most circumstances, it is not OK when the HTTP stream
has been compressed. In that case the Content-Length is the length of the
compressed data and raw.length is the length of the uncompressed data.
At least, that is my understanding of it and what I'm seeing in the wild.
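
To make the mismatch concrete, here is a minimal standalone sketch (not
Nutch code; the class and method names are made up for illustration). It
gzips a body in memory, treats the compressed size as the Content-Length
header value, and applies the same comparison to the uncompressed bytes:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class, not part of Nutch.
class GzipLengthDemo {

    // Gzip-compress a byte array in memory, roughly what a server sending
    // "Content-Encoding: gzip" does to the body before transmitting it.
    static byte[] gzip(byte[] data) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(data);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            sb.append("repetitive content ");
        }
        // What the parser sees after the protocol layer has decompressed.
        byte[] raw = sb.toString().getBytes(StandardCharsets.UTF_8);
        // What actually travelled over the wire.
        byte[] onTheWire = gzip(raw);
        // Assumption: the server reports the compressed size in the header.
        String contentLength = String.valueOf(onTheWire.length);

        System.out.println("Content-Length header: " + contentLength);
        System.out.println("raw.length:            " + raw.length);
        System.out.println("check fires:           "
                + (raw.length != Integer.parseInt(contentLength)));
    }
}
```

Nothing was truncated here, yet the comparison fails, which matches what
I am seeing with compressed responses.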

This doesn't affect HTML pages, as there is no check on the content length
for them because a partial page can still be indexed.

I have created a patch that works around the problem by not performing the
check if the content is compressed, but I'm now thinking it might be better
to have a flag on the Content object itself that states whether it has
been truncated or not. That way we can mark content as truncated even when
it comes from a compressed stream.
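
A rough sketch of the flag idea follows. Everything in it is hypothetical:
FetchedContent is a stand-in for the real Content class,
IllegalStateException stands in for ParseException, and I am assuming the
protocol layer still knows how many bytes arrived on the wire before
decompression, so that it is the right place to set the flag.

```java
// Hypothetical sketch, not the real Nutch Content API.
class TruncationFlagSketch {

    // Stand-in for Content: the decoded bytes plus a truncation flag.
    static final class FetchedContent {
        private final byte[] raw;
        private final boolean truncated;

        FetchedContent(byte[] raw, boolean truncated) {
            this.raw = raw;
            this.truncated = truncated;
        }

        byte[] getRaw() { return raw; }

        boolean isTruncated() { return truncated; }
    }

    // Fetch side: compare Content-Length with the bytes received on the
    // wire (still compressed, if compression was used), then hand the
    // decoded bytes on together with the result.
    static FetchedContent fromResponse(byte[] wireBytes, byte[] decodedBytes,
                                       String contentLength) {
        boolean truncated = contentLength != null
                && wireBytes.length != Integer.parseInt(contentLength.trim());
        return new FetchedContent(decodedBytes, truncated);
    }

    // Parser side: no header arithmetic, just ask the content object.
    static void requireComplete(FetchedContent content) {
        if (content.isTruncated()) {
            throw new IllegalStateException("Content truncated at "
                    + content.getRaw().length
                    + " bytes. Parser can't handle incomplete pdf file.");
        }
    }

    public static void main(String[] args) {
        // 10 compressed bytes announced and received, 40 after decoding.
        FetchedContent complete = fromResponse(new byte[10], new byte[40], "10");
        System.out.println("complete.isTruncated() = " + complete.isTruncated());

        // 10 bytes announced but only 7 arrived.
        FetchedContent cut = fromResponse(new byte[7], new byte[20], "10");
        System.out.println("cut.isTruncated() = " + cut.isTruncated());
    }
}
```

With something like this the pdf and msword parsers would no longer need
to look at the headers at all.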

I would implement this change and submit a patch, but I'm off on holiday
soon. If it hasn't been done by the time I get back I will go ahead and do
it, assuming that everyone agrees that this is the best way forward.

For what it's worth, the workaround changes the above code snippet to
(similarly for msword):

String contentEncoding = content.get("Content-Encoding");
// Only compare lengths when the stream was not gzip-compressed.
if (contentLength != null
        && raw.length != Integer.parseInt(contentLength)
        && (contentEncoding == null
                || contentEncoding.indexOf("gzip") == -1)){
        throw new ParseException("Content truncated at " + raw.length
                + " bytes. Parser can't handle incomplete pdf file.");
}
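
Since the same check lives in both parsers, it could also be pulled into
one shared helper. A minimal sketch, with made-up names (LengthCheckSketch,
looksTruncated) and assuming only gzip needs to be recognised; it skips the
comparison when the Content-Encoding header mentions gzip and copes with
that header being absent:

```java
import java.util.Locale;

// Hypothetical helper, not existing Nutch code.
class LengthCheckSketch {

    // Returns true only when a Content-Length header is present, the body
    // was not gzip-compressed, and the byte counts disagree.
    // A malformed Content-Length still throws NumberFormatException, as
    // Integer.parseInt does in the inline check.
    static boolean looksTruncated(String contentLength,
                                  String contentEncoding,
                                  int rawLength) {
        if (contentLength == null) {
            return false; // nothing to compare against
        }
        boolean gzipped = contentEncoding != null
                && contentEncoding.toLowerCase(Locale.ROOT).contains("gzip");
        if (gzipped) {
            return false; // header counts compressed bytes, so skip
        }
        return rawLength != Integer.parseInt(contentLength.trim());
    }

    public static void main(String[] args) {
        System.out.println(looksTruncated("100", null, 100));
        System.out.println(looksTruncated("100", null, 60));
        System.out.println(looksTruncated("100", "gzip", 400));
    }
}
```

Each parser would then throw its own ParseException when the helper
returns true.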

Hope this helps and makes sense.

Andy


_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
