Re: [Nutch-dev] Content-Length check in PDF and HTML Parser

john Mon, 05 Jul 2004 00:19:08 -0700

Hi, Andy,

On Fri, Jul 02, 2004 at 01:28:47PM +0100, Hedges, Andrew wrote:
> 
> There is a check in these two parsers to see if the content length as
> reported by the http server in the "Content-Length" header is equal to the
> length of the content in the raw byte array of content. The code is in both
> files and appears like:
> 
> if (contentLength != null
>       && raw.length != Integer.parseInt(contentLength)) {
>       throw new ParseException("Content truncated at " + raw.length
>             + " bytes. Parser can't handle incomplete pdf file.");
> }
> 
> Whilst this is ok in most circumstances it is not ok when the http stream
> has been compressed. In this case the content-length is the length of the
> compressed data and the raw.length is the length of the uncompressed data.
> Well, at least that is my understanding of it and what I'm seeing in the wild.


You are right. I overlooked the issue, since I mostly worked with the file:///
client (for convenience), which has no content compression mechanism.
The same goes for the other issue you raised in an earlier email. Thanks.
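
To make the mismatch concrete, here is a minimal sketch. It is not Nutch
code, and the class and method names are made up for illustration. It
gzips a buffer, decodes it again, and prints both sizes: the first is what
a server would report as Content-Length for a gzip-encoded response, the
second is what the parser would see as raw.length.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustration only: not part of Nutch. All names here are hypothetical.
class GzipLengthDemo {

    // Compress a byte array, as a server would for "Content-Encoding: gzip".
    static byte[] gzip(byte[] plain) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(plain);
        }
        return out.toByteArray();
    }

    // Decompress to recover the bytes a parser would be handed.
    static byte[] gunzip(byte[] compressed) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPInputStream gz =
                 new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = gz.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 200; i++) {
            sb.append("nutch ");
        }
        byte[] plain = sb.toString().getBytes(StandardCharsets.UTF_8);
        byte[] wire = gzip(plain);   // what travels over HTTP
        byte[] raw = gunzip(wire);   // what the parser receives
        System.out.println("Content-Length (on the wire): " + wire.length);
        System.out.println("raw.length (after decoding):  " + raw.length);
    }
}
```

For repetitive input like this the two numbers differ widely, so a strict
equality check would reject a perfectly complete document.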

> 
> This doesn't affect html pages as there is no check on the content length
> because a partial page can still be indexed.
> 
> I have created a patch that works around the problem by not performing the
> check if the content is compressed, but I'm now thinking it might be better
> to have a flag on the Content object itself that states whether it has
> been truncated or not. This way we can mark content as truncated even when
> it is from a compressed stream.

Yes, a flag indicating truncation is a better approach. This was discussed
when Doug set out to make format changes in various db files about one
month ago.
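
A rough sketch of what the flag approach could look like follows. This is
not the actual Content class: FetchedContent, StrictParser, fromDecoded,
and the maxBytes limit are all hypothetical names, and treating a negative
maxBytes as "no limit" is an assumption made for the example.

```java
import java.util.Arrays;

// Hypothetical stand-in for a fetched-content record. This is NOT Nutch's
// Content class; field and method names are assumptions for illustration.
class FetchedContent {
    private final byte[] raw;
    private final boolean truncated;

    FetchedContent(byte[] raw, boolean truncated) {
        this.raw = raw;
        this.truncated = truncated;
    }

    byte[] getRaw() { return raw; }

    boolean isTruncated() { return truncated; }

    // The fetching side knows its own byte limit, so it can set the flag
    // whether or not the stream was compressed on the wire.
    // Assumption: a negative maxBytes means "no limit".
    static FetchedContent fromDecoded(byte[] decoded, int maxBytes) {
        if (maxBytes >= 0 && decoded.length > maxBytes) {
            return new FetchedContent(Arrays.copyOf(decoded, maxBytes), true);
        }
        return new FetchedContent(decoded, false);
    }
}

class StrictParser {
    // Parsers that need the whole file consult the flag instead of
    // comparing raw.length with the Content-Length header.
    static void checkComplete(FetchedContent content) {
        if (content.isTruncated()) {
            throw new IllegalStateException("Content truncated at "
                + content.getRaw().length
                + " bytes. Parser can't handle incomplete file.");
        }
    }
}
```

With something along these lines, a parser that tolerates partial input
(such as the html one) can simply ignore the flag.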

> 
> I would implement this change and submit a patch but I'm off on holiday
> soon. If it hasn't been done by the time I get back I will go ahead and do
> it - assuming that everyone agrees that this is the best way forward.

It would be great if you could prepare the patch.

John

> 
> For what it's worth, the workaround changes the above code snippet to
> (similarly for msword):
> 
> String contentEncoding = content.get("Content-Encoding");
> // Only enforce the length check when the stream was not gzip-compressed.
> if (contentLength != null
>       && raw.length != Integer.parseInt(contentLength)
>       && (contentEncoding == null
>           || contentEncoding.indexOf("gzip") == -1)) {
>       throw new ParseException("Content truncated at " + raw.length
>             + " bytes. Parser can't handle incomplete pdf file.");
> }
> 
> Hope this helps and makes sense.
> 
> Andy

