[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471105#comment-16471105 ]
ASF GitHub Bot commented on NUTCH-2575:
---------------------------------------

sebastian-nagel closed pull request #327: NUTCH-2575 Storing total number of bytes read after every chunk
URL: https://github.com/apache/nutch/pull/327

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
index c87c11125..591b94298 100644
--- a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
+++ b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
@@ -464,6 +464,7 @@ private void readChunkedContent(PushbackInputStream in, StringBuffer line)
         chunkBytesRead += len;
       }
+      contentBytesRead += chunkBytesRead;
       readLine(in, line, false);
     }

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> protocol-http does not respect the maximum content-size for chunked responses
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-2575
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2575
>             Project: Nutch
>          Issue Type: Sub-task
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Critical
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from
> stopping when the content read exceeds the maximum allowed size.
> There [is a variable
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
> that is used to check how much content has been read, but it is never
> updated, so it always stays at zero, and [the size
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
> always returns false (unless a single chunk is larger than the maximum
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
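
For readers who want to see the effect of the one-line change, the following is a minimal sketch of the behaviour described in the issue. It is not the Nutch code: the class ChunkedLimitSketch, the method storeChunks, and the updateTotal flag are hypothetical names introduced only for illustration, and the chunks are assumed to be already parsed into byte arrays rather than read from a stream. Only the names contentBytesRead and chunkBytesRead are taken from the diff above.

```java
import java.io.ByteArrayOutputStream;

// Hypothetical illustration class; not part of Nutch.
final class ChunkedLimitSketch {

  /**
   * Copies pre-parsed chunks into a buffer, trimming each chunk against
   * maxContent. With updateTotal = false the running total is never
   * updated (the behaviour reported in the issue); with updateTotal = true
   * the total is accumulated after every chunk (the change in the diff).
   * Returns the number of bytes stored.
   */
  static int storeChunks(byte[][] chunks, int maxContent, boolean updateTotal) {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int contentBytesRead = 0;

    for (byte[] chunk : chunks) {
      int chunkLen = chunk.length;

      // Size check: trim the chunk so the running total stays within the limit.
      // If contentBytesRead is stuck at zero, this only fires for a single
      // chunk that is itself larger than maxContent.
      if (maxContent >= 0 && contentBytesRead + chunkLen > maxContent) {
        chunkLen = maxContent - contentBytesRead;
      }

      out.write(chunk, 0, chunkLen);
      int chunkBytesRead = chunkLen;

      if (updateTotal) {
        // The accumulated total makes later chunks get trimmed, down to zero
        // bytes once the limit has been reached.
        contentBytesRead += chunkBytesRead;
      }
    }
    return out.size();
  }
}
```

With four 40-byte chunks and a limit of 100 bytes, the variant without the update stores every chunk in full, because no single chunk exceeds the limit. The variant with the update trims the third chunk and stores nothing from the fourth.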