[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses

2018-05-10 Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471213#comment-16471213
 ] 

Hudson commented on NUTCH-2575:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3524 (See 
[https://builds.apache.org/job/Nutch-trunk/3524/])
NUTCH-2575 Storing total number of bytes read after every chunk 
(omkarreddy2008: 
[https://github.com/apache/nutch/commit/b541de8ff20b818667e2765664ae2f133b439dc3])
* (edit) 
src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java


> protocol-http does not respect the maximum content-size for chunked responses
> -
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Critical
> Fix For: 1.15
>
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping when the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays zero, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.
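
For context, here is a minimal, self-contained sketch of the pattern the
report describes (all names and the limit value are illustrative stand-ins,
not the actual HttpResponse code): the limit check can only ever fire if the
running total is updated after every chunk.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Illustrative chunked-content reader with a size cap (not Nutch code).
public class ChunkLimitSketch {
  static final int MAX_CONTENT = 8; // stand-in for http.getMaxContent()

  static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.read()) != -1 && c != '\n')
      if (c != '\r') sb.append((char) c);
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    // Two 6-byte chunks (12 bytes total), then the terminating 0-length chunk.
    byte[] wire = "6\r\nAAAAAA\r\n6\r\nBBBBBB\r\n0\r\n\r\n"
        .getBytes(StandardCharsets.US_ASCII);
    InputStream in = new ByteArrayInputStream(wire);
    int contentBytesRead = 0;
    while (true) {
      int chunkLen = Integer.parseInt(readLine(in).trim(), 16);
      if (chunkLen == 0)
        break; // last chunk reached
      // This check is useless if contentBytesRead is never incremented:
      // it would then only fire for a single over-sized chunk.
      if (contentBytesRead + chunkLen > MAX_CONTENT) {
        System.out.println("limit reached, truncating at " + contentBytesRead);
        break;
      }
      for (int i = 0; i < chunkLen; i++)
        in.read(); // consume (and, in real code, buffer) the chunk data
      contentBytesRead += chunkLen; // the update the original code was missing
      readLine(in); // consume the CRLF terminating the chunk data
    }
  }
}
```

The one-line fix merged for this issue (see the diff quoted later in this
digest) restores exactly this update.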



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2562) protocol-http fails to read large chunked HTTP responses

2018-05-10 ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471152#comment-16471152
 ] 

ASF GitHub Bot commented on NUTCH-2562:
---

sebastian-nagel opened a new pull request #329: NUTCH-2562 protocol-http fails 
to read large chunked HTTP responses
URL: https://github.com/apache/nutch/pull/329
 
 
   - if http.content.limit is reached, skip the remaining chunked content 
 including any headers in the trailer


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> protocol-http fails to read large chunked HTTP responses
> 
>
> Key: NUTCH-2562
> URL: https://issues.apache.org/jira/browse/NUTCH-2562
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> While reading chunked content, if the content size becomes larger than 
> http.getMaxContent(), instead of just stopping and truncating the content, it 
> tries to read a new chunk before having read the previous one completely, 
> resulting in a 'bad chunk length' error.
>  
> See: 
> https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442
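
To make the failure mode concrete, here is a small self-contained demo
(illustrative code, not the Nutch implementation): aborting in the middle of
a chunk leaves the rest of that chunk on the stream, so the next read of a
"chunk-size line" yields content bytes that do not parse as a hexadecimal
length.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Demo of the 'bad chunk length' failure mode (not Nutch code).
public class BadChunkLengthDemo {
  static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.read()) != -1 && c != '\n')
      if (c != '\r') sb.append((char) c);
    return sb.toString();
  }

  public static void main(String[] args) throws IOException {
    // One 6-byte chunk ("hello!") followed by the terminating 0-length chunk.
    byte[] wire = "6\r\nhello!\r\n0\r\n\r\n".getBytes(StandardCharsets.US_ASCII);
    InputStream in = new ByteArrayInputStream(wire);
    int chunkLen = Integer.parseInt(readLine(in).trim(), 16); // 6
    // Simulate hitting the content limit after only 3 of the 6 bytes:
    for (int i = 0; i < 3; i++)
      in.read();
    // The naive next step reads a 'chunk-size line' -- but the stream is
    // still positioned inside the current chunk, so we get "lo!" instead.
    String line = readLine(in);
    try {
      Integer.parseInt(line.trim(), 16);
    } catch (NumberFormatException e) {
      System.out.println("bad chunk length: '" + line + "'");
    }
  }
}
```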



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2562) protocol-http fails to read large chunked HTTP responses

2018-05-10 Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2562:
---
Affects Version/s: 1.14

> protocol-http fails to read large chunked HTTP responses
> 
>
> Key: NUTCH-2562
> URL: https://issues.apache.org/jira/browse/NUTCH-2562
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> While reading chunked content, if the content size becomes larger than 
> http.getMaxContent(), instead of just stopping and truncating the content, it 
> tries to read a new chunk before having read the previous one completely, 
> resulting in a 'bad chunk length' error.
>  
> See: 
> https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2562) protocol-http fails to read large chunked HTTP responses

2018-05-10 Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2562:
---
Fix Version/s: 1.15

> protocol-http fails to read large chunked HTTP responses
> 
>
> Key: NUTCH-2562
> URL: https://issues.apache.org/jira/browse/NUTCH-2562
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Major
> Fix For: 1.15
>
>
> While reading chunked content, if the content size becomes larger than 
> http.getMaxContent(), instead of just stopping and truncating the content, it 
> tries to read a new chunk before having read the previous one completely, 
> resulting in a 'bad chunk length' error.
>  
> See: 
> https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2562) protocol-http fails to read large chunked HTTP responses

2018-05-10 Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471136#comment-16471136
 ] 

Sebastian Nagel commented on NUTCH-2562:


Confirmed and reproduced. The remaining chunks were evidently still being read 
in order to reach the optional trailing headers. But you're right: it is better 
to stop and skip the trailing headers (if any) together with the remaining 
content.
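
A minimal sketch of that strategy (assumed helper names; the actual change is
in PR #329, quoted earlier in this digest): once http.content.limit is
reached, keep consuming the remaining chunks and the optional trailer so the
connection state stays consistent, but discard the bytes instead of buffering
them.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Sketch of "stop and skip the rest" for chunked responses (not the actual
// Nutch patch): discard remaining chunks and any trailer headers.
public class SkipRemainderSketch {
  static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.read()) != -1 && c != '\n')
      if (c != '\r') sb.append((char) c);
    return sb.toString();
  }

  static void skipRemainingChunks(InputStream in) throws IOException {
    int chunkLen;
    while ((chunkLen = Integer.parseInt(readLine(in).trim(), 16)) > 0) {
      // Discard the chunk data; a real implementation would loop here,
      // since InputStream.skip() may skip fewer bytes than requested.
      in.skip(chunkLen);
      readLine(in); // discard the CRLF after the chunk data
    }
    // Discard optional trailer headers up to the terminating empty line.
    while (!readLine(in).isEmpty()) {
      // skip trailer header
    }
  }

  public static void main(String[] args) throws IOException {
    // Remainder of a response: one more 6-byte chunk, the 0-length chunk,
    // one trailer header, and the final empty line.
    byte[] rest = "6\r\nBBBBBB\r\n0\r\nX-Trailer: 1\r\n\r\n"
        .getBytes(StandardCharsets.US_ASCII);
    InputStream in = new ByteArrayInputStream(rest);
    skipRemainingChunks(in);
    System.out.println("bytes left on stream: " + in.available()); // 0
  }
}
```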

> protocol-http fails to read large chunked HTTP responses
> 
>
> Key: NUTCH-2562
> URL: https://issues.apache.org/jira/browse/NUTCH-2562
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Gerard Bouchar
>Priority: Major
>
> While reading chunked content, if the content size becomes larger than 
> http.getMaxContent(), instead of just stopping and truncating the content, it 
> tries to read a new chunk before having read the previous one completely, 
> resulting in a 'bad chunk length' error.
>  
> See: 
> https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses

2018-05-10 Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2575.

Resolution: Fixed

Thanks, [~gbouchar]! Thanks, [~omkar20895]!

Solution confirmed, merged PR.

> protocol-http does not respect the maximum content-size for chunked responses
> -
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Critical
> Fix For: 1.15
>
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping when the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays zero, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses

2018-05-10 Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2575:
---
Fix Version/s: 1.15

> protocol-http does not respect the maximum content-size for chunked responses
> -
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Critical
> Fix For: 1.15
>
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping when the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays zero, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses

2018-05-10 ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471105#comment-16471105
 ] 

ASF GitHub Bot commented on NUTCH-2575:
---

sebastian-nagel closed pull request #327: NUTCH-2575 Storing total number of 
bytes read after every chunk
URL: https://github.com/apache/nutch/pull/327
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
index c87c11125..591b94298 100644
--- a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
+++ b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
@@ -464,6 +464,7 @@ private void readChunkedContent(PushbackInputStream in, StringBuffer line)
         chunkBytesRead += len;
       }
 
+      contentBytesRead += chunkBytesRead;
       readLine(in, line, false);
 
     }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> protocol-http does not respect the maximum content-size for chunked responses
> -
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Critical
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping when the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays zero, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses

2018-05-10 Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2575:
---
Affects Version/s: 1.14

> protocol-http does not respect the maximum content-size for chunked responses
> -
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>Affects Versions: 1.14
>Reporter: Gerard Bouchar
>Priority: Critical
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping when the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays zero, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses

2018-05-10 Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2575:
---
Summary: protocol-http does not respect the maximum content-size for 
chunked responses  (was: protocol-http does not respect the maximum 
content-size)

> protocol-http does not respect the maximum content-size for chunked responses
> -
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Gerard Bouchar
>Priority: Critical
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping when the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays zero, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2161) Interrupted failed and/or killed tasks fail to clean up temp directories in HDFS

2018-05-10 Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2161.

Resolution: Duplicate

It was part of NUTCH-2518 to make sure that data is removed if the job is 
killed ({{hadoop job -kill ...}}). If the job client is killed - the Java 
program which has launched the job(s) - no clean-up can be done. Hadoop allows 
setting up users and quotas to limit the amount of forgotten data.

> Interrupted failed and/or killed tasks fail to clean up temp directories in 
> HDFS
> 
>
> Key: NUTCH-2161
> URL: https://issues.apache.org/jira/browse/NUTCH-2161
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> If for example one kills an inject or generate job, Nutch does not clean up 
> 'temporary' directories and I have witnessed them remain within HDFS. This is 
> far from ideal if we have a large team of users all hammering away on Yarn 
> and persisting data into HDFS.
> We should investigate how to clean up these directories such that a cluster 
> admin is not left with all of the dross at the end of the long day ;)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-05-10 ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470312#comment-16470312
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take 
NUTCH_HEAPSIZE into account  when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r187313029
 
 

 File path: src/bin/crawl

 @@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`
 
 Review comment:
   To keep it simple, I would suggest to
   1. remove the explicit `mapred.child.java.opts`, and
   2. add a new environment variable `NUTCH_HADOOP_OPTS` which is used to 
      add further options to the commonOptions
   
   In local mode it's still sufficient to use the environment variable 
NUTCH_OPTS to set the Java heap size. In distributed mode, this would allow 
fine-grained memory settings, e.g.
   ```
   export NUTCH_HADOOP_OPTS="-Dmapreduce.map.memory.mb=8192 
-Dmapreduce.map.java.opts=-Xmx7680m"
   ```
   These are required, as the Java heap must be lower than the total memory 
allocated for a map/reduce task.


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2514) Segmentation Fault issue while running crawl job.

2018-05-10 Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2514.

   Resolution: Incomplete
Fix Version/s: (was: 2.4)

Hi [~kshitij], Nutch does not have any native code which could cause a 
segmentation fault; only Java itself, the native Hadoop library, or a 
third-party library could be responsible.

It's also not possible to localize the cause with the provided information. We 
need the native stack trace (e.g. hs_err_pid*.log) of the crashed task. If you 
can provide more details, please reopen the issue. Thanks!

> Segmentation Fault issue  while running crawl job.
> --
>
> Key: NUTCH-2514
> URL: https://issues.apache.org/jira/browse/NUTCH-2514
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb, fetcher, indexer, parser
>Affects Versions: 2.3.1
> Environment: OS- centos-release-6-9.el6.12.3.x86_64
> Hadoop-2.5.2 cluster with 5 nodes
> Nutch - 2.3.1
> Hbase-0.98.8
> Solr-5.4.1
>Reporter: Kshitij Shukla
>Priority: Major
>
> The error occurs while running the crawl job during the fetching, parsing and 
> indexing phases. The error is posted below:
> ExitCodeException exitCode=139: /bin/bash: line 1: 68684 Segmentation fault   
>/usr/lib/jvm/java-1.7.0-openjdk-1.7.0.161.x86_64/bin/java 
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx13312m 
> -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1519286094099_0016/container_1519286094099_0016_01_03/tmp
>  -Dlog4j.configuration=container-log4j.properties 
> -Dyarn.app.container.log.dir=/home/c1/hadoop-2.5.2/logs/userlogs/application_1519286094099_0016/container_1519286094099_0016_01_03
>  -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA 
> org.apache.hadoop.mapred.YarnChild 95.142.101.139 35714 
> attempt_1519286094099_0016_r_00_0 3 > 
> /home/c1/hadoop-2.5.2/logs/userlogs/application_1519286094099_0016/container_1519286094099_0016_01_03/stdout
>  2> 
> /home/c1/hadoop-2.5.2/logs/userlogs/application_1519286094099_0016/container_1519286094099_0016_01_03/stderr
>   at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>   at org.apache.hadoop.util.Shell.run(Shell.java:455)
>   at 
> org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>   at 
> org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
>   at 
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
>   at java.lang.Thread.run(Thread.java:748)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size

2018-05-10 ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470151#comment-16470151
 ] 

ASF GitHub Bot commented on NUTCH-2575:
---

sebastian-nagel commented on issue #327: NUTCH-2575 Storing total number of 
bytes read after every chunk
URL: https://github.com/apache/nutch/pull/327#issuecomment-388005755
 
 
   +1 lgtm. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> protocol-http does not respect the maximum content-size
> ---
>
> Key: NUTCH-2575
> URL: https://issues.apache.org/jira/browse/NUTCH-2575
> Project: Nutch
>  Issue Type: Sub-task
>Reporter: Gerard Bouchar
>Priority: Critical
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from 
> stopping when the content read exceeds the maximum allowed size.
> There [is a variable 
> contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
>  that is used to check how much content has been read, but it is never 
> updated, so it always stays zero, and [the size 
> check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
>  always returns false (unless a single chunk is larger than the maximum 
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)