[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses
[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471213#comment-16471213 ]

Hudson commented on NUTCH-2575:
-------------------------------

SUCCESS: Integrated in Jenkins build Nutch-trunk #3524 (See [https://builds.apache.org/job/Nutch-trunk/3524/])
NUTCH-2575 Storing total number of bytes read after every chunk (omkarreddy2008: [https://github.com/apache/nutch/commit/b541de8ff20b818667e2765664ae2f133b439dc3])
* (edit) src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java

> protocol-http does not respect the maximum content-size for chunked responses
> -----------------------------------------------------------------------------
>
>                 Key: NUTCH-2575
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2575
>             Project: Nutch
>          Issue Type: Sub-task
>          Components: protocol
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Critical
>             Fix For: 1.15
>
> There is a bug in HttpResponse::readChunkedContent that prevents it from
> stopping to read content once it exceeds the maximum allowed size.
> There [is a variable contentBytesRead|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L404]
> that is used to check how much content has been read, but it is never
> updated, so it always stays at zero, and [the size check|https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442]
> always returns false (unless a single chunk is larger than the maximum
> allowed content size).
> This allows any server to cause out-of-memory errors on our side.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
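The one-line fix referenced in this build message (storing the total number of bytes read after every chunk) can be illustrated with a simplified, hypothetical sketch of a chunk-reading loop. This is not the actual HttpResponse code; the class, method, and parameter names are invented for illustration. The point is that `contentBytesRead` must be incremented after each chunk, otherwise the size check compares an unchanging zero against the limit and never fires:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

public class ChunkLimitSketch {

  /**
   * Reads a sequence of chunks of the given sizes, stopping once
   * maxContent bytes have been accumulated. Without the line marked
   * below, contentBytesRead stays 0 forever and the limit check
   * never triggers -- the bug described in NUTCH-2575.
   */
  static byte[] readChunked(InputStream in, int[] chunkSizes, int maxContent)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    int contentBytesRead = 0;
    for (int chunkLen : chunkSizes) {
      byte[] chunk = new byte[chunkLen];
      int chunkBytesRead = 0;
      while (chunkBytesRead < chunkLen) {
        int n = in.read(chunk, chunkBytesRead, chunkLen - chunkBytesRead);
        if (n < 0) throw new IOException("premature end of stream");
        chunkBytesRead += n;
      }
      out.write(chunk, 0, chunkBytesRead);
      contentBytesRead += chunkBytesRead; // the essential accumulation step
      if (maxContent >= 0 && contentBytesRead >= maxContent) {
        break; // stop reading further chunks once the limit is reached
      }
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    // Four 25-byte chunks with a 60-byte limit: reading stops after
    // the third chunk, bounding memory use instead of growing forever.
    InputStream in = new ByteArrayInputStream(new byte[100]);
    byte[] content = readChunked(in, new int[] {25, 25, 25, 25}, 60);
    System.out.println(content.length); // prints 75
  }
}
```

Note the sketch truncates at chunk granularity, so the buffer may overshoot the limit by up to one chunk; the important property is that it can no longer grow without bound.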
[jira] [Commented] (NUTCH-2562) protocol-http fails to read large chunked HTTP responses
[ https://issues.apache.org/jira/browse/NUTCH-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471152#comment-16471152 ]

ASF GitHub Bot commented on NUTCH-2562:
---------------------------------------

sebastian-nagel opened a new pull request #329: NUTCH-2562 protocol-http fails to read large chunked HTTP responses
URL: https://github.com/apache/nutch/pull/329

- if http.content.limit is reached, skip the remaining chunked content including any headers in the trailer

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> protocol-http fails to read large chunked HTTP responses
> --------------------------------------------------------
>
>                 Key: NUTCH-2562
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2562
>             Project: Nutch
>          Issue Type: Sub-task
>    Affects Versions: 1.14
>            Reporter: Gerard Bouchar
>            Priority: Major
>             Fix For: 1.15
>
> While reading chunked content, if the content size becomes larger than
> http.getMaxContent(), instead of just stopping and truncating the content, it
> tries to read a new chunk before having read the previous one completely,
> resulting in a 'bad chunk length' error.
>
> See: https://github.com/apache/nutch/blob/master/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java#L440-L442
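The behaviour proposed in PR #329 (truncate at the limit but still consume the remaining chunks and any trailer headers) can be sketched as below. This is a hypothetical, simplified chunked-transfer reader, not the Nutch implementation: it ignores chunk extensions and buffers even discarded chunks, which a production reader should avoid. It shows why aborting mid-chunk causes the 'bad chunk length' error: the next read of a chunk-size line lands inside chunk data instead.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

public class ChunkedTruncate {

  // Reads one CRLF-terminated line (CR and LF are stripped).
  static String readLine(InputStream in) throws IOException {
    StringBuilder sb = new StringBuilder();
    int c;
    while ((c = in.read()) != -1 && c != '\n') {
      if (c != '\r') sb.append((char) c);
    }
    return sb.toString();
  }

  /** Truncates at maxContent but keeps consuming chunks and trailer. */
  static byte[] readChunkedContent(InputStream in, int maxContent)
      throws IOException {
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    boolean truncated = false;
    while (true) {
      // Chunk-size line is hexadecimal; chunk extensions are not handled.
      int chunkLen = Integer.parseInt(readLine(in).trim(), 16);
      if (chunkLen == 0) break; // last chunk
      byte[] chunk = new byte[chunkLen];
      int read = 0;
      while (read < chunkLen) {
        int n = in.read(chunk, read, chunkLen - read);
        if (n < 0) throw new EOFException("bad chunk length");
        read += n;
      }
      // The CRLF after the chunk data must be consumed even when
      // truncating, so the stream stays positioned at a chunk-size line.
      readLine(in);
      if (!truncated) {
        int keep = Math.min(chunkLen, maxContent - out.size());
        out.write(chunk, 0, keep);
        if (out.size() >= maxContent) truncated = true;
      }
    }
    // Skip optional trailer headers up to the terminating blank line.
    while (!readLine(in).isEmpty()) {
      // discard trailer header
    }
    return out.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    String body = "5\r\nhello\r\n6\r\n world\r\n3\r\n...\r\n0\r\nX-Trailer: t\r\n\r\n";
    InputStream in =
        new ByteArrayInputStream(body.getBytes(StandardCharsets.US_ASCII));
    System.out.println(
        new String(readChunkedContent(in, 8), StandardCharsets.US_ASCII));
    // prints "hello wo"
  }
}
```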
[jira] [Updated] (NUTCH-2562) protocol-http fails to read large chunked HTTP responses
[ https://issues.apache.org/jira/browse/NUTCH-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2562:
-----------------------------------
    Affects Version/s: 1.14
[jira] [Updated] (NUTCH-2562) protocol-http fails to read large chunked HTTP responses
[ https://issues.apache.org/jira/browse/NUTCH-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2562:
-----------------------------------
    Fix Version/s: 1.15
[jira] [Commented] (NUTCH-2562) protocol-http fails to read large chunked HTTP responses
[ https://issues.apache.org/jira/browse/NUTCH-2562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471136#comment-16471136 ]

Sebastian Nagel commented on NUTCH-2562:
----------------------------------------

Confirmed and reproduced. The reason why the remaining chunks are read to the end was obviously to reach the optional trailing headers. But you're right: better to stop and skip the trailing headers (if any) together with the remaining content.
[jira] [Resolved] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses
[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2575.
------------------------------------
    Resolution: Fixed

Thanks, [~gbouchar]! Thanks, [~omkar20895]! Solution confirmed, merged PR.
[jira] [Updated] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses
[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2575:
-----------------------------------
    Fix Version/s: 1.15
[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses
[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16471105#comment-16471105 ]

ASF GitHub Bot commented on NUTCH-2575:
---------------------------------------

sebastian-nagel closed pull request #327: NUTCH-2575 Storing total number of bytes read after every chunk
URL: https://github.com/apache/nutch/pull/327

This is a PR merged from a forked repository. As GitHub hides the original diff on merge, it is displayed below for the sake of provenance:

diff --git a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
index c87c11125..591b94298 100644
--- a/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
+++ b/src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
@@ -464,6 +464,7 @@ private void readChunkedContent(PushbackInputStream in, StringBuffer line)
         chunkBytesRead += len;
       }
 
+      contentBytesRead += chunkBytesRead;
       readLine(in, line, false);
     }

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Updated] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses
[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2575:
-----------------------------------
    Affects Version/s: 1.14
[jira] [Updated] (NUTCH-2575) protocol-http does not respect the maximum content-size for chunked responses
[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2575:
-----------------------------------
    Summary: protocol-http does not respect the maximum content-size for chunked responses (was: protocol-http does not respect the maximum content-size)
[jira] [Resolved] (NUTCH-2161) Interrupted failed and/or killed tasks fail to clean up temp directories in HDFS
[ https://issues.apache.org/jira/browse/NUTCH-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2161.
------------------------------------
    Resolution: Duplicate

It was part of NUTCH-2518 to make sure that data is removed if the job is killed ({{hadoop job -kill ...}}). If the job client is killed - the Java program which has launched the job(s) - no clean-up can be done. Hadoop allows to set up users and quotas to limit the amount of forgotten data.

> Interrupted failed and/or killed tasks fail to clean up temp directories in HDFS
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-2161
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2161
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.11
>            Reporter: Lewis John McGibbney
>            Priority: Major
>             Fix For: 1.15
>
> If for example one kills an inject or generate job, Nutch does not clean up
> 'temporary' directories and I have witnessed them remain within HDFS. This is
> far from ideal if we have a large team of users all hammering away on Yarn
> and persisting data into HDFS.
> We should investigate how to clean up these directories such that a cluster
> admin is not left with all of the dross at the end of the long day ;)
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470312#comment-16470312 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
URL: https://github.com/apache/nutch/pull/279#discussion_r187313029

## File path: src/bin/crawl

@@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
 
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment: To keep it simple, I would suggest to
1. remove the explicit `mapred.child.java.opts`
2. add a new environment variable `NUTCH_HADOOP_OPTS` which is used to add further options to the commonOptions

In local mode it's still sufficient to use the environment variable NUTCH_OPTS to set the Java heap size. In distributed mode, this would allow fine-grained memory settings, e.g.

```
export NUTCH_HADOOP_OPTS="-Dmapreduce.map.memory.mb=8192 -Dmapreduce.map.java.opts=-Xmx7680m"
```

These are required, as the Java heap must be lower than the total memory allocated for a map/reduce task.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> ------------------------------------------------------------------
>
>                 Key: NUTCH-2501
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2501
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Moreno Feltscher
>            Assignee: Lewis John McGibbney
>            Priority: Major
[jira] [Resolved] (NUTCH-2514) Segmentation Fault issue while running crawl job.
[ https://issues.apache.org/jira/browse/NUTCH-2514?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2514.
------------------------------------
       Resolution: Incomplete
    Fix Version/s: (was: 2.4)

Hi [~kshitij], Nutch does not have any native code which could cause a segmentation fault; only Java, the native Hadoop library, or a third-party library could be responsible. It's also not possible to localize the reason with the provided information. We need the native stack trace (e.g. hs_err_pid*.log) of the crashed task. If you can provide more details, please reopen the issue. Thanks!

> Segmentation Fault issue while running crawl job.
> -------------------------------------------------
>
>                 Key: NUTCH-2514
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2514
>             Project: Nutch
>          Issue Type: Bug
>          Components: crawldb, fetcher, indexer, parser
>    Affects Versions: 2.3.1
>         Environment: OS - centos-release-6-9.el6.12.3.x86_64
>                      Hadoop-2.5.2 cluster with 5 nodes
>                      Nutch - 2.3.1
>                      Hbase-0.98.8
>                      Solr-5.4.1
>            Reporter: Kshitij Shukla
>            Priority: Major
>
> An error occurs while running the crawl job, in the fetching, parsing, and indexing phases. Error posted below:
>
> ExitCodeException exitCode=139: /bin/bash: line 1: 68684 Segmentation fault
> /usr/lib/jvm/java-1.7.0-openjdk-1.7.0.161.x86_64/bin/java
> -Djava.net.preferIPv4Stack=true -Dhadoop.metrics.log.level=WARN -Xmx13312m
> -Djava.io.tmpdir=/tmp/hadoop-root/nm-local-dir/usercache/root/appcache/application_1519286094099_0016/container_1519286094099_0016_01_03/tmp
> -Dlog4j.configuration=container-log4j.properties
> -Dyarn.app.container.log.dir=/home/c1/hadoop-2.5.2/logs/userlogs/application_1519286094099_0016/container_1519286094099_0016_01_03
> -Dyarn.app.container.log.filesize=0 -Dhadoop.root.logger=INFO,CLA
> org.apache.hadoop.mapred.YarnChild 95.142.101.139 35714
> attempt_1519286094099_0016_r_00_0 3
> > /home/c1/hadoop-2.5.2/logs/userlogs/application_1519286094099_0016/container_1519286094099_0016_01_03/stdout
> 2> /home/c1/hadoop-2.5.2/logs/userlogs/application_1519286094099_0016/container_1519286094099_0016_01_03/stderr
>         at org.apache.hadoop.util.Shell.runCommand(Shell.java:538)
>         at org.apache.hadoop.util.Shell.run(Shell.java:455)
>         at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:702)
>         at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:195)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:300)
>         at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:81)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1152)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:622)
>         at java.lang.Thread.run(Thread.java:748)
[jira] [Commented] (NUTCH-2575) protocol-http does not respect the maximum content-size
[ https://issues.apache.org/jira/browse/NUTCH-2575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16470151#comment-16470151 ]

ASF GitHub Bot commented on NUTCH-2575:
---------------------------------------

sebastian-nagel commented on issue #327: NUTCH-2575 Storing total number of bytes read after every chunk
URL: https://github.com/apache/nutch/pull/327#issuecomment-388005755

+1 lgtm.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: us...@infra.apache.org