[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386469#comment-16386469 ]

Lewis John McGibbney commented on NUTCH-2517:
---------------------------------------------

Thank you [~mebbinghaus] for reporting. This appears to be a major bug and hence a blocker for the next release. I will begin work on a solution ASAP. FYI [~omkar20895] this is post Hadoop upgrade.

> mergesegs corrupts segment data
> -------------------------------
>
>                 Key: NUTCH-2517
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2517
>             Project: Nutch
>          Issue Type: Bug
>          Components: segment
>    Affects Versions: 1.15
>         Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>            Reporter: Marco Ebbinghaus
>            Priority: Blocker
>              Labels: mapreduce, mergesegs
>             Fix For: 1.15
>         Attachments: Screenshot_2018-03-03_18-09-28.png
>
> The problem probably occurs since commit
> https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4
>
> How to reproduce:
> * create a container from the apache/nutch image (latest)
> * open a terminal in that container
> * set http.agent.name
> * create a crawl directory and a urls file
> * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
> * run bin/nutch generate (bin/nutch generate mycrawl/crawldb mycrawl/segments 1)
> ** this results in a segment (e.g. 20180304134215)
> * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 -threads 2)
> * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 -threads 2)
> ** ls in the segment folder -> existing folders: content, crawl_fetch, crawl_generate, crawl_parse, parse_data, parse_text
> * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb mycrawl/segments/20180304134215)
> * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments mycrawl/segments/* -filter)
> ** console output: `SegmentMerger: using segment data from: content crawl_generate crawl_fetch crawl_parse parse_data parse_text`
> ** resulting segment: 20180304134535
> * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing folder: crawl_generate
> * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir mycrawl/MERGEDsegments), which results in a consequential error
> ** console output:
> `LinkDb: adding segment: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
> LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input path does not exist: file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
> at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
> at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
> at org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
> at org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
> at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
> at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
> at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
> at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
> at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
>
> So it seems MapReduce corrupts the segment folder during the mergesegs command.
>
> Note that this issue is not limited to merging a single segment as described above. As the attached screenshot shows, the problem also appears when executing multiple bin/nutch generate/fetch/parse/updatedb cycles before executing mergesegs, resulting in a segment count > 1.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
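The corruption the report describes (a merged segment containing only crawl_generate) can be spotted before invertlinks fails by checking the segment directory for the standard subdirectories. The sketch below is a hypothetical helper, not part of Nutch; the list of expected folders comes from the reproduction steps in the report.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: report which standard segment subdirectories
// are missing from a (merged) segment directory.
public class SegmentCheck {
    // Expected parts per the report's "ls in the segment folder" step.
    static final String[] EXPECTED_PARTS = {
        "content", "crawl_generate", "crawl_fetch",
        "crawl_parse", "parse_data", "parse_text"
    };

    static List<String> missingParts(File segmentDir) {
        List<String> missing = new ArrayList<>();
        for (String part : EXPECTED_PARTS) {
            if (!new File(segmentDir, part).isDirectory()) {
                missing.add(part);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        List<String> missing = missingParts(new File(args[0]));
        System.out.println(missing.isEmpty()
            ? "segment looks complete"
            : "missing subdirectories: " + missing);
    }
}
```

Run against a segment produced by mergesegs, a non-empty "missing" list reproduces the symptom described above.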
[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data
[ https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney updated NUTCH-2517:
----------------------------------------
    Priority: Blocker  (was: Major)
[jira] [Commented] (NUTCH-2519) Log mapreduce job counters in local mode
[ https://issues.apache.org/jira/browse/NUTCH-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386339#comment-16386339 ]

ASF GitHub Bot commented on NUTCH-2519:
---------------------------------------

lewismc commented on issue #287: NUTCH-2519 Log mapreduce job messages and counters in local mode URL: https://github.com/apache/nutch/pull/287#issuecomment-370484982 +1

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-2520) Wrong Accept-Charset sent when http.accept.charset is not defined
[ https://issues.apache.org/jira/browse/NUTCH-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386336#comment-16386336 ]

ASF GitHub Bot commented on NUTCH-2520:
---------------------------------------

lewismc commented on issue #288: NUTCH-2520 Use default value for Accept-Charset URL: https://github.com/apache/nutch/pull/288#issuecomment-370484879 +1
[jira] [Commented] (NUTCH-2521) SitemapProcessor to use property sitemap.redir.max
[ https://issues.apache.org/jira/browse/NUTCH-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386334#comment-16386334 ]

ASF GitHub Bot commented on NUTCH-2521:
---------------------------------------

lewismc commented on issue #289: NUTCH-2521 SitemapProcessor to use property sitemap.redir.max URL: https://github.com/apache/nutch/pull/289#issuecomment-370484703 +1
[jira] [Updated] (NUTCH-2523) UpdateHostDB blocks plugins unintentionally
[ https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yossi Tamari updated NUTCH-2523:
--------------------------------
    Attachment: NUTCH-2523.tamari.180305.patch.txt
[jira] [Created] (NUTCH-2523) UpdateHostDB blocks plugins unintentionally
Yossi Tamari created NUTCH-2523:
-----------------------------------
             Summary: UpdateHostDB blocks plugins unintentionally
                 Key: NUTCH-2523
                 URL: https://issues.apache.org/jira/browse/NUTCH-2523
             Project: Nutch
          Issue Type: Bug
          Components: hostdb
    Affects Versions: 1.14
            Reporter: Yossi Tamari

UpdateHostDB blocks the use of urlnormalizer-host and urlfilter-domainblacklist (it checks whether they are configured and throws an exception) without any good reason. Quoting Markus: "I simply reused the job setup code and forgot to remove that check. You can safely remove that check in HostDB."
[jira] [Created] (NUTCH-2522) Bidirectional URL exemption filter
Semyon Semyonov created NUTCH-2522:
--------------------------------------
             Summary: Bidirectional URL exemption filter
                 Key: NUTCH-2522
                 URL: https://issues.apache.org/jira/browse/NUTCH-2522
             Project: Nutch
          Issue Type: Improvement
          Components: plugin
            Reporter: Semyon Semyonov

The current Nutch URL exemption plugin exempts based on toUrl only; the new plugin uses both fromUrl and toUrl and, after the regex transformation, exempts based on the condition regex(fromUrl) == regex(toUrl). This approach allows more complex URL exemption checks, such as allowing links like http://www.website.com/home -> http://website.com/about (with/without www).
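The proposed condition can be sketched as follows. This is not the plugin's actual code; the transformation regex (reducing a URL to its host without a leading "www.") is an assumed example standing in for whatever regex the plugin would read from configuration.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch of the proposed bidirectional check: a link is exempted when
// both URLs map to the same value under a shared regex transformation,
// i.e. regex(fromUrl) == regex(toUrl).
public class BidirectionalExemption {
    // Illustrative transformation: extract the host, ignoring a leading
    // "www.". The real plugin would make this configurable.
    private static final Pattern HOST =
        Pattern.compile("^https?://(?:www\\.)?([^/]+)");

    static String transform(String url) {
        Matcher m = HOST.matcher(url);
        return m.find() ? m.group(1) : url;
    }

    static boolean isExempted(String fromUrl, String toUrl) {
        return transform(fromUrl).equals(transform(toUrl));
    }
}
```

Under this example transformation, the link from the issue description (www.website.com/home -> website.com/about) is exempted because both sides reduce to "website.com", while links across different hosts are not.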
[jira] [Created] (NUTCH-2521) SitemapProcessor to use property sitemap.redir.max
Sebastian Nagel created NUTCH-2521:
--------------------------------------
             Summary: SitemapProcessor to use property sitemap.redir.max
                 Key: NUTCH-2521
                 URL: https://issues.apache.org/jira/browse/NUTCH-2521
             Project: Nutch
          Issue Type: Bug
    Affects Versions: 1.15
            Reporter: Sebastian Nagel
             Fix For: 1.15

SitemapProcessor isn't actually using the property sitemap.redir.max (NUTCH-2466); instead, the maximum number of redirects is hard-wired (3).
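The fix amounts to resolving the limit from configuration with 3 as the fallback rather than hard-wiring it. A minimal sketch, with java.util.Properties standing in for Hadoop's Configuration (not the actual SitemapProcessor code):

```java
import java.util.Properties;

public class SitemapRedirects {
    // Resolve the redirect limit from the sitemap.redir.max property,
    // falling back to the formerly hard-wired value of 3.
    static int maxRedirects(Properties conf) {
        String value = conf.getProperty("sitemap.redir.max");
        return value == null ? 3 : Integer.parseInt(value.trim());
    }
}
```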
[jira] [Commented] (NUTCH-2521) SitemapProcessor to use property sitemap.redir.max
[ https://issues.apache.org/jira/browse/NUTCH-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386021#comment-16386021 ]

ASF GitHub Bot commented on NUTCH-2521:
---------------------------------------

sebastian-nagel opened a new pull request #289: NUTCH-2521 SitemapProcessor to use property sitemap.redir.max URL: https://github.com/apache/nutch/pull/289
[jira] [Commented] (NUTCH-2520) Wrong Accept-Charset sent when http.accept.charset is not defined
[ https://issues.apache.org/jira/browse/NUTCH-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386018#comment-16386018 ]

ASF GitHub Bot commented on NUTCH-2520:
---------------------------------------

sebastian-nagel opened a new pull request #288: NUTCH-2520 Use default value for Accept-Charset URL: https://github.com/apache/nutch/pull/288 if http.accept.charset is undefined
[jira] [Created] (NUTCH-2520) Wrong Accept-Charset sent when http.accept.charset is not defined
Sebastian Nagel created NUTCH-2520:
--------------------------------------
             Summary: Wrong Accept-Charset sent when http.accept.charset is not defined
                 Key: NUTCH-2520
                 URL: https://issues.apache.org/jira/browse/NUTCH-2520
             Project: Nutch
          Issue Type: Bug
          Components: protocol
    Affects Versions: 1.14
            Reporter: Sebastian Nagel
             Fix For: 1.15

When the property http.accept.charset is not defined, the value of the "Accept" field is sent instead of the hard-wired default {{utf-8,iso-8859-1;q=0.7,*;q=0.7}}. Introduced by NUTCH-2376 ([HttpBase|https://github.com/apache/nutch/pull/186/files#diff-432a58c46ab1e686ef05a84cace29790R164]).
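The corrected lookup can be sketched as below. This is not the actual HttpBase code; java.util.Properties stands in for Nutch's configuration object, and only the property name and default string are taken from the issue.

```java
import java.util.Properties;

public class AcceptHeaders {
    // Hard-wired default quoted in the issue.
    static final String ACCEPT_CHARSET_DEFAULT =
        "utf-8,iso-8859-1;q=0.7,*;q=0.7";

    // Corrected lookup: when http.accept.charset is not defined, fall
    // back to the charset-specific default, not to the default of the
    // unrelated "Accept" header (the reported bug).
    static String acceptCharset(Properties conf) {
        return conf.getProperty("http.accept.charset", ACCEPT_CHARSET_DEFAULT);
    }
}
```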
[jira] [Commented] (NUTCH-2519) Log mapreduce job counters in local mode
[ https://issues.apache.org/jira/browse/NUTCH-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16386009#comment-16386009 ]

ASF GitHub Bot commented on NUTCH-2519:
---------------------------------------

sebastian-nagel opened a new pull request #287: NUTCH-2519 Log mapreduce job messages and counters in local mode URL: https://github.com/apache/nutch/pull/287
[jira] [Created] (NUTCH-2519) Log mapreduce job counters in local mode
Sebastian Nagel created NUTCH-2519:
--------------------------------------
             Summary: Log mapreduce job counters in local mode
                 Key: NUTCH-2519
                 URL: https://issues.apache.org/jira/browse/NUTCH-2519
             Project: Nutch
          Issue Type: Improvement
    Affects Versions: 1.14, 2.3.1
            Reporter: Sebastian Nagel
             Fix For: 2.4, 1.15

A simple change in the log4j.properties would make the Hadoop job counters appear in the hadoop.log also in local mode:
{noformat}
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
{noformat}
This may provide useful information for debugging, especially if counters are not explicitly logged by tools (see [@user|https://lists.apache.org/thread.html/1dd5410b479bd536fb3df98612db4b832cd0a97533099b0dc632eba9@%3Cuser.nutch.apache.org%3E]). This would also make the output more similar to (pseudo)distributed mode, where Nutch is called via {{hadoop jar}} and job counters and progress info are always logged.
[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()
[ https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385963#comment-16385963 ]

Sebastian Nagel commented on NUTCH-2518:
----------------------------------------

It seems to affect all 25 occurrences of
{code:java}
int complete = job.waitForCompletion(true)?0:1;{code}
[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()
[ https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385960#comment-16385960 ]

Sebastian Nagel commented on NUTCH-2518:
----------------------------------------

[~kamaci]: wasn't this part of your PR for NUTCH-2375 (maybe a commit was lost)?
[jira] [Created] (NUTCH-2518) Must check return value of job.waitForCompletion()
Sebastian Nagel created NUTCH-2518:
--------------------------------------
             Summary: Must check return value of job.waitForCompletion()
                 Key: NUTCH-2518
                 URL: https://issues.apache.org/jira/browse/NUTCH-2518
             Project: Nutch
          Issue Type: Bug
          Components: crawldb, fetcher, generator, hostdb, linkdb
    Affects Versions: 1.15
            Reporter: Sebastian Nagel
             Fix For: 1.15

The return value of job.waitForCompletion() of the new MapReduce API (NUTCH-2375) must always be checked. If it's not true, the job has failed or been killed. Accordingly, the program
- should not proceed with further jobs/steps
- must clean up temporary data, unlock the CrawlDB, etc.
- must exit with a non-zero exit value, so that scripts running the crawl workflow can handle the failure

Cf. NUTCH-2076, NUTCH-2442, [NUTCH-2375 PR #221|https://github.com/apache/nutch/pull/221#issuecomment-332941883].
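The handling the issue asks for could look like the sketch below. FakeJob is a stub standing in for org.apache.hadoop.mapreduce.Job, and the cleanup step is schematic; this is not the actual Nutch tool code.

```java
// Sketch of the required pattern: check the boolean result of
// waitForCompletion() and fail fast instead of continuing the workflow.
public class JobCompletionCheck {
    // Stub for org.apache.hadoop.mapreduce.Job, kept minimal so the
    // example is self-contained.
    interface FakeJob {
        boolean waitForCompletion(boolean verbose) throws Exception;
    }

    static int runTool(FakeJob job) {
        try {
            if (!job.waitForCompletion(true)) {
                cleanup();  // job failed or was killed
                return 1;   // non-zero so crawl scripts can react
            }
            return 0;
        } catch (Exception e) {
            cleanup();
            return 1;
        }
    }

    static void cleanup() {
        // schematic: remove temporary data, unlock the CrawlDB, ...
    }
}
```

The point is that the boolean is inspected and mapped to the tool's exit value, rather than being discarded as in the `int complete = job.waitForCompletion(true)?0:1;` occurrences noted in the comments.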
[jira] [Updated] (NUTCH-2510) Crawl script modification. HostDb: generate, optional usage and description
[ https://issues.apache.org/jira/browse/NUTCH-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2510:
-----------------------------------
    Fix Version/s:     (was: 1.14)
                   1.15

> Crawl script modification. HostDb: generate, optional usage and description
>
>                 Key: NUTCH-2510
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2510
>             Project: Nutch
>          Issue Type: Improvement
>          Components: bin
>    Affects Versions: 1.15
>            Reporter: Semyon Semyonov
>            Priority: Minor
>             Fix For: 1.15
>
> The crawl script now includes the hostdb update as part of the crawling cycle, but:
> 1) there is no hostdb parameter for generate;
> 2) generation of the hostdb is not optional, so the hostdb is regenerated on each step without asking the user; it should be an optional parameter;
> 3) a description of 1) and 2) is missing.
[jira] [Commented] (NUTCH-2310) Protocol-Selenium does not support HTTPS protocol
[ https://issues.apache.org/jira/browse/NUTCH-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16385853#comment-16385853 ]

Sebastian Nagel commented on NUTCH-2310:
----------------------------------------

The [plugin.xml|https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/plugin.xml] must also list https as a supported protocol. That's done by adding: {noformat} {noformat} But it's likely that more changes are needed to fully support https.

> Protocol-Selenium does not support HTTPS protocol
>
>                 Key: NUTCH-2310
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2310
>             Project: Nutch
>          Issue Type: Bug
>          Components: protocol
>    Affects Versions: 1.12
>            Reporter: Joey Hong
>            Priority: Major
>              Labels: easyfix
>             Fix For: 1.15
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The protocol-selenium and protocol-interactiveselenium plugins raise errors whenever there is a URL with the HTTPS protocol.
> From the source code for those plugins, we can see that HTTP is the only scheme currently accepted, which makes Nutch unable to crawl HTTPS sites with JS using Selenium Webdrivers.