[jira] [Commented] (NUTCH-2517) mergesegs corrupts segment data

2018-03-05 Thread Lewis John McGibbney (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386469#comment-16386469
 ] 

Lewis John McGibbney commented on NUTCH-2517:
-

Thank you [~mebbinghaus] for reporting. This appears to be a major bug and 
hence a blocker for the next release. I will begin work on a solution ASAP.
FYI [~omkar20895], this is post the Hadoop upgrade.

> mergesegs corrupts segment data
> ---
>
> Key: NUTCH-2517
> URL: https://issues.apache.org/jira/browse/NUTCH-2517
> Project: Nutch
>  Issue Type: Bug
>  Components: segment
>Affects Versions: 1.15
> Environment: xubuntu 17.10, docker container of apache/nutch LATEST
>Reporter: Marco Ebbinghaus
>Priority: Blocker
>  Labels: mapreduce, mergesegs
> Fix For: 1.15
>
> Attachments: Screenshot_2018-03-03_18-09-28.png
>
>
> The problem probably occurs since commit 
> [https://github.com/apache/nutch/commit/54510e503f7da7301a59f5f0e5bf4509b37d35b4]
> How to reproduce:
>  * create container from apache/nutch image (latest)
>  * open terminal in that container
>  * set http.agent.name
>  * create crawldir and urls file
>  * run bin/nutch inject (bin/nutch inject mycrawl/crawldb urls/urls)
>  * run bin/nutch generate (bin/nutch generate mycrawl/crawldb 
> mycrawl/segments 1)
>  ** this results in a segment (e.g. 20180304134215)
>  * run bin/nutch fetch (bin/nutch fetch mycrawl/segments/20180304134215 
> -threads 2)
>  * run bin/nutch parse (bin/nutch parse mycrawl/segments/20180304134215 
> -threads 2)
>  ** ls in the segment folder -> existing folders: content, crawl_fetch, 
> crawl_generate, crawl_parse, parse_data, parse_text
>  * run bin/nutch updatedb (bin/nutch updatedb mycrawl/crawldb 
> mycrawl/segments/20180304134215)
>  * run bin/nutch mergesegs (bin/nutch mergesegs mycrawl/MERGEDsegments 
> mycrawl/segments/* -filter)
>  ** console output: `SegmentMerger: using segment data from: content 
> crawl_generate crawl_fetch crawl_parse parse_data parse_text`
>  ** resulting segment: 20180304134535
>  * ls in mycrawl/MERGEDsegments/segment/20180304134535 -> only existing 
> folder: crawl_generate
>  * run bin/nutch invertlinks (bin/nutch invertlinks mycrawl/linkdb -dir 
> mycrawl/MERGEDsegments) which results in a consequential error
>  ** console output: `LinkDb: adding segment: 
> file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535
>  LinkDb: org.apache.hadoop.mapreduce.lib.input.InvalidInputException: Input 
> path does not exist: 
> file:/root/nutch_source/runtime/local/mycrawl/MERGEDsegments/20180304134535/parse_data
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:323)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:265)
>      at 
> org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:59)
>      at 
> org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:387)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeNewSplits(JobSubmitter.java:301)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.writeSplits(JobSubmitter.java:318)
>      at 
> org.apache.hadoop.mapreduce.JobSubmitter.submitJobInternal(JobSubmitter.java:196)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1290)
>      at org.apache.hadoop.mapreduce.Job$10.run(Job.java:1287)
>      at java.security.AccessController.doPrivileged(Native Method)
>      at javax.security.auth.Subject.doAs(Subject.java:422)
>      at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1746)
>      at org.apache.hadoop.mapreduce.Job.submit(Job.java:1287)
>      at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:1308)
>      at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:224)
>      at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:353)
>      at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>      at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:313)`
> So it seems that mapreduce corrupts the segment folder during the mergesegs 
> command.
>  
> Note that this issue is not limited to merging a single segment as described 
> above. As you can see in the attached screenshot, the problem also appears 
> when executing multiple bin/nutch generate/fetch/parse/updatedb commands 
> before executing mergesegs - resulting in a segment count > 1.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (NUTCH-2517) mergesegs corrupts segment data

2018-03-05 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2517:

Priority: Blocker  (was: Major)



[jira] [Commented] (NUTCH-2519) Log mapreduce job counters in local mode

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386339#comment-16386339
 ] 

ASF GitHub Bot commented on NUTCH-2519:
---

lewismc commented on issue #287: NUTCH-2519 Log mapreduce job messages and 
counters in local mode
URL: https://github.com/apache/nutch/pull/287#issuecomment-370484982
 
 
   +1


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Log mapreduce job counters in local mode
> 
>
> Key: NUTCH-2519
> URL: https://issues.apache.org/jira/browse/NUTCH-2519
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 2.3.1, 1.14
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 2.4, 1.15
>
>
> A simple change in the log4j.properties would make the Hadoop job counters 
> appear in the hadoop.log also in local mode:
> {noformat}
> log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
> {noformat}
> This may provide useful information for debugging, esp. if counters are not 
> explicitly logged by tools (see 
> [@user|https://lists.apache.org/thread.html/1dd5410b479bd536fb3df98612db4b832cd0a97533099b0dc632eba9@%3Cuser.nutch.apache.org%3E]).
>  This would also make the output more similar to (pseudo)distributed mode 
> (Nutch is called via {{hadoop jar}}), where job counters and progress info 
> are always logged.





[jira] [Commented] (NUTCH-2520) Wrong Accept-Charset sent when http.accept.charset is not defined

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386336#comment-16386336
 ] 

ASF GitHub Bot commented on NUTCH-2520:
---

lewismc commented on issue #288: NUTCH-2520 Use default value for Accept-Charset
URL: https://github.com/apache/nutch/pull/288#issuecomment-370484879
 
 
   +1




> Wrong Accept-Charset sent when http.accept.charset is not defined
> -
>
> Key: NUTCH-2520
> URL: https://issues.apache.org/jira/browse/NUTCH-2520
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> When the property http.accept.charset is not defined, the value of the 
> "Accept" field is used instead of the hard-wired default value 
> {{utf-8,iso-8859-1;q=0.7,*;q=0.7}}. Introduced by NUTCH-2376 
> ([HttpBase|https://github.com/apache/nutch/pull/186/files#diff-432a58c46ab1e686ef05a84cace29790R164]).





[jira] [Commented] (NUTCH-2521) SitemapProcessor to use property sitemap.redir.max

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386334#comment-16386334
 ] 

ASF GitHub Bot commented on NUTCH-2521:
---

lewismc commented on issue #289: NUTCH-2521 SitemapProcessor to use property 
sitemap.redir.max
URL: https://github.com/apache/nutch/pull/289#issuecomment-370484703
 
 
   +1




> SitemapProcessor to use property sitemap.redir.max
> --
>
> Key: NUTCH-2521
> URL: https://issues.apache.org/jira/browse/NUTCH-2521
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.15
>
>
> SitemapProcessor isn't actually using the property sitemap.redir.max 
> (NUTCH-2466); instead, the maximum number of redirects is hardwired (=3).





[jira] [Updated] (NUTCH-2523) UpdateHostDB blocks plugins unintentionally

2018-03-05 Thread Yossi Tamari (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yossi Tamari updated NUTCH-2523:

Attachment: NUTCH-2523.tamari.180305.patch.txt

> UpdateHostDB blocks plugins unintentionally
> --
>
> Key: NUTCH-2523
> URL: https://issues.apache.org/jira/browse/NUTCH-2523
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.14
>Reporter: Yossi Tamari
>Priority: Major
> Attachments: NUTCH-2523.tamari.180305.patch.txt
>
>
> UpdateHostDB blocks the use of urlnormalizer-host and 
> urlfilter-domainblacklist (it checks whether they are configured and throws 
> an exception) without any good reason.
> Quoting Markus: "I simply reused the job setup code and forgot to remove that 
> check. You can safely remove that check in HostDB."





[jira] [Created] (NUTCH-2523) UpdateHostDB blocks plugins unintentionally

2018-03-05 Thread Yossi Tamari (JIRA)
Yossi Tamari created NUTCH-2523:
---

 Summary: UpdateHostDB blocks plugins unintentionally
 Key: NUTCH-2523
 URL: https://issues.apache.org/jira/browse/NUTCH-2523
 Project: Nutch
  Issue Type: Bug
  Components: hostdb
Affects Versions: 1.14
Reporter: Yossi Tamari


UpdateHostDB blocks the use of urlnormalizer-host and urlfilter-domainblacklist 
(it checks whether they are configured and throws an exception) without any 
good reason.

Quoting Markus: "I simply reused the job setup code and forgot to remove that 
check. You can safely remove that check in HostDB."





[jira] [Created] (NUTCH-2522) Bidirectional URL exemption filter

2018-03-05 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2522:
--

 Summary:  Bidirectional URL exemption filter
 Key: NUTCH-2522
 URL: https://issues.apache.org/jira/browse/NUTCH-2522
 Project: Nutch
  Issue Type: Improvement
  Components: plugin
Reporter: Semyon Semyonov


The current Nutch URL exemption plugin exempts based on toUrl only; the new 
plugin uses both fromUrl and toUrl and, after the regex transformation, exempts 
based on the condition regex(fromUrl) == regex(toUrl).

This approach allows us to perform more complex URL exemption filter checks, 
such as allowing links like http://www.website.com/home -> 
http://website.com/about (with/without www).
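The proposed condition can be sketched as follows. This is a minimal,
self-contained illustration, not the plugin's actual code: the normalization
regex (stripping a leading "www.") and the helper names are hypothetical
stand-ins for whatever configurable regex the plugin would use.

```java
import java.util.regex.Pattern;

public class BidirectionalExemptionSketch {

    // Hypothetical normalization: drop a leading "www." after the scheme.
    private static final Pattern WWW = Pattern.compile("^(https?://)www\\.");

    static String normalize(String url) {
        return WWW.matcher(url).replaceFirst("$1");
    }

    // Keep only the host part of an already-normalized URL.
    static String host(String url) {
        String noScheme = url.replaceFirst("^https?://", "");
        int slash = noScheme.indexOf('/');
        return slash < 0 ? noScheme : noScheme.substring(0, slash);
    }

    // Exempt the link only when both URLs agree after the transformation,
    // i.e. regex(fromUrl) == regex(toUrl).
    static boolean exempt(String fromUrl, String toUrl) {
        return host(normalize(fromUrl)).equals(host(normalize(toUrl)));
    }

    public static void main(String[] args) {
        // www vs. non-www variant of the same site is exempted
        System.out.println(exempt("http://www.website.com/home",
                                  "http://website.com/about"));
    }
}
```

With this transformation, links between the www and non-www variants of the
same host pass the check, while links to a different host do not.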





[no subject]

2018-03-05 Thread Muhammet Dinç



[jira] [Created] (NUTCH-2521) SitemapProcessor to use property sitemap.redir.max

2018-03-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2521:
--

 Summary: SitemapProcessor to use property sitemap.redir.max
 Key: NUTCH-2521
 URL: https://issues.apache.org/jira/browse/NUTCH-2521
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.15


SitemapProcessor isn't actually using the property sitemap.redir.max 
(NUTCH-2466); instead, the maximum number of redirects is hardwired (=3).





[jira] [Commented] (NUTCH-2521) SitemapProcessor to use property sitemap.redir.max

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386021#comment-16386021
 ] 

ASF GitHub Bot commented on NUTCH-2521:
---

sebastian-nagel opened a new pull request #289: NUTCH-2521 SitemapProcessor to 
use property sitemap.redir.max
URL: https://github.com/apache/nutch/pull/289
 
 







[jira] [Commented] (NUTCH-2520) Wrong Accept-Charset sent when http.accept.charset is not defined

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386018#comment-16386018
 ] 

ASF GitHub Bot commented on NUTCH-2520:
---

sebastian-nagel opened a new pull request #288: NUTCH-2520 Use default value 
for Accept-Charset
URL: https://github.com/apache/nutch/pull/288
 
 
if http.accept.charset is undefined






[jira] [Created] (NUTCH-2520) Wrong Accept-Charset sent when http.accept.charset is not defined

2018-03-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2520:
--

 Summary: Wrong Accept-Charset sent when http.accept.charset is not 
defined
 Key: NUTCH-2520
 URL: https://issues.apache.org/jira/browse/NUTCH-2520
 Project: Nutch
  Issue Type: Bug
  Components: protocol
Affects Versions: 1.14
Reporter: Sebastian Nagel
 Fix For: 1.15


When the property http.accept.charset is not defined, the value of the "Accept" 
field is used instead of the hard-wired default value 
{{utf-8,iso-8859-1;q=0.7,*;q=0.7}}. Introduced by NUTCH-2376 
([HttpBase|https://github.com/apache/nutch/pull/186/files#diff-432a58c46ab1e686ef05a84cace29790R164]).
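The intended lookup can be sketched like this. A plain Map stands in for
Hadoop's Configuration so the example is self-contained; the key names mirror
the Nutch properties named in the issue, but the method and class names are
hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

public class AcceptCharsetSketch {

    // Hard-wired default named in the issue.
    static final String DEFAULT_ACCEPT_CHARSET = "utf-8,iso-8859-1;q=0.7,*;q=0.7";

    // Correct behavior: fall back to the Accept-Charset default,
    // not to the value of the "Accept" property.
    static String acceptCharset(Map<String, String> conf) {
        return conf.getOrDefault("http.accept.charset", DEFAULT_ACCEPT_CHARSET);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("http.accept", "text/html,application/xhtml+xml");
        // http.accept.charset is undefined -> the hard-wired default is used
        System.out.println(acceptCharset(conf));
    }
}
```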





[jira] [Commented] (NUTCH-2519) Log mapreduce job counters in local mode

2018-03-05 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2519?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16386009#comment-16386009
 ] 

ASF GitHub Bot commented on NUTCH-2519:
---

sebastian-nagel opened a new pull request #287: NUTCH-2519 Log mapreduce job 
messages and counters in local mode
URL: https://github.com/apache/nutch/pull/287
 
 
   






[jira] [Created] (NUTCH-2519) Log mapreduce job counters in local mode

2018-03-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2519:
--

 Summary: Log mapreduce job counters in local mode
 Key: NUTCH-2519
 URL: https://issues.apache.org/jira/browse/NUTCH-2519
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.14, 2.3.1
Reporter: Sebastian Nagel
 Fix For: 2.4, 1.15


A simple change in the log4j.properties would make the Hadoop job counters 
appear in the hadoop.log also in local mode:
{noformat}
log4j.logger.org.apache.hadoop.mapreduce.Job=INFO
{noformat}
This may provide useful information for debugging, esp. if counters are not 
explicitly logged by tools (see 
[@user|https://lists.apache.org/thread.html/1dd5410b479bd536fb3df98612db4b832cd0a97533099b0dc632eba9@%3Cuser.nutch.apache.org%3E]).
 This would also make the output more similar to (pseudo)distributed mode 
(Nutch is called via {{hadoop jar}}), where job counters and progress info are 
always logged.





[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()

2018-03-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385963#comment-16385963
 ] 

Sebastian Nagel commented on NUTCH-2518:


It seems to affect all 25 occurrences of
{code:java}
int complete = job.waitForCompletion(true)?0:1;{code}
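The required pattern for those call sites can be sketched as follows. A
stand-in Job class replaces org.apache.hadoop.mapreduce.Job here so the control
flow is self-contained and runnable; in Hadoop, waitForCompletion(true)
submits the job and returns false if it failed or was killed.

{code:java}
public class WaitForCompletionSketch {

    // Stand-in for a Hadoop MapReduce job (hypothetical, for illustration).
    static class Job {
        private final boolean succeeds;
        Job(boolean succeeds) { this.succeeds = succeeds; }
        boolean waitForCompletion(boolean verbose) { return succeeds; }
    }

    // Instead of ignoring the result, check it, clean up, and signal failure.
    static int runJob(Job job) {
        boolean success = job.waitForCompletion(true);
        if (!success) {
            // clean up temporary data, unlock the CrawlDB, etc. here
            return 1;  // non-zero so crawl scripts can detect the failure
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(runJob(new Job(true)));   // successful job
        System.out.println(runJob(new Job(false)));  // failed or killed job
    }
}
{code}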

> Must check return value of job.waitForCompletion()
> --
>
> Key: NUTCH-2518
> URL: https://issues.apache.org/jira/browse/NUTCH-2518
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb, fetcher, generator, hostdb, linkdb
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.15
>
>
> The return value of job.waitForCompletion() of the new MapReduce API 
> (NUTCH-2375) must always be checked. If it's not true, the job has failed or 
> been killed. Accordingly, the program
> - should not proceed with further jobs/steps
> - must clean up temporary data, unlock the CrawlDB, etc.
> - must exit with a non-zero exit value, so that scripts running the crawl 
> workflow can handle the failure
> Cf. NUTCH-2076, NUTCH-2442, [NUTCH-2375 PR 
> #221|https://github.com/apache/nutch/pull/221#issuecomment-332941883].





[jira] [Commented] (NUTCH-2518) Must check return value of job.waitForCompletion()

2018-03-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385960#comment-16385960
 ] 

Sebastian Nagel commented on NUTCH-2518:


[~kamaci]: wasn't this part of your PR for NUTCH-2375 (maybe a commit was 
lost)?



[jira] [Created] (NUTCH-2518) Must check return value of job.waitForCompletion()

2018-03-05 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2518:
--

 Summary: Must check return value of job.waitForCompletion()
 Key: NUTCH-2518
 URL: https://issues.apache.org/jira/browse/NUTCH-2518
 Project: Nutch
  Issue Type: Bug
  Components: crawldb, fetcher, generator, hostdb, linkdb
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.15


The return value of job.waitForCompletion() of the new MapReduce API 
(NUTCH-2375) must always be checked. If it's not true, the job has failed or 
been killed. Accordingly, the program
- should not proceed with further jobs/steps
- must clean up temporary data, unlock the CrawlDB, etc.
- must exit with a non-zero exit value, so that scripts running the crawl 
workflow can handle the failure

Cf. NUTCH-2076, NUTCH-2442, [NUTCH-2375 PR 
#221|https://github.com/apache/nutch/pull/221#issuecomment-332941883].





[jira] [Updated] (NUTCH-2510) Crawl script modification. HostDb : generate, optional usage and description

2018-03-05 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2510:
---
Fix Version/s: (was: 1.14)
   1.15

> Crawl script modification. HostDb : generate, optional usage and description
> 
>
> Key: NUTCH-2510
> URL: https://issues.apache.org/jira/browse/NUTCH-2510
> Project: Nutch
>  Issue Type: Improvement
>  Components: bin
>Affects Versions: 1.15
>Reporter: Semyon Semyonov
>Priority: Minor
> Fix For: 1.15
>
>
> The crawl script now includes a hostdb update as part of the crawling cycle, but:
> 1) There is no hostdb parameter for generate
> 2) Generation of hostdb is not optional, therefore hostdb is generated on 
> each step without asking the user. It should be an optional parameter.
> 3) A description of 1) and 2) is missing.





[jira] [Commented] (NUTCH-2310) Protocol-Selenium does not support HTTPS protocol

2018-03-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16385853#comment-16385853
 ] 

Sebastian Nagel commented on NUTCH-2310:


The 
[plugin.xml|https://github.com/apache/nutch/blob/master/src/plugin/protocol-selenium/plugin.xml]
 must also list https as a supported protocol. That's done by adding:
{noformat}
  
   
  
{noformat}
But it's likely that more changes are needed to fully support https.

> Protocol-Selenium does not support HTTPS protocol
> -
>
> Key: NUTCH-2310
> URL: https://issues.apache.org/jira/browse/NUTCH-2310
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.12
>Reporter: Joey Hong
>Priority: Major
>  Labels: easyfix
> Fix For: 1.15
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The protocol-selenium and protocol-interactiveselenium plugins raise errors 
> whenever there is a URL with the HTTPS protocol.
>  From the source code for those plugins, we can see that HTTP is the only 
> scheme currently accepted, which makes Nutch unable to crawl HTTPS sites with 
> JS using Selenium Webdrivers. 


