[jira] [Commented] (NUTCH-2523) UpdateHostDB blocks usage of plugins unintentionally

2018-03-19 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404934#comment-16404934
 ] 

Hudson commented on NUTCH-2523:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3510 (See 
[https://builds.apache.org/job/Nutch-trunk/3510/])
NUTCH-2523 UpdateHostDB blocks usage of plugins unintentionally (snagel: 
[https://github.com/apache/nutch/commit/31819b781ea7fa7187e04b27f3146a98eab46601])
* (edit) src/java/org/apache/nutch/hostdb/UpdateHostDb.java


> UpdateHostDB blocks usage of plugins unintentionally
> 
>
> Key: NUTCH-2523
> URL: https://issues.apache.org/jira/browse/NUTCH-2523
> Project: Nutch
>  Issue Type: Bug
>  Components: hostdb
>Affects Versions: 1.14
>Reporter: Yossi Tamari
>Priority: Major
> Fix For: 1.15
>
> Attachments: NUTCH-2523.tamari.180305.patch.txt
>
>
> UpdateHostDB blocks the use of urlnormalizer-host and 
> urlfilter-domainblacklist (it checks whether they are configured and throws an 
> exception) without any good reason.
> Quoting Markus: "I simply reused the job setup code and forgot to remove that 
> check. You can safely remove that check in HostDB."
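
For illustration only, a guard of the kind described above might look like the sketch below. The class name, method name, and exception type are hypothetical, and the real check in UpdateHostDb.java may be written differently; the sketch also assumes that the list of active plugins is available as a single string, as in the plugin.includes property.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not the actual UpdateHostDb source.
class HostDbPluginGuardSketch {
  // The two plugins named in the issue description.
  static final List<String> BLOCKED =
      Arrays.asList("urlnormalizer-host", "urlfilter-domainblacklist");

  // Assumption: pluginIncludes is the configured plugin list as one string.
  static void rejectBlockedPlugins(String pluginIncludes) {
    for (String plugin : BLOCKED) {
      if (pluginIncludes.contains(plugin)) {
        // Aborts job setup as soon as one of the named plugins is configured.
        throw new IllegalStateException(plugin + " must not be enabled");
      }
    }
  }

  public static void main(String[] args) {
    try {
      rejectBlockedPlugins("protocol-http|urlnormalizer-host");
    } catch (IllegalStateException e) {
      System.out.println("job setup aborted: " + e.getMessage());
    }
  }
}
```

With a guard like this in the job setup, any configuration that enables one of the two plugins fails before the job starts, which is the behaviour the issue asks to remove.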



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Resolved] (NUTCH-2523) UpdateHostDB blocks usage of plugins unintentionally

2018-03-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2523.

Resolution: Fixed

Committed, 
[31819b7|https://github.com/apache/nutch/commit/31819b781ea7fa7187e04b27f3146a98eab46601].
 Thanks!



[jira] [Updated] (NUTCH-2523) UpdateHostDB blocks usage of plugins unintentionally

2018-03-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2523:
---
Summary: UpdateHostDB blocks usage of plugins unintentionally  (was: 
UpdateHostDB blocks usage of plugins unintenionally)



[jira] [Commented] (NUTCH-2509) Inconsistent behavior in SitemapProcessor

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404890#comment-16404890
 ] 

ASF GitHub Bot commented on NUTCH-2509:
---

sebastian-nagel opened a new pull request #301: NUTCH-2509 Apply URL 
filters/normalizers also to URLs of
URL: https://github.com/apache/nutch/pull/301
 
 
   - subsitemaps from a sitemap index (contributed by Yossi Tamari)
   - sitemaps referenced in robots.txt


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Inconsistent behavior in SitemapProcessor
> -
>
> Key: NUTCH-2509
> URL: https://issues.apache.org/jira/browse/NUTCH-2509
> Project: Nutch
>  Issue Type: Bug
>  Components: sitemap
>Affects Versions: 1.14
>Reporter: Yossi Tamari
>Priority: Minor
> Fix For: 1.15
>
> Attachments: SitemapProcessor.patch
>
>
> There are two inconsistent behaviors in SitemapProcessor:
>  # There is a member variable maxRedir that is supposed to limit the number 
> of redirections on sitemap URLs, and it is initialized from config property 
> sitemap.redir.max, but it is ignored in the code because a local variable 
> with the same name is defined in the relevant method, and is always set to 3.
>  # When a sitemap URL goes through a redirect, it is filtered and normalized. 
> However, if a sitemap URL comes from a sitemapindex, it is not. This seems 
> inconsistent, as in both cases we have a URL from an outside source.
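
The shadowing problem from item 1 can be reproduced in isolation. The sketch below is not the SitemapProcessor source: the class and method names are hypothetical, java.util.Properties stands in for the Nutch configuration, and the fallback value of 3 is an assumption chosen to match the hard-coded local value mentioned in the issue.

```java
import java.util.Properties;

// Hypothetical stand-in, not the actual SitemapProcessor source.
class ShadowedRedirectLimit {
  // Member initialised from configuration, as described in the issue.
  private final int maxRedir;

  ShadowedRedirectLimit(Properties conf) {
    // Assumption: fall back to 3 when sitemap.redir.max is not set.
    this.maxRedir = Integer.parseInt(conf.getProperty("sitemap.redir.max", "3"));
  }

  // Buggy variant: the local declaration hides the member of the same name.
  int effectiveLimitBuggy() {
    int maxRedir = 3; // shadows this.maxRedir, so the configured value is ignored
    return maxRedir;
  }

  // Fixed variant: no local declaration, the configured member is used.
  int effectiveLimitFixed() {
    return maxRedir;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty("sitemap.redir.max", "5");
    ShadowedRedirectLimit limit = new ShadowedRedirectLimit(conf);
    System.out.println("buggy: " + limit.effectiveLimitBuggy()); // prints "buggy: 3"
    System.out.println("fixed: " + limit.effectiveLimitFixed()); // prints "fixed: 5"
  }
}
```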





[jira] [Commented] (NUTCH-2509) Inconsistent behavior in SitemapProcessor

2018-03-19 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404889#comment-16404889
 ] 

Sebastian Nagel commented on NUTCH-2509:


Thanks, [~yossi]! The redir issue is already fixed in NUTCH-2521 (sorry, I 
hadn't seen this issue). I've also found that the URLs of sitemaps referenced 
in the robots.txt are not filtered/normalized. I'll open a PR to address this as 
well.



[jira] [Updated] (NUTCH-2509) Inconsistent behavior in SitemapProcessor

2018-03-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2509?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2509:
---
Fix Version/s: 1.15



[jira] [Updated] (NUTCH-2523) UpdateHostDB blocks usage of plugins unintenionally

2018-03-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2523:
---
Summary: UpdateHostDB blocks usage of plugins unintenionally  (was: 
UpdateHostDB blocked by plugins unintenionally)



[jira] [Commented] (NUTCH-2523) UpdateHostDB blocked by plugins unintenionally

2018-03-19 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404877#comment-16404877
 ] 

Sebastian Nagel commented on NUTCH-2523:


Thanks, [~yossi], will commit the patch shortly!



[jira] [Updated] (NUTCH-2523) UpdateHostDB blocked by plugins unintenionally

2018-03-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2523:
---
Summary: UpdateHostDB blocked by plugins unintenionally  (was: UpdateHostDB 
blocks plugins unintenionally)



[jira] [Updated] (NUTCH-2523) UpdateHostDB blocks plugins unintenionally

2018-03-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2523:
---
Fix Version/s: 1.15



[jira] [Commented] (NUTCH-2539) Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404835#comment-16404835
 ] 

ASF GitHub Bot commented on NUTCH-2539:
---

okedoki commented on issue #300: NUTCH-2539
URL: https://github.com/apache/nutch/pull/300#issuecomment-374218894
 
 
   @sebastian-nagel 
   done


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Not correct naming of db.url.filters and db.url.normalizers in 
> nutch-default.xml
> 
>
> Key: NUTCH-2539
> URL: https://issues.apache.org/jira/browse/NUTCH-2539
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.15
>Reporter: Semyon Semyonov
>Priority: Major
>
> There is a mismatch between config and code.
> In code, in CrawlDbFilter, lines 41-43:
> > public static final String URL_FILTERING = "crawldb.url.filters";
> > public static final String URL_NORMALIZING = "crawldb.url.normalizers";
> > public static final String URL_NORMALIZING_SCOPE = 
> > "crawldb.url.normalizers.scope";
>  
> In nutch-default.xml:
> > <property>
> >   <name>db.url.normalizers</name>
> >   <value>false</value>
> >   <description>Normalize urls when updating crawldb</description>
> > </property>
> >
> > <property>
> >   <name>db.url.filters</name>
> >   <value>false</value>
> >   <description>Filter urls when updating crawldb</description>
> > </property>
> These properties should be in line with the code.
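
The effect of the mismatch can be shown with a small, self-contained sketch. It uses java.util.Properties as a stand-in for the Nutch configuration, and the getBoolean helper and class name are hypothetical; only the two constant values and the two db.url.* names come from the issue text. The sketch assumes the code falls back to false when a key is missing.

```java
import java.util.Properties;

// Hypothetical stand-in for the configuration lookup, not Nutch code.
class ConfigKeyMismatch {
  // Constant values as quoted from CrawlDbFilter in the issue.
  static final String URL_FILTERING = "crawldb.url.filters";
  static final String URL_NORMALIZING = "crawldb.url.normalizers";

  // Assumption: a missing key yields the supplied default.
  static boolean getBoolean(Properties conf, String key, boolean defaultValue) {
    String value = conf.getProperty(key);
    return value == null ? defaultValue : Boolean.parseBoolean(value);
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    // A user follows nutch-default.xml and sets the documented names.
    conf.setProperty("db.url.filters", "true");
    conf.setProperty("db.url.normalizers", "true");

    // The code reads different names, so both lookups fall back to false.
    System.out.println(getBoolean(conf, URL_FILTERING, false));   // prints "false"
    System.out.println(getBoolean(conf, URL_NORMALIZING, false)); // prints "false"

    // Setting the name the code actually reads takes effect.
    conf.setProperty(URL_FILTERING, "true");
    System.out.println(getBoolean(conf, URL_FILTERING, false));   // prints "true"
  }
}
```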





[jira] [Commented] (NUTCH-2539) Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404804#comment-16404804
 ] 

ASF GitHub Bot commented on NUTCH-2539:
---

sebastian-nagel commented on a change in pull request #300: NUTCH-2539
URL: https://github.com/apache/nutch/pull/300#discussion_r175433990
 
 

 ##
 File path: conf/nutch-default.xml
 ##
 @@ -548,15 +548,18 @@
 
 
 
-  <name>db.url.normalizers</name>
+  <name>crawldb.url.normalizers</name>
   <value>false</value>
 
 Review comment:
   crawldb.url.normalizers is also overwritten from command-line (`updatedb ... 
-normalize`).
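
A rough sketch of what "overwritten from command-line" means in practice: the tool starts from the configured value and then sets the property again from its flags before the job runs. Everything below is illustrative and not the updatedb implementation. The class and method names are hypothetical, java.util.Properties stands in for the job configuration, and the `-filter` flag is assumed by analogy with `-normalize`.

```java
import java.util.Properties;

// Hypothetical sketch of a command-line override, not the updatedb source.
class CommandLineOverrideSketch {
  static final String URL_NORMALIZING = "crawldb.url.normalizers";
  static final String URL_FILTERING = "crawldb.url.filters";

  // Returns a copy of conf in which both properties are (re)written
  // from the configured defaults plus any flags found in args.
  static Properties applyArgs(Properties conf, String[] args) {
    boolean normalize = Boolean.parseBoolean(conf.getProperty(URL_NORMALIZING, "false"));
    boolean filter = Boolean.parseBoolean(conf.getProperty(URL_FILTERING, "false"));
    for (String arg : args) {
      if ("-normalize".equals(arg)) {
        normalize = true;
      } else if ("-filter".equals(arg)) { // assumed flag, by analogy
        filter = true;
      }
    }
    Properties job = new Properties();
    job.putAll(conf);
    // The job always sees the value derived from the command line.
    job.setProperty(URL_NORMALIZING, Boolean.toString(normalize));
    job.setProperty(URL_FILTERING, Boolean.toString(filter));
    return job;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(URL_NORMALIZING, "false");
    Properties job = applyArgs(conf, new String[] {"-normalize"});
    System.out.println(job.getProperty(URL_NORMALIZING)); // prints "true"
    System.out.println(job.getProperty(URL_FILTERING));   // prints "false"
  }
}
```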


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (NUTCH-2539) Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml

2018-03-19 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2539?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16404798#comment-16404798
 ] 

ASF GitHub Bot commented on NUTCH-2539:
---

okedoki opened a new pull request #300: NUTCH-2539
URL: https://github.com/apache/nutch/pull/300
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




Re: Config issues with URL filters and normalizers in UpdateCrawlDb

2018-03-19 Thread Semyon Semyonov


Hi Sebastian,

No problems.

Here it is,

https://issues.apache.org/jira/browse/NUTCH-2539


Semyon.


Sent: Monday, March 19, 2018 at 2:02 PM
From: "Sebastian Nagel" 
To: dev@nutch.apache.org
Subject: Re: Config issues with URL filters and normalizers in UpdateCrawlDb

Hi Semyon,

sorry for the late answer. Yes, you're right, the naming in nutch-default.xml is wrong.
Please open a Jira issue to address this.

The description should also mention that the property
crawldb.url.filters is "temporary" and is set/overwritten by command-line options.
Cf. the overview (somewhat outdated) on
https://wiki.apache.org/nutch/NutchPropertiesCompleteList

Best,
Sebastian

On 02/19/2018 02:24 PM, Semyon Semyonov wrote:
> Gents,
>
> To use URL filters and normalizers in CrawlDBUpdate, three config settings may be used:
>
> In CrawlDbFilter, lines 41-43:
> public static final String URL_FILTERING = "crawldb.url.filters";
> public static final String URL_NORMALIZING = "crawldb.url.normalizers";
> public static final String URL_NORMALIZING_SCOPE = "crawldb.url.normalizers.scope";
>
>
> However, in nutch-default.xml we have different names:
> <property>
>   <name>db.url.normalizers</name>
>   <value>false</value>
>   <description>Normalize urls when updating crawldb</description>
> </property>
>
> <property>
>   <name>db.url.filters</name>
>   <value>false</value>
>   <description>Filter urls when updating crawldb</description>
> </property>
>
>
> Obviously, that is the reason why URLNormalizers/Filters don't work.
>
> Should I change CrawlDbFilter code to
> public static final String URL_FILTERING = "db.url.filters";
> public static final String URL_NORMALIZING = "db.url.normalizers";
> public static final String URL_NORMALIZING_SCOPE = "db.url.normalizers.scope";
>
>
> ?
>
> Semyon.
>
 





[jira] [Created] (NUTCH-2539) Not correct naming of db.url.filters and db.url.normalizers in nutch-default.xml

2018-03-19 Thread Semyon Semyonov (JIRA)
Semyon Semyonov created NUTCH-2539:
--

 Summary: Not correct naming of db.url.filters and 
db.url.normalizers in nutch-default.xml
 Key: NUTCH-2539
 URL: https://issues.apache.org/jira/browse/NUTCH-2539
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.15
Reporter: Semyon Semyonov


There is a mismatch between config and code.

In code, in CrawlDbFilter, lines 41-43:
> public static final String URL_FILTERING = "crawldb.url.filters";
> public static final String URL_NORMALIZING = "crawldb.url.normalizers";
> public static final String URL_NORMALIZING_SCOPE = 
> "crawldb.url.normalizers.scope";

In nutch-default.xml:
> <property>
>   <name>db.url.normalizers</name>
>   <value>false</value>
>   <description>Normalize urls when updating crawldb</description>
> </property>
>
> <property>
>   <name>db.url.filters</name>
>   <value>false</value>
>   <description>Filter urls when updating crawldb</description>
> </property>



These properties should be in line with code.





Re: Upgrade to Hadoop 3

2018-03-19 Thread Sebastian Nagel
> If I remember correctly one could not run multiple Nutch instances from the 
> same user
> because all those instances would write to the same TMP file or something 
> like this...

Just make sure that every instance has its own temp folder configured by 
setting
  -Dhadoop.tmp.dir=...
That's required for local mode (which is intended for testing). In single-node 
pseudo-distributed
mode the temp folders are automatically configured per job.
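
One possible way to derive a distinct value per instance is sketched below. Only the -Dhadoop.tmp.dir option itself comes from the advice above; the class name, the instance names, and the directory layout under java.io.tmpdir are made up for illustration.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: builds one -Dhadoop.tmp.dir option per Nutch instance.
class PerInstanceTmpDir {
  // Each instance name maps to its own directory below the given base.
  static String tmpDirOption(Path base, String instanceName) {
    Path dir = base.resolve("hadoop-tmp-" + instanceName).toAbsolutePath();
    return "-Dhadoop.tmp.dir=" + dir;
  }

  public static void main(String[] args) {
    // Assumption: the instances may share a base directory under java.io.tmpdir.
    Path base = Paths.get(System.getProperty("java.io.tmpdir"), "nutch-instances");
    for (String name : new String[] {"crawlA", "crawlB"}) {
      System.out.println(tmpDirOption(base, name));
    }
  }
}
```

Each printed option would then be passed to the command line of the matching instance, so that no two instances running in local mode share a temp folder.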

> try and use Hadoop 3, say like till the end of next week and report back.

Then you might try pseudo-distributed mode:
   
http://hadoop.apache.org/docs/r3.0.0/hadoop-project-dist/hadoop-common/SingleCluster.html

Best,
Sebastian

On 03/14/2018 12:47 PM, BlackIce wrote:
> I'm redoing everything on my end pretty much from scratch
> 
> The question I had when I woke up this morning... Do I really need to 
> configure VMs in order to run
> multiple nodes? Or does running multiple Nutch-Solr nodes suffice if they 
> are under their own user
> space? If I remember correctly one could not run multiple Nutch instances 
> from the same user
> because all those instances would write to the same TMP file or something 
> like this... But this
> shouldn't be the case when each instance is run from its own user account.
> 
> With that said... If it's really as easy as running the instances from their 
> own account... I could
> try and use Hadoop 3, say like till the end of next week and report back.
> 
> Greetz
> 
> On Wed, Mar 14, 2018 at 12:01 AM, Lewis John McGibbney wrote:
> 
> Hi Seb,
> 
> On 2018/03/12 11:00:52, Sebastian Nagel wrote:
> > Hi,
> >
> > > seeing as we have just merged in the 'new' MR patch
> >
> > yep, but there's still something to do (NUTCH-2517,
> 
> ACK, this needs more testing.
> 
> > NUTCH-2518).
> 
> I honestly didn't see this come through but yes you are right.
> 
> > Better to address this before any upgrade of the Hadoop version.
> 
> ACK
> 
> > But since there seem to be no breaking MapReduce API changes
> >   http://hadoop.apache.org/docs/r3.0.0/index.html 
> 
> > I would even expect that the Nutch job jar (built for 2.7)
> > will run on Hadoop 3.0, or does it not?
> >
> 
> I have absolutely no idea. I've certainly not had an opportunity to run 
> on H v3 cluster.
> 
> 



Re: Config issues with URL filters and normalizers in UpdateCrawlDb

2018-03-19 Thread Sebastian Nagel
Hi Semyon,

sorry for the late answer. Yes, you're right, the naming in nutch-default.xml is 
wrong.
Please open a Jira issue to address this.

The description should also mention that the property
crawldb.url.filters is "temporary" and is set/overwritten by command-line 
options.
Cf. the overview (somewhat outdated) on
  https://wiki.apache.org/nutch/NutchPropertiesCompleteList

Best,
Sebastian

On 02/19/2018 02:24 PM, Semyon Semyonov wrote:
> Gents,
> 
> To use URL filters and normalizers in CrawlDBUpdate, three config settings 
> may be used:
>  
> In CrawlDbFilter, lines 41-43:
>   public static final String URL_FILTERING = "crawldb.url.filters";
>   public static final String URL_NORMALIZING = "crawldb.url.normalizers";
>   public static final String URL_NORMALIZING_SCOPE = 
> "crawldb.url.normalizers.scope";
> 
> 
> However, in nutch-default.xml we have different names:
> <property>
>   <name>db.url.normalizers</name>
>   <value>false</value>
>   <description>Normalize urls when updating crawldb</description>
> </property>
> 
> <property>
>   <name>db.url.filters</name>
>   <value>false</value>
>   <description>Filter urls when updating crawldb</description>
> </property>
> 
> 
> Obviously, that is the reason why URLNormalizers/Filters don't work.
> 
> Should I change CrawlDbFilter code to
>  public static final String URL_FILTERING = "db.url.filters";
>   public static final String URL_NORMALIZING = "db.url.normalizers";
>   public static final String URL_NORMALIZING_SCOPE = 
> "db.url.normalizers.scope";
> 
> 
> ?
> 
> Semyon.
>