date:20171218

[jira] [Commented] (NUTCH-2450) Remove FixMe in ParseOutputFormat

2017-12-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296169#comment-16296169
 ] 

ASF GitHub Bot commented on NUTCH-2450:
---

kpm1985 closed pull request #235: Fix for NUTCH-2450 by Kenneth McFarland
URL: https://github.com/apache/nutch/pull/235
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/parse/ParseOutputFormat.java b/src/java/org/apache/nutch/parse/ParseOutputFormat.java
index 2c8396a75..9cf147790 100644
--- a/src/java/org/apache/nutch/parse/ParseOutputFormat.java
+++ b/src/java/org/apache/nutch/parse/ParseOutputFormat.java
@@ -362,7 +362,6 @@ public static String filterNormalize(String fromUrl, String toUrl,
     if (ignoreExternalLinks) {
       if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode)) {
         String toDomain = URLUtil.getDomainName(targetURL).toLowerCase();
-        //FIXME: toDomain will never be null, correct?
         if (toDomain == null || !toDomain.equals(origin)) {
           return null; // skip it
         }
@@ -379,15 +378,16 @@ public static String filterNormalize(String fromUrl, String toUrl,
     if (ignoreInternalLinks) {
       if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode)) {
         String toDomain = URLUtil.getDomainName(targetURL).toLowerCase();
-        //FIXME: toDomain will never be null, correct?
         if (toDomain == null || toDomain.equals(origin)) {
           return null; // skip it
         }
       } else {
         String toHost = targetURL.getHost().toLowerCase();
-        //FIXME: toDomain will never be null, correct?
         if (toHost == null || toHost.equals(origin)) {
-          return null; // skip it
+          if (exemptionFilters == null // check if it is exempted?
+              || !exemptionFilters.isExempted(fromUrl, toUrl)) {
+            return null; ///skip it, This external url is not exempted.
+          }
         }
       }
     }


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Remove FixMe in ParseOutputFormat
> -
>
> Key: NUTCH-2450
> URL: https://issues.apache.org/jira/browse/NUTCH-2450
> Project: Nutch
>  Issue Type: Bug
> Environment: master branch
>Reporter: Kenneth McFarland
>Assignee: Kenneth McFarland
>Priority: Minor
>
> ParseOutputFormat contains a few FIXMEs that I've looked at. If a valid URL 
> is created, it will always return valid results. There is a spot in the code 
> where the try/catch is already done, so the predicate is satisfied and there 
> is no need to keep checking it.
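
The reasoning in the description can be illustrated with a minimal Java sketch.
It uses only the standard java.net classes; the class name HostNullCheckSketch
and the helper lowerCasedHost are hypothetical and not part of Nutch. It mirrors
the pattern in ParseOutputFormat: the URL is validated once in a try/catch, the
host is lower-cased, and only then compared with null. A null check placed after
.toLowerCase() can never observe null, because the call would already have
thrown a NullPointerException.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URL;

final class HostNullCheckSketch {

  /** Returns the lower-cased host, or null if the URL is invalid ("skip it"). */
  static String lowerCasedHost(String spec) {
    URL url;
    try {
      // Validation happens once, up front.
      url = URI.create(spec).toURL();
    } catch (MalformedURLException | IllegalArgumentException e) {
      return null; // invalid URL: skip it
    }
    String host = url.getHost().toLowerCase();
    // If getHost() had returned null, the line above would already have
    // thrown, so this condition can never be true at this point.
    if (host == null) {
      throw new IllegalStateException("unreachable");
    }
    return host;
  }

  public static void main(String[] args) {
    System.out.println(lowerCasedHost("http://Example.COM/index.html"));
    System.out.println(lowerCasedHost("not a url"));
  }
}
```

The sketch only shows why the comparison with null is redundant in this
position; it says nothing about what the surrounding Nutch code should do for
URLs without a host.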



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (NUTCH-2450) Remove FixMe in ParseOutputFormat

2017-12-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16296168#comment-16296168
 ] 

ASF GitHub Bot commented on NUTCH-2450:
---

kpm1985 commented on issue #235: Fix for NUTCH-2450 by Kenneth McFarland
URL: https://github.com/apache/nutch/pull/235#issuecomment-352634339
 
 
   I closed the issue. The code is still labeled "FIXME". I don't want to get in 
the way of closing issues, @lewismc, and since Eclipse has been such a source of 
frustration, I am going to take a break from the time pressure and come back 
later. 
   
   I hope that is OK. I have taken notes and will come back with a preemptive 
strike next time. I'm just not good with pressure right now. Thank you for 
giving me direction; my next PR will be more carefully thought out because 
of this.



[jira] [Closed] (NUTCH-2450) Remove FixMe in ParseOutputFormat

2017-12-18 Thread Kenneth McFarland (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2450?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kenneth McFarland closed NUTCH-2450.

Resolution: Not A Problem

This is not a problem; the code is still labeled FIXME. I have run out of time and 
Eclipse is still turning my hair grey. I will attempt this again later, but do not 
want this issue causing problems. The code still clearly invites another analyst to 
clear the issue.


[VOTE] Release Apache Nutch 1.14 RC#1

2017-12-18 Thread Sebastian Nagel

Hi Folks,

A first candidate for the Nutch 1.14 release is available at:

  https://dist.apache.org/repos/dist/dev/nutch/1.14/

The release candidate is a zip and tar.gz archive of the binary and sources in:
  https://github.com/apache/nutch/tree/release-1.14
The SHA1 checksum of the release commit is
  a8e60bdfb79b368612f068ed5aeeb690e29b448d

In addition, a staged maven repository is available here:
  https://repository.apache.org/content/repositories/orgapachenutch-1014/

We addressed 79 Issues:
   
https://issues.apache.org/jira/secure/ReleaseNote.jspa?projectId=10680&version=12340218

Please vote on releasing this package as Apache Nutch 1.14.
The vote is open for the next 72 hours and passes if a majority of at
least three +1 Nutch PMC votes are cast.

[ ] +1 Release this package as Apache Nutch 1.14.
[ ] -1 Do not release this package because…

Cheers,
Sebastian
(On behalf of the Nutch PMC)

P.S. Here is my +1.

[jira] [Commented] (NUTCH-2353) Create seed file with metadata using the REST API

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295333#comment-16295333
 ] 

Hudson commented on NUTCH-2353:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3488 (See 
[https://builds.apache.org/job/Nutch-trunk/3488/])
NUTCH-2353 Create seed file with metadata using the REST API - reverse (snagel: 
[https://github.com/apache/nutch/commit/dae62f8bc3fd041e71a0c43abbc6c5f7590bb88d])
* (edit) src/java/org/apache/nutch/service/model/request/SeedUrl.java
* (edit) src/java/org/apache/nutch/service/resources/SeedResource.java
* (edit) src/java/org/apache/nutch/webui/model/SeedUrl.java


> Create seed file with metadata using the REST API
> -
>
> Key: NUTCH-2353
> URL: https://issues.apache.org/jira/browse/NUTCH-2353
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector, REST_api
>Affects Versions: 1.12
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: rest_api
> Fix For: 1.15
>
>
> At the moment it's not possible to create a seed file and specify any metadata 
> when using the REST API. The file gets created but there is no option to add 
> any metadata to the seed URLs.
> If we use a payload like this:
> {code}
> {
>   "name": "name-of-seedlist",
>   "seedUrls": [
>     {
>       "url": "http://example.com",
>       "metadata": {
>         "key1": "value1",
>         "key2": "value2",
>         "key3": "value3"
>       }
>     }
>   ]
> }
> {code}
> it should be easy to specify the desired metadata. This should also keep 
> backward compatibility (BC) with the previous array syntax if we only want to 
> specify the list of URLs without any metadata at all.




[jira] [Commented] (NUTCH-2483) Remove/replace indirect dependencies to org.json

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295332#comment-16295332
 ] 

Hudson commented on NUTCH-2483:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3488 (See 
[https://builds.apache.org/job/Nutch-trunk/3488/])
NUTCH-2483 Remove/replace indirect dependencies to org.json - exclude (snagel: 
[https://github.com/apache/nutch/commit/75d66e9fae6b2006c969e8eaa9789807aae90a38])
* (edit) ivy/ivy.xml


> Remove/replace indirect dependencies to org.json
> 
>
> Key: NUTCH-2483
> URL: https://issues.apache.org/jira/browse/NUTCH-2483
> Project: Nutch
>  Issue Type: Bug
>  Components: deployment
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
> Fix For: 1.14
>
>
> As an indirect transitive dependency, we ship with the Nutch 1.x binary packages 
> a jar file of org.json, whose [license|http://www.json.org/license.html] has 
> for a year now been among the [category 
> x|https://www.apache.org/legal/resolved.html#category-x] licenses (see also the 
> [license faq|https://www.apache.org/legal/resolved.html#json]).
> We should check whether the library is mandatory and then exclude or replace 
> it.




[jira] [Reopened] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2392:


> Get same pages multiple times if URL contains relative path
> ---
>
> Key: NUTCH-2392
> URL: https://issues.apache.org/jira/browse/NUTCH-2392
> Project: Nutch
>  Issue Type: Bug
>  Components: commoncrawl
>Affects Versions: 1.13
> Environment: Ubuntu, JRE 1.8.131, Apache Solr 6.5.1
>Reporter: Jayesh Shende
>Priority: Critical
>  Labels: features
>   Original Estimate: 60h
>  Remaining Estimate: 60h
>
> When a website links to the same HTML document through different relative URLs 
> on different pages, Nutch fetches the document more than once. For example, 
> at the first depth I fetched the contents of the page 
> http://example.com/index.html. A few depths later I got a link (constructed by 
> Nutch from a relative path pattern in an anchor tag) 
> http://example.com/Level1/Level2/../../index.html . In this case Nutch 
> fetches the same HTML document twice, treating the two URLs as different even 
> though they are not. 
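
The two URLs from the report can be shown to be equivalent with a minimal Java
sketch. It assumes only the standard java.net.URI class; the class name
RelativePathSketch and the helper normalize are hypothetical, and the sketch
does not use Nutch's own URL normalizer plugins. URI.normalize() removes the
".." segments together with the path segments they cancel, so both URLs reduce
to the same string.

```java
import java.net.URI;

final class RelativePathSketch {

  /** Resolves "." and ".." path segments using the standard library. */
  static String normalize(String url) {
    return URI.create(url).normalize().toString();
  }

  public static void main(String[] args) {
    String direct = "http://example.com/index.html";
    String viaDotDot = "http://example.com/Level1/Level2/../../index.html";
    // Both forms normalize to http://example.com/index.html
    System.out.println(normalize(viaDotDot));
    System.out.println(normalize(direct).equals(normalize(viaDotDot)));
  }
}
```

If URLs are normalized like this before they are compared or stored, the second
link is recognized as the page that was already fetched.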




[jira] [Resolved] (NUTCH-2212) Decrease memory consumption by tuning stack size

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2212.

Resolution: Not A Problem

> Decrease memory consumption by tuning stack size
> 
>
> Key: NUTCH-2212
> URL: https://issues.apache.org/jira/browse/NUTCH-2212
> Project: Nutch
>  Issue Type: Task
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
>
> In today's environments it is common to see a default stack size (-Xss) of 1 
> MB. This is ridiculous for a fetcher running many fetcher.threads.fetch. The 
> actual number of threads is much higher due to parsing and running it on 
> YARN. 
> We can decrease stack usage by 75 %, 1 MB to a safe 256 kB. YARN will run out 
> of stack size if we set it to 128 kB.
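
The arithmetic behind the 75 % figure can be checked with a short Java sketch.
The class name StackSizeSketch, its helpers, and the thread count of 50 are
hypothetical values chosen for illustration; they are not Nutch defaults. The
sketch treats the stack size as memory reserved per thread, which is an upper
bound rather than the memory actually touched. It also shows the standard
Thread constructor that accepts a per-thread stack size hint, as an alternative
to the JVM-wide -Xss option.

```java
final class StackSizeSketch {
  static final long KB = 1024L;

  /** Stack memory reserved for the given number of threads, in bytes. */
  static long reservedBytes(int threads, long stackBytes) {
    return threads * stackBytes;
  }

  /** Percentage saved per thread when shrinking the stack size. */
  static double savingPercent(long oldBytes, long newBytes) {
    return 100.0 * (oldBytes - newBytes) / oldBytes;
  }

  public static void main(String[] args) throws InterruptedException {
    int threads = 50; // hypothetical number of fetcher threads
    System.out.println(reservedBytes(threads, 1024 * KB) / KB + " kB with 1 MB stacks");
    System.out.println(reservedBytes(threads, 256 * KB) / KB + " kB with 256 kB stacks");
    System.out.println(savingPercent(1024 * KB, 256 * KB) + " % saved");

    // The last constructor argument is a stack size hint for this thread only;
    // the JVM is free to round it or ignore it on some platforms.
    Thread t = new Thread(null, () -> { }, "fetcher-sketch", 256 * KB);
    t.start();
    t.join();
  }
}
```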




[jira] [Resolved] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2392.

Resolution: Won't Fix


[jira] [Updated] (NUTCH-2212) Decrease memory consumption by tuning stack size

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2212:
---
Fix Version/s: (was: 1.14)


[jira] [Reopened] (NUTCH-2212) Decrease memory consumption by tuning stack size

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2212:



[jira] [Resolved] (NUTCH-2212) Decrease memory consumption by tuning stack size

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2212.

Resolution: Fixed

Removed the fix version so that the issue is not listed in the release notes.


[jira] [Reopened] (NUTCH-2212) Decrease memory consumption by tuning stack size

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2212:



[jira] [Resolved] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2392.

Resolution: Fixed

Removed the fix version so that the issue is not listed in the release notes.


[jira] [Updated] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2392:
---
Fix Version/s: (was: 1.14)


[jira] [Reopened] (NUTCH-2392) Get same pages multiple times if URL contains relative path

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2392:



[jira] [Updated] (NUTCH-2185) protocol-soda-consumer plugin

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2185:
---
Fix Version/s: (was: 1.14)

> protocol-soda-consumer plugin
> -
>
> Key: NUTCH-2185
> URL: https://issues.apache.org/jira/browse/NUTCH-2185
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin, protocol
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>
> I'm finishing off a Nutch protocol implementation for interacting with the 
> popular [Socrata|https://www.socrata.com/] Open Data platform via their 
> [soda-java api|https://github.com/socrata/soda-java]. I feel that this would 
> be useful for Government and other public sector organizations who make their 
> data available through the Socrata platforms so it is my intention to propose 
> it as a protocol-soda-consumer plugin for Nutch.




[jira] [Updated] (NUTCH-2353) Create seed file with metadata using the REST API

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2353:
---
Fix Version/s: (was: 1.14)
   1.15


[jira] [Resolved] (NUTCH-2483) Remove/replace indirect dependencies to org.json

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2483.

Resolution: Fixed


[jira] [Commented] (NUTCH-2483) Remove/replace indirect dependencies to org.json

2017-12-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2483?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295271#comment-16295271
 ] 

ASF GitHub Bot commented on NUTCH-2483:
---

sebastian-nagel closed pull request #265: NUTCH-2483 Remove/replace indirect 
dependencies to org.json
URL: https://github.com/apache/nutch/pull/265
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 520afa0f0..2dbe58351 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -95,6 +95,7 @@



+   

 

@@ -128,7 +129,9 @@



-   
+   
+   
+   





 



[jira] [Commented] (NUTCH-2035) Regex filter using case sensitive rules.

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295247#comment-16295247
 ] 

Hudson commented on NUTCH-2035:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2035 urlfilter-regex case insensitive rules (snagel: 
[https://github.com/apache/nutch/commit/e0e06f58015c982700c5ec0a2a4a43dde642f03f])
* (edit) conf/regex-urlfilter.txt.template


> Regex filter using case sensitive rules.
> 
>
> Key: NUTCH-2035
> URL: https://issues.apache.org/jira/browse/NUTCH-2035
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: filters, regex, regex-urlfilter
> Fix For: 2.4, 1.14
>
> Attachments: regex-urlfilter.txt
>
>
> Regular expressions are computationally expensive, and having “EXE|exe|JPG|jpg” 
> and so on adds up if we use complex rules.
> The regex filter should use case-insensitive rules to make the rules more 
> readable and improve performance.
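
The readability argument can be illustrated with a small Java sketch using the
standard java.util.regex.Pattern class. The class name
CaseInsensitiveRuleSketch and the two example rules are hypothetical; they are
not taken from conf/regex-urlfilter.txt.template. The sketch compares a
case-sensitive rule that lists both spellings with a shorter rule compiled with
the CASE_INSENSITIVE flag.

```java
import java.util.regex.Pattern;

final class CaseInsensitiveRuleSketch {
  // Case-sensitive rule: every spelling has to be listed explicitly.
  static final Pattern VERBOSE = Pattern.compile("\\.(EXE|exe|JPG|jpg)$");

  // Same intent, written once and matched regardless of case.
  static final Pattern COMPACT =
      Pattern.compile("\\.(exe|jpg)$", Pattern.CASE_INSENSITIVE);

  /** True if the rule matches the URL suffix, i.e. the URL would be excluded. */
  static boolean excluded(Pattern rule, String url) {
    return rule.matcher(url).find();
  }

  public static void main(String[] args) {
    String mixedCase = "http://example.com/photo.Jpg";
    System.out.println(excluded(VERBOSE, mixedCase)); // spelling not listed
    System.out.println(excluded(COMPACT, mixedCase)); // matched by the flag
  }
}
```

Besides being shorter, the case-insensitive rule also covers mixed-case
suffixes such as ".Jpg", which the explicit list misses. The sketch makes no
claim about relative performance.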




[jira] [Commented] (NUTCH-2370) FileDumper: save JSON mapping file -> URL

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295241#comment-16295241
 ] 

Hudson commented on NUTCH-2370:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
fix for NUTCH-2370 contributed by msha...@usc.edu (snagel: 
[https://github.com/apache/nutch/commit/34236ffecf478a1776559b0ed8c1ad929483d752])
* (edit) src/java/org/apache/nutch/tools/FileDumper.java


> FileDumper: save JSON mapping file -> URL
> -
>
> Key: NUTCH-2370
> URL: https://issues.apache.org/jira/browse/NUTCH-2370
> Project: Nutch
>  Issue Type: Improvement
>  Components: dumpers
>Affects Versions: 1.14
>Reporter: Madhav Sharan
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> - nutch dump [0] is a great tool to simply dump all the crawled files from 
> nutch segments.
> - After a dump we lose the information about the URL from which a file was 
> crawled. The URL is used to name the dumped file, but that information is encrypted.
> - With the `reverseUrlDirs` option one can figure out the URL by checking the 
> file path, but even accessing the file path is more complicated than a simple 
> mapping file.
> - With `flatdir` there is no way to know the actual URL.
> I am submitting a PR which edits [0] and saves a JSON file for each crawled 
> segment that maps each file path to its URL.
> [0] 
> https://github.com/apache/nutch/blob/3e2d3d456489bf52bc586dae0e2e71fb7aad8fe7/src/java/org/apache/nutch/tools/FileDumper.java
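
The idea of a mapping file can be sketched in a few lines of Java. This is not
the FileDumper implementation; the class name DumpMappingSketch, the helpers,
and the example file name are hypothetical, and the sketch assumes the mapping
is a flat JSON object from dumped file path to source URL. It uses only the
standard library and escapes only quotes and backslashes, which is enough for
illustration but not a complete JSON encoder.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

final class DumpMappingSketch {

  /** Quotes a string for JSON; handles only '"' and '\' (illustrative). */
  static String quote(String s) {
    return "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"";
  }

  /** Serializes a file-path-to-URL map as a flat JSON object. */
  static String toJson(Map<String, String> fileToUrl) {
    StringJoiner json = new StringJoiner(",", "{", "}");
    for (Map.Entry<String, String> e : fileToUrl.entrySet()) {
      json.add(quote(e.getKey()) + ":" + quote(e.getValue()));
    }
    return json.toString();
  }

  public static void main(String[] args) {
    Map<String, String> fileToUrl = new LinkedHashMap<>();
    // Hypothetical dumped file name and the URL it was crawled from.
    fileToUrl.put("dump/0a1b2c.html", "http://example.com/index.html");
    System.out.println(toJson(fileToUrl));
  }
}
```

With such a file next to the dump, the source URL of a dumped file can be
looked up directly by its path, also in the `flatdir` case.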




[jira] [Commented] (NUTCH-2380) indexer-elastic version upgrade to 5.3.0

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295253#comment-16295253
 ] 

Hudson commented on NUTCH-2380:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2380 Upgrade indexer-elastic to Elasticsearch version 5.3.0 (snagel: 
[https://github.com/apache/nutch/commit/dd94a61d3359ede0e35480b26926901f25c4b250])
* (edit) src/plugin/indexer-elastic/ivy.xml
* (edit) 
src/plugin/indexer-elastic/src/test/org/apache/nutch/indexwriter/elastic/TestElasticIndexWriter.java
* (edit) 
src/plugin/indexer-elastic/src/java/org/apache/nutch/indexwriter/elastic/ElasticIndexWriter.java
* (edit) src/plugin/indexer-elastic/plugin.xml


> indexer-elastic version upgrade to 5.3.0
> 
>
> Key: NUTCH-2380
> URL: https://issues.apache.org/jira/browse/NUTCH-2380
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2380-indexer-elastic-p0.patch
>
>
> The current version of the indexer-elastic plugin is not compatible with ES 
> 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch 
> classloader fix (NUTCH-2378) due to runtime dependency issues. 
> I didn't test compatibility with ES 2.x, so not sure if that still works.
> Please let me know what you think of the provided patch.




[jira] [Commented] (NUTCH-2477) Refactor *Checker classes to use base class for common code

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16295249#comment-16295249
 ] 

Hudson commented on NUTCH-2477:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
fix for NUTCH-2477 (refactor checker classes) contributed by Jurian (snagel: 
[https://github.com/apache/nutch/commit/4da6b19e3b149687c624a996fc065207561217ed])
* (edit) src/java/org/apache/nutch/net/URLFilterChecker.java
* (add) src/java/org/apache/nutch/util/AbstractChecker.java
* (edit) src/java/org/apache/nutch/net/URLFilters.java
* (edit) src/java/org/apache/nutch/indexer/IndexingFiltersChecker.java
* (edit) src/java/org/apache/nutch/net/URLNormalizerChecker.java


> Refactor *Checker classes to use base class for common code
> ---
>
> Key: NUTCH-2477
> URL: https://issues.apache.org/jira/browse/NUTCH-2477
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 1.14
>
>
> The various Checker class implementations have quite a bit of duplicated code 
> in them. This should be refactored for cleanliness and maintainability.




[jira] [Commented] (NUTCH-2362) Upgrade MaxMind GeoIP version in index-geoip

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295246#comment-16295246
 ] 

Hudson commented on NUTCH-2362:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2362 Upgrade MaxMind GeoIP version in index-geoip (snagel: 
[https://github.com/apache/nutch/commit/e7d5c137f88816fd4b5d5054ca7fb151bae0e97e])
* (edit) src/plugin/index-geoip/ivy.xml
* (edit) src/plugin/index-geoip/plugin.xml


> Upgrade MaxMind GeoIP version in index-geoip
> 
>
> Key: NUTCH-2362
> URL: https://issues.apache.org/jira/browse/NUTCH-2362
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 1.14
>
>
> The current version of the GeoIP dependency is 2.8.1; we should upgrade:
> http://search.maven.org/#search|gav|1|g%3A%22com.maxmind.geoip2%22%20AND%20a%3A%22geoip2%22




[jira] [Commented] (NUTCH-2480) Upgrade crawler-commons dependency to 0.9

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2480?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295244#comment-16295244
 ] 

Hudson commented on NUTCH-2480:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2480 Upgrade crawler-commons dependency to 0.9 (snagel: 
[https://github.com/apache/nutch/commit/e7b077eeb2d823b3a09259435915ae69b2a3471a])
* (edit) ivy/ivy.xml


> Upgrade crawler-commons dependency to 0.9
> -
>
> Key: NUTCH-2480
> URL: https://issues.apache.org/jira/browse/NUTCH-2480
> Project: Nutch
>  Issue Type: Improvement
>  Components: build, deployment
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.14
>
>
> Crawler-commons [0.9 is 
> released|https://groups.google.com/d/msg/crawler-commons/O39RrYlwwTY/m4VS0YMvBgAJ].
> We should upgrade the dependency: there are significant improvements in the 
> sitemap parser. Also, crawler-commons 0.9 depends on Tika 1.16, which narrows 
> the gap to Tika 1.17 (NUTCH-2439).




[jira] [Commented] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled if db.ignore.external.links.mode == byDomain

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295239#comment-16295239
 ] 

Hudson commented on NUTCH-2365:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2365 Fetcher to respect db.ignore.external.links.mode for (snagel: 
[https://github.com/apache/nutch/commit/7cc622e14f9bcdb5fb0547ee54c0966424eabcd9])
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java


> HTTP Redirects to SubDomains don't get crawled if 
> db.ignore.external.links.mode == byDomain
> ---
>
> Key: NUTCH-2365
> URL: https://issues.apache.org/jira/browse/NUTCH-2365
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.12
> Environment: Fedora 25
>Reporter: Sriram Nookala
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> When crawling http://www.mercenarytrader.com, which redirects to 
> https://members.mercenarytrader.com, the redirect is not followed by Nutch even 
> though 'db.ignore.external.links' is set to 'true' and 
> 'db.ignore.external.links.mode' is set to 'byDomain'. 
> The bug is in FetcherThread, where the comparison is by host and not by 
> domain:
>   String origHost = new URL(urlString).getHost().toLowerCase();
>   String newHost = new URL(newUrl).getHost().toLowerCase();
>   if (ignoreExternalLinks) {
>     if (!origHost.equals(newHost)) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug(" - ignoring redirect " + redirType + " from "
>             + urlString + " to " + newUrl
>             + " because external links are ignored");
>       }
>       return null;
>     }
>   }
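
To make the host-versus-domain distinction in the report concrete, here is a minimal, self-contained sketch. It is not the Nutch code: the class RedirectScopeSketch and both helper methods are hypothetical, and naiveDomain() is a deliberately crude stand-in (it keeps the last two host labels) for a real domain lookup such as the URLUtil.getDomainName() call seen in the diffs in this thread.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical illustration only; not part of Nutch.
class RedirectScopeSketch {

  // Crude stand-in for a real domain lookup: keeps the last two host labels.
  // Assumption: good enough for "example.com"-style names, wrong for e.g. "co.uk".
  static String naiveDomain(String host) {
    String lower = host.toLowerCase(Locale.ROOT);
    String[] labels = lower.split("\\.");
    int n = labels.length;
    return n <= 2 ? lower : labels[n - 2] + "." + labels[n - 1];
  }

  // Returns true if the redirect target stays within the scope defined by mode.
  static boolean sameScope(String fromUrl, String toUrl, String mode) {
    String fromHost = URI.create(fromUrl).getHost().toLowerCase(Locale.ROOT);
    String toHost = URI.create(toUrl).getHost().toLowerCase(Locale.ROOT);
    if ("bydomain".equalsIgnoreCase(mode)) {
      return naiveDomain(fromHost).equals(naiveDomain(toHost));
    }
    return fromHost.equals(toHost); // byHost
  }

  public static void main(String[] args) {
    String from = "http://www.mercenarytrader.com";
    String to = "https://members.mercenarytrader.com";
    System.out.println("byHost:   " + sameScope(from, to, "byHost"));
    System.out.println("byDomain: " + sameScope(from, to, "byDomain"));
  }
}
```

With the two URLs from the report, the byHost comparison treats the redirect as external, while the byDomain comparison treats it as internal, which is the behaviour the reporter expected.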




[jira] [Commented] (NUTCH-2478) // is not a valid base URL

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2478?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295248#comment-16295248
 ] 

Hudson commented on NUTCH-2478:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2478 HTML parser should resolve base URL  - fix (snagel: 
[https://github.com/apache/nutch/commit/35193c2ddcbe8f24ea09eeabd9e90f7bc52097d5])
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
NUTCH-2478 HTML parser should resolve base URL  - finally 
(snagel: 
[https://github.com/apache/nutch/commit/8f692d13d45642f8b447d47af796f06487afeec2])
* (edit) src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/TikaParser.java
* (edit) src/plugin/parse-html/src/test/org/apache/nutch/parse/html/TestHtmlParser.java
* (add) src/plugin/parse-tika/src/test/org/apache/nutch/tika/TestHtmlParser.java
* (edit) src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java
* (edit) src/java/org/apache/nutch/util/DomUtil.java


> // is not a valid base URL
> --
>
> Key: NUTCH-2478
> URL: https://issues.apache.org/jira/browse/NUTCH-2478
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
>
> This test fails:
> {code}
>   @Test
>   public void testBadResolver() throws Exception {
>     URL base = new URL("//www.example.org/");
>     String target = "index/produkt/kanaly/";
>
>     URL abs = URLUtil.resolveURL(base, target);
>     Assert.assertEquals("http://www.example.org/index/produkt/kanaly/",
>         abs.toString());
>   }
> {code}
> It has to fail because the base URL is invalid, so the current URL is used 
> instead. If the current URL is not /, its path is prepended, and the crawler 
> ends up requesting URLs that return 404.
> This ticket must allow // as a base, and resolve the protocol.
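
As a rough illustration of what "resolve the protocol" means here, the sketch below resolves a protocol-relative base such as //www.example.org/ against the URL of the page it was found on, and then resolves the link target against the result. It uses only java.net.URI from the JDK; the class BaseUrlSketch, its methods, and the page URL are made up for the example, and this is not claimed to be how the committed Nutch fix works.

```java
import java.net.URI;

// Hypothetical illustration only; not part of Nutch.
class BaseUrlSketch {

  // A base href of the form "//host/" has no scheme; resolving it against
  // the page URL borrows the scheme (http or https) from the page.
  static URI effectiveBase(String pageUrl, String baseHref) {
    return URI.create(pageUrl).resolve(baseHref);
  }

  // Resolves a relative link target against the effective base.
  static String resolve(String pageUrl, String baseHref, String target) {
    return effectiveBase(pageUrl, baseHref).resolve(target).toString();
  }

  public static void main(String[] args) {
    // Assumed page URL; any http page on any host would do.
    String page = "http://www.example.org/some/deep/page.html";
    System.out.println(resolve(page, "//www.example.org/", "index/produkt/kanaly/"));
  }
}
```

Because the base is resolved first, the path of the current page (/some/deep/) is not prepended to the target, which is the failure mode described in the issue.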




[jira] [Commented] (NUTCH-2322) URL not available for Jexl operations

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295250#comment-16295250
 ] 

Hudson commented on NUTCH-2322:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2322 URL not available for Jexl operations - apply patch (snagel: 
[https://github.com/apache/nutch/commit/22fc7f0defb22588c4ade33b5693303f18d96253])
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDatum.java
* (edit) src/java/org/apache/nutch/crawl/Generator.java


> URL not available for Jexl operations
> -
>
> Key: NUTCH-2322
> URL: https://issues.apache.org/jira/browse/NUTCH-2322
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2322-1.11.patch, NUTCH-2322-1.11.patch, 
> NUTCH-2322.patch, NUTCH-2322.patch, NUTCH-2322.patch, NUTCH-2322.patch
>
>
> In CrawlDatum.evaluate(), the record's URL is just missing.




[jira] [Commented] (NUTCH-2034) CrawlDB filtered documents counter.

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2034?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295251#comment-16295251
 ] 

Hudson commented on NUTCH-2034:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2034 CrawlDB update job to count documents in CrawlDb rejected by 
(snagel: 
[https://github.com/apache/nutch/commit/e0a27c7870d632966d584cf45399b98ba77e2bd6])
* (edit) src/java/org/apache/nutch/crawl/CrawlDb.java
* (edit) src/java/org/apache/nutch/crawl/CrawlDbFilter.java


> CrawlDB filtered documents counter.
> ---
>
> Key: NUTCH-2034
> URL: https://issues.apache.org/jira/browse/NUTCH-2034
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.10
>Reporter: Luis Lopez
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: counters, crawldb, filter, info, regex
> Fix For: 1.14
>
>
> When we are doing big crawls we would like to know how many of the URLs are 
> being discarded by the regex filters. Currently this is only reported by the 
> Injector class:
> Injector: Total number of urls rejected by filters: 0
> It would be nice to have a counter in the CrawlDb class so we know in every 
> round how many were discarded by our filters:
> CrawlDb update: Total number of URLs filtered by regex filters: 31415
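
The requested counter can be sketched in a few lines. The example below is a stand-alone illustration, not the Nutch implementation: FilterCounterSketch, its reject pattern, and the sample URLs are all assumptions, and a plain long field stands in for whatever job counter the real CrawlDb update job uses.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of Nutch.
class FilterCounterSketch {
  private final Pattern reject;
  private long filtered = 0; // stand-in for a job counter

  FilterCounterSketch(String rejectRegex) {
    this.reject = Pattern.compile(rejectRegex);
  }

  // Keeps URLs that do not match the reject pattern and counts the rest.
  List<String> apply(List<String> urls) {
    List<String> kept = new ArrayList<>();
    for (String url : urls) {
      if (reject.matcher(url).find()) {
        filtered++;
      } else {
        kept.add(url);
      }
    }
    return kept;
  }

  long getFiltered() {
    return filtered;
  }

  public static void main(String[] args) {
    FilterCounterSketch filter = new FilterCounterSketch("\\.(gif|jpg|png)$");
    filter.apply(Arrays.asList(
        "http://example.org/",
        "http://example.org/logo.png",
        "http://example.org/photo.jpg"));
    System.out.println("CrawlDb update: Total number of URLs filtered by regex filters: "
        + filter.getFiltered());
  }
}
```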




[jira] [Commented] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295252#comment-16295252
 ] 

Hudson commented on NUTCH-2216:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2216 db.ignore.*.links to optionally follow internal redirects - (snagel: 
[https://github.com/apache/nutch/commit/856e5513d4c8f9a4e35778b2e4b8f18da1e46fcc])
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java


> db.ignore.*.links to optionally follow internal redirects
> -
>
> Key: NUTCH-2216
> URL: https://issues.apache.org/jira/browse/NUTCH-2216
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2216.patch, NUTCH-2216.patch
>
>
> db.ignore.internal.links doesn't follow any internal hyperlinks or redirects. 
> Together with db.ignore.external.links it helps to restrict the crawl to a 
> predefined set of URLs, for example provided by a customer.
> In many cases, a few of those URLs are redirects, which are not followed. 
> This issue adds an option to allow internal redirects despite 
> db.ignore.internal.links being enabled.
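
The intended behaviour can be summarised as a small decision function. The sketch below is only an illustration under assumptions: the property name db.ignore.also.redirects is taken from the pull request diff quoted elsewhere in this thread, while the class RedirectGateSketch, the method followRedirect(), and the precomputed sameHostOrDomain flag are hypothetical simplifications.

```java
// Hypothetical illustration only; not part of Nutch.
class RedirectGateSketch {

  // Decides whether a redirect may be followed.
  //   ignoreAlsoRedirects  - value of db.ignore.also.redirects
  //   ignoreInternalLinks  - value of db.ignore.internal.links
  //   ignoreExternalLinks  - value of db.ignore.external.links
  //   sameHostOrDomain     - whether source and target share a host (or domain)
  static boolean followRedirect(boolean ignoreAlsoRedirects,
      boolean ignoreInternalLinks, boolean ignoreExternalLinks,
      boolean sameHostOrDomain) {
    if (!ignoreAlsoRedirects) {
      return true; // redirects are exempt from the link rules
    }
    if (ignoreExternalLinks && !sameHostOrDomain) {
      return false; // external redirect, external links ignored
    }
    if (ignoreInternalLinks && sameHostOrDomain) {
      return false; // internal redirect, internal links ignored
    }
    return true;
  }

  public static void main(String[] args) {
    // Internal redirect while internal links are ignored:
    System.out.println(followRedirect(true, true, false, true));
    // Same situation with db.ignore.also.redirects set to false:
    System.out.println(followRedirect(false, true, false, true));
  }
}
```

The first call rejects the internal redirect; the second follows it because redirects are exempted from the link rules.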




[jira] [Commented] (NUTCH-2295) Nutch master docker container broken

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295240#comment-16295240
 ] 

Hudson commented on NUTCH-2295:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2295 Nutch master docker container broken - upgrade base image to 
(snagel: 
[https://github.com/apache/nutch/commit/7d21158b934f421fc8987b019c627cae598d6ef0])
* (edit) docker/Dockerfile


> Nutch master docker container broken
> 
>
> Key: NUTCH-2295
> URL: https://issues.apache.org/jira/browse/NUTCH-2295
> Project: Nutch
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.12
>Reporter: Lewis John McGibbney
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> Right now the Docker container at 
> https://github.com/apache/nutch/blob/25e879afc9c48981e3daccb055b5389799fae464/docker/Dockerfile
>  is broken. 
> Various links need updating. The base image could be updated to Ubuntu 16. 
> Nutch is no longer held within SVN, etc.
> Needs a bit of time put into resolving these issues.




[jira] [Commented] (NUTCH-2474) CrawlDbReader -stats fails with ClassCastException

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2474?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295242#comment-16295242
 ] 

Hudson commented on NUTCH-2474:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2474 CrawlDbReader -stats fails with ClassCastException - replace 
(snagel: 
[https://github.com/apache/nutch/commit/d758a31bbee0807bcbc92a591668076cfa95aeb1])
* (edit) src/java/org/apache/nutch/crawl/CrawlDbReader.java


> CrawlDbReader -stats fails with ClassCastException
> --
>
> Key: NUTCH-2474
> URL: https://issues.apache.org/jira/browse/NUTCH-2474
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.14
> Environment: Java 8, distributed mode: Hadoop CDH 5.13.0
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Critical
> Fix For: 1.14
>
>
> In distributed mode CrawlDbReader / readdb -stats fails with a 
> ClassCastException in the combiner:
> {noformat}
> 17/12/08 04:57:13 INFO mapreduce.Job: Task Id : attempt_1512553291624_0022_m_39_0, Status : FAILED
> Error: java.lang.ClassCastException: org.apache.hadoop.io.FloatWritable cannot be cast to org.apache.hadoop.io.LongWritable
>   at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:296)
>   at org.apache.nutch.crawl.CrawlDbReader$CrawlDbStatCombiner.reduce(CrawlDbReader.java:222)
>   at org.apache.hadoop.mapred.Task$OldCombinerRunner.combine(Task.java:1639)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.mergeParts(MapTask.java:1946)
>   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.flush(MapTask.java:1514)
>   at org.apache.hadoop.mapred.MapTask.runOldMapper(MapTask.java:466)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:343)
> {noformat}
> FloatWritables are used since NUTCH-2470, so that's when this bug was 
> introduced.
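
The failure mode in the stack trace, an unconditional cast applied to a value stream that now contains a second type, can be reproduced without Hadoop. In the sketch below java.lang.Long and java.lang.Float stand in for LongWritable and FloatWritable; the class MixedValueCombinerSketch and both methods are hypothetical, and checking the runtime type is just one possible remedy, not necessarily what the committed fix does.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; Long/Float stand in for LongWritable/FloatWritable.
class MixedValueCombinerSketch {

  // Reproduces the bug: every value is assumed to be a Long.
  static long blindSum(List<Number> values) {
    long total = 0;
    for (Number v : values) {
      total += (Long) v; // throws ClassCastException when v is a Float
    }
    return total;
  }

  // One possible remedy: branch on the runtime type before casting.
  static double typedSum(List<Number> values) {
    double total = 0;
    for (Number v : values) {
      if (v instanceof Long) {
        total += (Long) v;
      } else if (v instanceof Float) {
        total += (Float) v;
      } else {
        throw new IllegalArgumentException("unexpected value type: " + v.getClass());
      }
    }
    return total;
  }

  public static void main(String[] args) {
    List<Number> values = Arrays.<Number>asList(2L, 1.5f);
    System.out.println("typedSum: " + typedSum(values));
    try {
      blindSum(values);
    } catch (ClassCastException e) {
      System.out.println("blindSum failed: " + e.getMessage());
    }
  }
}
```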




[jira] [Commented] (NUTCH-2354) Upgrade Hadoop dependencies to 2.7.4

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295245#comment-16295245
 ] 

Hudson commented on NUTCH-2354:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2354 Upgrade Hadoop dependencies to 2.7.4 (snagel: 
[https://github.com/apache/nutch/commit/416c457a9ddcd22f5746432a2777b9e6aa47877d])
* (edit) ivy/ivy.xml


> Upgrade Hadoop dependencies to 2.7.4
> 
>
> Key: NUTCH-2354
> URL: https://issues.apache.org/jira/browse/NUTCH-2354
> Project: Nutch
>  Issue Type: Bug
>  Components: injector
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Blocker
> Fix For: 1.14
>
>
> This Wednesday we experienced trouble running the 1.12 injector on Hadoop 
> 2.7.3. We operated 2.7.2 before and we had no trouble running a job.
> {code}
> 2017-01-18 15:36:53,005 FATAL [main] org.apache.hadoop.mapred.YarnChild: Error running child : java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:216)
>   at org.apache.nutch.crawl.Injector$InjectMapper.map(Injector.java:100)
>   at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:146)
>   at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
>   at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:164)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.security.auth.Subject.doAs(Subject.java:422)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
>   at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Exception in thread "main" java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected
>   at org.apache.nutch.crawl.Injector.inject(Injector.java:383)
>   at org.apache.nutch.crawl.Injector.run(Injector.java:467)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.nutch.crawl.Injector.main(Injector.java:441)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)
>   at org.apache.hadoop.util.RunJar.run(RunJar.java:221)
>   at org.apache.hadoop.util.RunJar.main(RunJar.java:136)
> {code}
> Our processes retried injecting for a few minutes until we manually shut them 
> down. Meanwhile, our CrawlDB on HDFS was gone. Thanks to snapshots and/or 
> backups we could restore it, so enable those if you haven't done so yet.
> These freak Hadoop errors can be notoriously difficult to debug, but it seems 
> we are in luck: recompile Nutch with Hadoop 2.7.3 instead of 2.4.0. You are also 
> in luck if your job file uses the old org.apache.hadoop.mapred.* API; only jobs 
> using the org.apache.hadoop.mapreduce.* API seem to fail.




[jira] [Commented] (NUTCH-2439) Upgrade to Apache Tika 1.17

2017-12-18 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2439?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295243#comment-16295243
 ] 

Hudson commented on NUTCH-2439:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3487 (See 
[https://builds.apache.org/job/Nutch-trunk/3487/])
NUTCH-2439 Upgrade Apache Tika dependency to 1.17 (snagel: 
[https://github.com/apache/nutch/commit/42bdc65df4569d66d188ffc9981e6bf7baea45c7])
* (edit) src/plugin/parse-tika/ivy.xml
* (edit) src/plugin/parse-tika/plugin.xml
* (edit) ivy/ivy.xml


> Upgrade to Apache Tika 1.17
> ---
>
> Key: NUTCH-2439
> URL: https://issues.apache.org/jira/browse/NUTCH-2439
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2439.patch, NUTCH-2439.patch
>
>





[jira] [Resolved] (NUTCH-2295) Nutch master docker container broken

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2295.

Resolution: Fixed

Fixed in 1.x. Thanks, [~lewismc]!

> Nutch master docker container broken
> 
>
> Key: NUTCH-2295
> URL: https://issues.apache.org/jira/browse/NUTCH-2295
> Project: Nutch
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.12
>Reporter: Lewis John McGibbney
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> Right now the Docker container at 
> https://github.com/apache/nutch/blob/25e879afc9c48981e3daccb055b5389799fae464/docker/Dockerfile
>  is broken. 
> Various links need updating. The base image could be updated to Ubuntu 16. 
> Nutch is no longer held within SVN, etc.
> Needs a bit of time put into resolving these issues.




[jira] [Commented] (NUTCH-2295) Nutch master docker container broken

2017-12-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295219#comment-16295219
 ] 

ASF GitHub Bot commented on NUTCH-2295:
---

sebastian-nagel closed pull request #266: NUTCH-2295 Nutch master docker 
container broken
URL: https://github.com/apache/nutch/pull/266
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/docker/Dockerfile b/docker/Dockerfile
index 404f8c69f..c5ba8073a 100644
--- a/docker/Dockerfile
+++ b/docker/Dockerfile
@@ -13,28 +13,21 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 
-FROM ubuntu:14.04
+FROM ubuntu:16.04
 MAINTAINER Michael Joyce 
 
 WORKDIR /root/
 
-# Get the package containing apt-add-repository installed for adding repositories
-RUN apt-get update && apt-get install -y software-properties-common
 
-# Add the repository that we'll pull java down from.
-RUN add-apt-repository -y ppa:webupd8team/java && apt-get update && apt-get upgrade -y
-
-# Get Oracle Java 1.7 installed
-RUN echo oracle-java7-installer shared/accepted-oracle-license-v1-1 select true | /usr/bin/debconf-set-selections && apt-get install -y oracle-java7-installer oracle-java7-set-default
-
-# Install various dependencies
-RUN apt-get install -y ant openssh-server vim telnet subversion rsync curl build-essential 
+# Install dependencies
+RUN apt update
+RUN apt install -y ant openssh-server vim telnet git rsync curl openjdk-8-jdk-headless
 
 # Set up JAVA_HOME
-RUN echo 'export JAVA_HOME=$(readlink -f /usr/bin/java | sed "s:bin/java::")' >> $HOME/.bashrc
+RUN echo 'export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64' >> $HOME/.bashrc
 
 # Checkout and build the nutch trunk
-RUN svn checkout https://svn.apache.org/repos/asf/nutch/trunk/ nutch_source && cd nutch_source && ant
+RUN git clone https://github.com/apache/nutch.git nutch_source && cd nutch_source && ant runtime
 
 # Convenience symlink to Nutch runtime local
 RUN ln -s nutch_source/runtime/local $HOME/nutch


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Nutch master docker container broken
> 
>
> Key: NUTCH-2295
> URL: https://issues.apache.org/jira/browse/NUTCH-2295
> Project: Nutch
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.12
>Reporter: Lewis John McGibbney
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> Right now the Docker container at 
> https://github.com/apache/nutch/blob/25e879afc9c48981e3daccb055b5389799fae464/docker/Dockerfile
>  is broken. 
> Various links need updating. The base image could be updated to Ubuntu 16. 
> Nutch is no longer held within SVN, etc.
> Needs a bit of time put into resolving these issues.




[jira] [Resolved] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2216.

Resolution: Fixed

Committed to 1.x 
([c6e5dfb|https://github.com/apache/nutch/commit/c6e5dfb3d2f430d9b899a273515f58c093295baa]).
 Thanks, [~markus17]!

> db.ignore.*.links to optionally follow internal redirects
> -
>
> Key: NUTCH-2216
> URL: https://issues.apache.org/jira/browse/NUTCH-2216
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2216.patch, NUTCH-2216.patch
>
>
> db.ignore.internal.links doesn't follow any internal hyperlinks or redirects. 
> Together with db.ignore.external.links it helps to restrict the crawl to a 
> predefined set of URLs, for example provided by a customer.
> In many cases, a few of those URLs are redirects, which are not followed. 
> This issue adds an option to allow internal redirects despite 
> db.ignore.internal.links being enabled.




[jira] [Resolved] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled if db.ignore.external.links.mode == byDomain

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2365.

Resolution: Fixed

Fixed in 1.x - thanks, [~srinookala]!

> HTTP Redirects to SubDomains don't get crawled if 
> db.ignore.external.links.mode == byDomain
> ---
>
> Key: NUTCH-2365
> URL: https://issues.apache.org/jira/browse/NUTCH-2365
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.12
> Environment: Fedora 25
>Reporter: Sriram Nookala
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> When crawling http://www.mercenarytrader.com, which redirects to 
> https://members.mercenarytrader.com, the redirect is not followed by Nutch even 
> though 'db.ignore.external.links' is set to 'true' and 
> 'db.ignore.external.links.mode' is set to 'byDomain'. 
> The bug is in FetcherThread, where the comparison is by host and not by 
> domain:
>   String origHost = new URL(urlString).getHost().toLowerCase();
>   String newHost = new URL(newUrl).getHost().toLowerCase();
>   if (ignoreExternalLinks) {
>     if (!origHost.equals(newHost)) {
>       if (LOG.isDebugEnabled()) {
>         LOG.debug(" - ignoring redirect " + redirType + " from "
>             + urlString + " to " + newUrl
>             + " because external links are ignored");
>       }
>       return null;
>     }
>   }




[jira] [Commented] (NUTCH-2365) HTTP Redirects to SubDomains don't get crawled if db.ignore.external.links.mode == byDomain

2017-12-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2365?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295210#comment-16295210
 ] 

ASF GitHub Bot commented on NUTCH-2365:
---

sebastian-nagel closed pull request #264: NUTCH-2365 Fetcher to respect 
db.ignore.external.links.mode for redirects
URL: https://github.com/apache/nutch/pull/264
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index f386527a2..3d12be282 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -572,7 +572,7 @@
   <value>false</value>
   <description>If true, outlinks leading from a page to internal hosts or domain
   will be ignored. This is an effective way to limit the crawl to include
-  only initially injected hosts, without creating complex URLFilters.
+  only initially injected hosts or domains, without creating complex URLFilters.
   See 'db.ignore.external.links.mode'.
   </description>
 </property>
@@ -582,11 +582,21 @@
   <value>false</value>
   <description>If true, outlinks leading from a page to external hosts or domain
   will be ignored. This is an effective way to limit the crawl to include
-  only initially injected hosts, without creating complex URLFilters.
+  only initially injected hosts or domains, without creating complex URLFilters.
   See 'db.ignore.external.links.mode'.
   </description>
 </property>
 
+<property>
+  <name>db.ignore.also.redirects</name>
+  <value>true</value>
+  <description>If true, the fetcher checks redirects the same way as
+  links when ignoring internal or external links. Set to false to
+  follow redirects despite the values for db.ignore.external.links and
+  db.ignore.internal.links.
+  </description>
+</property>
+
 <property>
   <name>db.ignore.external.links.mode</name>
   <value>byHost</value>
diff --git a/src/java/org/apache/nutch/fetcher/FetcherThread.java b/src/java/org/apache/nutch/fetcher/FetcherThread.java
index 42d5d5077..6c70186a6 100644
--- a/src/java/org/apache/nutch/fetcher/FetcherThread.java
+++ b/src/java/org/apache/nutch/fetcher/FetcherThread.java
@@ -92,6 +92,7 @@
   private int redirectCount;
   private boolean ignoreInternalLinks;
   private boolean ignoreExternalLinks;
+  private boolean ignoreAlsoRedirects;
   private String ignoreExternalLinksMode;
 
   // Used by fetcher.follow.outlinks.depth in parse
@@ -207,6 +208,7 @@ public FetcherThread(Configuration conf, AtomicInteger activeThreads, FetchItemQ
 interval = conf.getInt("db.fetch.interval.default", 2592000);
 ignoreInternalLinks = conf.getBoolean("db.ignore.internal.links", false);
 ignoreExternalLinks = conf.getBoolean("db.ignore.external.links", false);
+ignoreAlsoRedirects = conf.getBoolean("db.ignore.also.redirects", true);
 ignoreExternalLinksMode = conf.get("db.ignore.external.links.mode", "byHost");
 maxOutlinkDepth = conf.getInt("fetcher.follow.outlinks.depth", -1);
 outlinksIgnoreExternal = conf.getBoolean(
@@ -484,69 +486,72 @@ private Text handleRedirect(Text url, CrawlDatum datum, String urlString,
 newUrl = normalizers.normalize(newUrl, URLNormalizers.SCOPE_FETCHER);
 newUrl = urlFilters.filter(newUrl);
 
-try {
-  String origHost = new URL(urlString).getHost().toLowerCase();
-  String newHost = new URL(newUrl).getHost().toLowerCase();
-  if (ignoreExternalLinks) {
-if (!origHost.equals(newHost)) {
-  if (LOG.isDebugEnabled()) {
-LOG.debug(" - ignoring redirect " + redirType + " from "
-+ urlString + " to " + newUrl
-+ " because external links are ignored");
+if (newUrl == null || newUrl.equals(urlString)) {
+  LOG.debug(" - {} redirect skipped: {}", redirType,
+  (newUrl != null ? "to same url" : "filtered"));
+  return null;
+}
+
+if (ignoreAlsoRedirects && (ignoreExternalLinks || ignoreInternalLinks)) {
+  try {
+URL origUrl = new URL(urlString);
+URL redirUrl = new URL(newUrl);
+if (ignoreExternalLinks) {
+  String origHostOrDomain, newHostOrDomain;
+  if ("bydomain".equalsIgnoreCase(ignoreExternalLinksMode)) {
+origHostOrDomain = URLUtil.getDomainName(origUrl).toLowerCase();
+newHostOrDomain = URLUtil.getDomainName(redirUrl).toLowerCase();
+  } else {
+// byHost
+origHostOrDomain = origUrl.getHost().toLowerCase();
+newHostOrDomain = redirUrl.getHost().toLowerCase();
   }
-  return null;
-}
-  }
-  
-  if (ignoreInternalLinks) {
-if (origHost.equals(newHost)) {
-  if (LOG.isDebugEnabled()) {
-LOG.debug(" - ignoring redirect " + redirType + " from "
-+ urlString + " to " + newUrl
-+ " because internal links are ignored");
+  if (!origHostOrDomain.equals(newHostOrDomain)) {
+

[jira] [Resolved] (NUTCH-2380) indexer-elastic version upgrade to 5.3.0

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2380.

Resolution: Fixed

Committed to 1.x 
([dd94a61|https://github.com/apache/nutch/commit/dd94a61d3359ede0e35480b26926901f25c4b250]).
 Thanks, [~jurian]!

> indexer-elastic version upgrade to 5.3.0
> 
>
> Key: NUTCH-2380
> URL: https://issues.apache.org/jira/browse/NUTCH-2380
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2380-indexer-elastic-p0.patch
>
>
> The current version of the indexer-elastic plugin is not compatible with ES 
> 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch 
> classloader fix (NUTCH-2378) due to runtime dependency issues. 
> I didn't test compatibility with ES 2.x, so not sure if that still works.
> Please let me know what you think of the provided patch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (NUTCH-2380) indexer-elastic version upgrade to 5.3.0

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2380:
---
Summary: indexer-elastic version upgrade to 5.3.0  (was: indexer-elastic 
version bump)

> indexer-elastic version upgrade to 5.3.0
> 
>
> Key: NUTCH-2380
> URL: https://issues.apache.org/jira/browse/NUTCH-2380
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2380-indexer-elastic-p0.patch
>
>
> The current version of the indexer-elastic plugin is not compatible with ES 
> 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch 
> classloader fix (NUTCH-2378) due to runtime dependency issues. 
> I didn't test compatibility with ES 2.x, so not sure if that still works.
> Please let me know what you think of the provided patch.




[jira] [Commented] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2017-12-18 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295175#comment-16295175
 ] 

Sebastian Nagel commented on NUTCH-2216:


Partially overlaps with NUTCH-2365, will update [PR 
#264|https://github.com/apache/nutch/pull/264] to include the fix for 
NUTCH-2216.

> db.ignore.*.links to optionally follow internal redirects
> -
>
> Key: NUTCH-2216
> URL: https://issues.apache.org/jira/browse/NUTCH-2216
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2216.patch, NUTCH-2216.patch
>
>
> db.ignore.internal.links doesn't follow any internal hyperlinks or redirects. 
> Together with db.ignore.external.links it helps to restrict the crawl to a 
> predefined set of URLs, for example provided by a customer.
> In many cases, a few of those URLs are redirects, which are not followed. 
> This issue adds an option to allow internal redirects despite 
> db.ignore.internal.links being enabled.
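
A minimal sketch of the decision described in this issue, assuming hosts are compared case-insensitively. The class RedirectPolicy, its methods, and the allowInternalRedirects flag are hypothetical names used for illustration only; the issue text does not name the new option, and this is not the Nutch implementation.

```java
import java.net.URI;
import java.util.Locale;

/** Hypothetical helper, not part of Nutch. */
final class RedirectPolicy {

  /** Returns true if both URLs have the same host, ignoring case. */
  static boolean sameHost(String fromUrl, String toUrl) {
    String from = URI.create(fromUrl).getHost();
    String to = URI.create(toUrl).getHost();
    if (from == null || to == null) {
      return false; // no host to compare, treat as external
    }
    return from.toLowerCase(Locale.ROOT).equals(to.toLowerCase(Locale.ROOT));
  }

  /**
   * Decides whether a redirect should be followed.
   * ignoreInternalLinks / ignoreExternalLinks mirror db.ignore.internal.links
   * and db.ignore.external.links; allowInternalRedirects is an assumed name
   * for the option this issue proposes.
   */
  static boolean followRedirect(String fromUrl, String toUrl,
      boolean ignoreInternalLinks, boolean ignoreExternalLinks,
      boolean allowInternalRedirects) {
    if (sameHost(fromUrl, toUrl)) {
      // internal redirect: followed unless internal links are ignored
      // and the exemption for redirects is switched off
      return !ignoreInternalLinks || allowInternalRedirects;
    }
    // external redirect: followed only if external links are not ignored
    return !ignoreExternalLinks;
  }
}
```

With both ignore options enabled, a redirect within the same host would be followed only when the exemption flag is set, while a redirect to another host would still be rejected.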




[jira] [Resolved] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2415.

Resolution: Fixed

Thanks, everyone!

> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.14
>
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.




[jira] [Commented] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-12-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295147#comment-16295147
 ] 

ASF GitHub Bot commented on NUTCH-2415:
---

sebastian-nagel closed pull request #219: NUTCH-2415 : Create a JEXL based 
IndexingFilter
URL: https://github.com/apache/nutch/pull/219
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/build.xml b/build.xml
index 47c2a2ede..56be49533 100644
--- a/build.xml
+++ b/build.xml
@@ -179,6 +179,7 @@
   
   
   
+  
   
   
   
@@ -630,6 +631,7 @@
   
   
   
+  
   
   
   
@@ -1040,6 +1042,8 @@
 
 
 
+
+
 
 
 
diff --git a/conf/nutch-default.xml b/conf/nutch-default.xml
index e68b0dd84..5e8606fe4 100644
--- a/conf/nutch-default.xml
+++ b/conf/nutch-default.xml
@@ -1618,6 +1618,24 @@ visit 
https://wiki.apache.org/nutch/SimilarityScoringFilter-->
   
 
 
+
+
+
+  index.jexl.filter
+  
+   A JEXL expression. If it evaluates to false,
+  the document will not be indexed.
+  Available primitives in the JEXL context:
+  * status, fetchTime, modifiedTime, retries, interval, score, signature, url, 
text, title
+  Available objects in the JEXL context:
+  * httpStatus - contains majorCode, minorCode, message
+  * documentMeta, contentMeta, parseMeta - contain all the Metadata properties.
+each property value is always an array of Strings (so if you expect one 
value, use [0])
+  * doc - contains all the NutchFields from the NutchDocument.
+each property value is always an array of Objects.
+  
+
+
 
 
 
diff --git a/default.properties b/default.properties
index 6b7a6ab79..c057518d8 100644
--- a/default.properties
+++ b/default.properties
@@ -170,6 +170,7 @@ plugins.index=\
org.apache.nutch.indexer.basic*:\
org.apache.nutch.indexer.feed*:\
org.apache.nutch.indexer.geoip*:\
+   org.apache.nutch.indexer.jexl*:\
org.apache.nutch.indexer.filter*:\
org.apache.nutch.indexer.links*:\
org.apache.nutch.indexer.metadata*:\
diff --git a/src/plugin/build.xml b/src/plugin/build.xml
index 5402d036c..5052082cd 100755
--- a/src/plugin/build.xml
+++ b/src/plugin/build.xml
@@ -40,6 +40,7 @@
 
 
 
+
 
 
 
@@ -159,6 +160,7 @@
 
 
 
+
 
 
 
diff --git a/src/plugin/index-jexl-filter/build.xml 
b/src/plugin/index-jexl-filter/build.xml
new file mode 100644
index 0..7aa7be24d
--- /dev/null
+++ b/src/plugin/index-jexl-filter/build.xml
@@ -0,0 +1,22 @@
+
+
+
+
+
+
+ 
diff --git a/src/plugin/index-jexl-filter/ivy.xml 
b/src/plugin/index-jexl-filter/ivy.xml
new file mode 100644
index 0..0a363f774
--- /dev/null
+++ b/src/plugin/index-jexl-filter/ivy.xml
@@ -0,0 +1,41 @@
+
+
+
+
+
+  
+
+http://nutch.apache.org"/>
+
+Apache Nutch
+
+  
+
+  
+
+  
+
+  
+
+
+  
+
+  
+  
+  
+
diff --git a/src/plugin/index-jexl-filter/plugin.xml 
b/src/plugin/index-jexl-filter/plugin.xml
new file mode 100644
index 0..a24a0c95f
--- /dev/null
+++ b/src/plugin/index-jexl-filter/plugin.xml
@@ -0,0 +1,37 @@
+
+
+
+
+   
+  
+ 
+  
+   
+
+
+
+
+
+
diff --git 
a/src/plugin/index-jexl-filter/src/java/org/apache/nutch/indexer/jexl/JexlIndexingFilter.java
 
b/src/plugin/index-jexl-filter/src/java/org/apache/nutch/indexer/jexl/JexlIndexingFilter.java
new file mode 100644
index 0..24284a67b
--- /dev/null
+++ 
b/src/plugin/index-jexl-filter/src/java/org/apache/nutch/indexer/jexl/JexlIndexingFilter.java
@@ -0,0 +1,131 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.nutch.indexer.jexl;
+
+import java.lang.invoke.MethodHandles;
+import java.util.Map.Entry;
+
+import org.apache.commons.jexl2.Expression;
+import org.apache.commons.jexl2.JexlContext;
+import org.apache.commons.jexl2.MapContext;
+import

[jira] [Commented] (NUTCH-2295) Nutch master docker container broken

2017-12-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295129#comment-16295129
 ] 

ASF GitHub Bot commented on NUTCH-2295:
---

sebastian-nagel opened a new pull request #266: NUTCH-2295 Nutch master docker 
container broken
URL: https://github.com/apache/nutch/pull/266
 
 
   - upgrade base image to Ubuntu 16.04
   - fetch Nutch sources via git


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Nutch master docker container broken
> 
>
> Key: NUTCH-2295
> URL: https://issues.apache.org/jira/browse/NUTCH-2295
> Project: Nutch
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.12
>Reporter: Lewis John McGibbney
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> Right now the Docker container at 
> https://github.com/apache/nutch/blob/25e879afc9c48981e3daccb055b5389799fae464/docker/Dockerfile
>  is broken. 
> Various links need updated. The base image could be updated to Ubuntu 16. 
> Nutch is no longer held within SVN, etc.
> Needs a bit of time put into resolving these issues.




[jira] [Updated] (NUTCH-2485) ParserFactory swallows exception

2017-12-18 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2485:
-
Attachment: NUTCH-2485.patch

Patch!

> ParserFactory swallows exception
> 
>
> Key: NUTCH-2485
> URL: https://issues.apache.org/jira/browse/NUTCH-2485
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2485.patch
>
>





[jira] [Created] (NUTCH-2485) ParserFactory swallows exception

2017-12-18 Thread Markus Jelsma (JIRA)

Markus Jelsma created NUTCH-2485:


 Summary: ParserFactory swallows exception
 Key: NUTCH-2485
 URL: https://issues.apache.org/jira/browse/NUTCH-2485
 Project: Nutch
  Issue Type: Bug
Affects Versions: 1.13
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Priority: Minor
 Fix For: 1.15







[jira] [Updated] (NUTCH-2295) Nutch master docker container broken

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2295:
---
Fix Version/s: (was: 1.15)
   1.14

> Nutch master docker container broken
> 
>
> Key: NUTCH-2295
> URL: https://issues.apache.org/jira/browse/NUTCH-2295
> Project: Nutch
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.12
>Reporter: Lewis John McGibbney
> Fix For: 1.14
>
>
> Right now the Docker container at 
> https://github.com/apache/nutch/blob/25e879afc9c48981e3daccb055b5389799fae464/docker/Dockerfile
>  is broken. 
> Various links need updated. The base image could be updated to Ubuntu 16. 
> Nutch is no longer held within SVN, etc.
> Needs a bit of time put into resolving these issues.




[jira] [Assigned] (NUTCH-2295) Nutch master docker container broken

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2295:
--

Assignee: Sebastian Nagel

> Nutch master docker container broken
> 
>
> Key: NUTCH-2295
> URL: https://issues.apache.org/jira/browse/NUTCH-2295
> Project: Nutch
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.12
>Reporter: Lewis John McGibbney
>Assignee: Sebastian Nagel
> Fix For: 1.14
>
>
> Right now the Docker container at 
> https://github.com/apache/nutch/blob/25e879afc9c48981e3daccb055b5389799fae464/docker/Dockerfile
>  is broken. 
> Various links need updated. The base image could be updated to Ubuntu 16. 
> Nutch is no longer held within SVN, etc.
> Needs a bit of time put into resolving these issues.




[jira] [Commented] (NUTCH-2359) Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

2017-12-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295051#comment-16295051
 ] 

ASF GitHub Bot commented on NUTCH-2359:
---

sebastian-nagel closed pull request #178: NUTCH-2359 RegexParseFilter: 
ill-formed rules raise error
URL: https://github.com/apache/nutch/pull/178
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/plugin/parsefilter-regex/README.txt 
b/src/plugin/parsefilter-regex/README.txt
new file mode 100644
index 0..9cbfbf170
--- /dev/null
+++ b/src/plugin/parsefilter-regex/README.txt
@@ -0,0 +1,37 @@
+Parsefilter-regex plugin
+
+Allow parsing and set custom defined fields using regex. Rules can be defined 
in a separate rule file or in the nutch configuration.
+
+If a rule file is used, should create a text file regex-parsefilter.txt (which 
is the default name of the rules file). To use a different filename, either 
update the file value in plugin’s build.xml or add parsefilter.regex.file 
config to the nutch config.
+
+ie:
+
+  parsefilter.regex.file
+  
+   /path/to/rulefile
+  
+\t\t\n
+
+ie:
+   my_first_field  htmlh1
+   my_second_field textmy_pattern
+
+
+If a rule file is not used, rules can be directly set in the nutch config:
+
+ie:
+
+  parsefilter.regex.rules
+  
+   my_first_field  htmlh1
+   my_second_field textmy_pattern
+  

> Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed
> 
>
> Key: NUTCH-2359
> URL: https://issues.apache.org/jira/browse/NUTCH-2359
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 1.12
>Reporter: Laknath Semage
>Assignee: Markus Jelsma
>Priority: Minor
>  Labels: patch
> Fix For: 1.13
>
>
> This patch fixes:
> 1) [Bug] Parsefilter-regex raises IndexOutOfBoundsException when rules are 
> ill-formed
> 2) Rules are split using any space character (\s) instead of tab (\t) 
> 3) A detailed Readme for the plugin
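
A minimal sketch of points 1 and 2 above, assuming each rule line consists of three whitespace-separated parts (field, source, pattern) as in the README excerpt. The class RuleLineParser and its methods are hypothetical and do not reproduce the committed patch; skipping ill-formed lines (rather than logging or failing) is also an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical parser for rule lines, not the plugin's actual code. */
final class RuleLineParser {

  /**
   * Splits one rule line into {field, source, pattern}.
   * Returns null for blank or ill-formed lines instead of letting an
   * index-based access fail later.
   */
  static String[] parseRule(String line) {
    if (line == null) {
      return null;
    }
    String trimmed = line.trim();
    if (trimmed.isEmpty()) {
      return null;
    }
    // limit 3: whitespace inside the pattern itself is preserved
    String[] parts = trimmed.split("\\s+", 3);
    return parts.length == 3 ? parts : null;
  }

  /** Parses all lines, silently skipping the ill-formed ones. */
  static List<String[]> parseRules(List<String> lines) {
    List<String[]> rules = new ArrayList<>();
    for (String line : lines) {
      String[] rule = parseRule(line);
      if (rule != null) {
        rules.add(rule);
      }
    }
    return rules;
  }
}
```

Checking the number of parts before indexing into the array is what avoids the IndexOutOfBoundsException named in the issue title.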




[jira] [Commented] (NUTCH-2359) Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed

2017-12-18 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2359?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295050#comment-16295050
 ] 

ASF GitHub Bot commented on NUTCH-2359:
---

sebastian-nagel commented on issue #178: NUTCH-2359 RegexParseFilter: 
ill-formed rules raise error
URL: https://github.com/apache/nutch/pull/178#issuecomment-352438765
 
 
   Committed to 1.x/master (9a9c4b3).




> Parsefilter-regex raises IndexOutOfBoundsException when rules are ill-formed
> 
>
> Key: NUTCH-2359
> URL: https://issues.apache.org/jira/browse/NUTCH-2359
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 1.12
>Reporter: Laknath Semage
>Assignee: Markus Jelsma
>Priority: Minor
>  Labels: patch
> Fix For: 1.13
>
>
> This patch fixes:
> 1) [Bug] Parsefilter-regex raises IndexOutOfBoundsException when rules are 
> ill-formed
> 2) Rules are split using any space character (\s) instead of tab (\t) 
> 3) A detailed Readme for the plugin




[jira] [Updated] (NUTCH-2381) In some situations the class TextProfileSignature gives different signatures for the same text "profile" page.

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2381?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2381:
---
Fix Version/s: 1.15

> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> --
>
> Key: NUTCH-2381
> URL: https://issues.apache.org/jira/browse/NUTCH-2381
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.13
>Reporter: Rodrigo Joni Sestari
>  Labels: signature
> Fix For: 1.15
>
>
> In some situations the class TextProfileSignature gives different signatures 
> for the same text "profile" page.
> The method TextProfileSignature.calculate uses a HashMap to save the tokens; 
> after some processing, the tokens are sorted by decreasing frequency.
> For some pages like "http://curia.europa.eu/jcms/" the text "profile" is the 
> same but the signature differs for each fetch.
> This happens because the tokens are sorted only by decreasing frequency. 
> Tokens with the same frequency may not have the same order in different 
> fetches.
> A HashMap makes no guarantees as to the order of the map and does not 
> guarantee that the order will remain constant over time.
> My suggestion is to change the method TokenComparator.compare in order to sort 
> by frequency and name.
> Rodrigo
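
A minimal sketch of the suggested ordering: sort by decreasing frequency and break ties by token text, so the result no longer depends on HashMap iteration order. The class StableTokenOrder and its nested Token type are hypothetical stand-ins; they are not the actual TextProfileSignature code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Map;

/** Hypothetical illustration of a deterministic token ordering. */
final class StableTokenOrder {

  /** Assumed minimal token record: text plus occurrence count. */
  static final class Token {
    final String val;
    final int cnt;

    Token(String val, int cnt) {
      this.val = val;
      this.cnt = cnt;
    }
  }

  /** Decreasing frequency first, then token text as tie-breaker. */
  static final Comparator<Token> BY_FREQ_THEN_NAME =
      Comparator.comparingInt((Token t) -> t.cnt).reversed()
          .thenComparing((Token t) -> t.val);

  /** Returns the token texts in a deterministic order. */
  static List<String> sortedTokens(Map<String, Integer> counts) {
    List<Token> tokens = new ArrayList<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      tokens.add(new Token(e.getKey(), e.getValue()));
    }
    tokens.sort(BY_FREQ_THEN_NAME);
    List<String> out = new ArrayList<>();
    for (Token t : tokens) {
      out.add(t.val);
    }
    return out;
  }
}
```

Because the comparator never returns 0 for two distinct token texts, the sorted order is the same for every fetch regardless of how the map happens to iterate.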




[jira] [Updated] (NUTCH-2415) Create a JEXL based IndexingFilter

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2415?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2415:
---
Fix Version/s: 1.14

> Create a JEXL based IndexingFilter
> --
>
> Key: NUTCH-2415
> URL: https://issues.apache.org/jira/browse/NUTCH-2415
> Project: Nutch
>  Issue Type: New Feature
>  Components: plugin
>Affects Versions: 1.13
>Reporter: Yossi Tamari
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
> Fix For: 1.14
>
>
> Following on NUTCH-2414 and NUTCH-2412, the requirement was raised for an 
> IndexingFilter plugin which will decide whether to index a document based on 
> a JEXL expression.




[jira] [Commented] (NUTCH-2382) indexer-hbase Nutch 1.x branch

2017-12-18 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295034#comment-16295034
 ] 

Jurian Broertjes commented on NUTCH-2382:
-

Yeah +1 for that.

> indexer-hbase Nutch 1.x branch
> --
>
> Key: NUTCH-2382
> URL: https://issues.apache.org/jira/browse/NUTCH-2382
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
> Fix For: 1.15
>
> Attachments: NUTCH-2382-indexer-hbase-p1.patch
>
>
> I've ported the indexer-hbase for Nutch 2.x 
> (https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. 
> Patch is attached.




[jira] [Updated] (NUTCH-2416) Fetcher to log thread ID

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2416:
---
Fix Version/s: (was: 1.14)
   1.15

> Fetcher to log thread ID
> 
>
> Key: NUTCH-2416
> URL: https://issues.apache.org/jira/browse/NUTCH-2416
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Trivial
> Fix For: 1.15
>
> Attachments: NUTCH-2416.patch, NUTCH-2416.patch, NUTCH-2416.patch
>
>
> Better logging for the fetcher.




[jira] [Updated] (NUTCH-2432) Protocol httpclient to disable cookies if http.enable.cookie.header is false

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2432?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2432:
---
Fix Version/s: (was: 1.14)
   1.15

> Protocol httpclient to disable cookies if http.enable.cookie.header is false
> 
>
> Key: NUTCH-2432
> URL: https://issues.apache.org/jira/browse/NUTCH-2432
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2432.patch
>
>





[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2466:
---
Fix Version/s: (was: 1.14)
   1.15

> Sitemap processor to follow redirects
> -
>
> Key: NUTCH-2466
> URL: https://issues.apache.org/jira/browse/NUTCH-2466
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2466.patch
>
>
> It does follow http > https, but not the following redirect, e.g. 
> sitemap_index.xml that some websites have.




[jira] [Updated] (NUTCH-2411) Index-metadata to support indexing multiple values for a field

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2411?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2411:
---
Fix Version/s: (was: 1.14)
   1.15

> Index-metadata to support indexing multiple values for a field 
> ---
>
> Key: NUTCH-2411
> URL: https://issues.apache.org/jira/browse/NUTCH-2411
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2411-1.13.patch, NUTCH-2411-1.13.patch, 
> NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch, NUTCH-2411.patch
>
>
> {code}
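> <property>
>   <name>index.metadata.separator</name>
>   <value></value>
>   <description>
>    Separator to use if you want to index multiple values for a given field.
>    Leave empty to treat each value as a single value.
>   </description>
> </property>
> <property>
>   <name>index.metadata.multivalued.fields</name>
>   <value></value>
>   <description>
>     Comma separated list of fields that are multi valued.
>   </description>
> </property>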
> {code}
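
A minimal sketch of how the two properties above could interact, assuming a field's raw value is split only when a separator is configured and the field is listed as multi-valued. The class MultiValueSplitter is hypothetical, and trimming and dropping empty parts are assumptions, not behaviour confirmed by the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.regex.Pattern;

/** Hypothetical helper, not the index-metadata plugin's code. */
final class MultiValueSplitter {

  /**
   * Returns the values to index for one field.
   * separator corresponds to index.metadata.separator, multiValuedFields to
   * the parsed index.metadata.multivalued.fields list.
   */
  static List<String> values(String field, String raw, String separator,
      Set<String> multiValuedFields) {
    if (separator == null || separator.isEmpty()
        || !multiValuedFields.contains(field)) {
      // empty separator or single-valued field: keep the value as is
      return Collections.singletonList(raw);
    }
    List<String> out = new ArrayList<>();
    // quote the separator so characters like '|' are taken literally
    for (String part : raw.split(Pattern.quote(separator))) {
      String value = part.trim(); // assumption: surrounding whitespace is dropped
      if (!value.isEmpty()) {
        out.add(value);
      }
    }
    return out;
  }
}
```

For a field listed as multi-valued with separator ",", the raw value "a, b,c" would yield three values; any other field keeps its raw value untouched.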




[jira] [Updated] (NUTCH-2321) Indexing filter checker leaks threads

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2321:
---
Fix Version/s: (was: 1.14)
   1.15

> Indexing filter checker leaks threads
> -
>
> Key: NUTCH-2321
> URL: https://issues.apache.org/jira/browse/NUTCH-2321
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2321.patch
>
>
> Same issue as NUTCH-2320.




[jira] [Updated] (NUTCH-2467) Sitemap type field can be null

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2467:
---
Fix Version/s: (was: 1.14)
   1.15

> Sitemap type field can be null
> --
>
> Key: NUTCH-2467
> URL: https://issues.apache.org/jira/browse/NUTCH-2467
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.15
>
> Attachments: NUTCH-2467.patch
>
>
> sitemap.isIndex() can return null for real sitemap indices, so their contents 
> won't be added to the CrawlDB. For example, the indices that 
> https://www.reisenco.nl/sitemap_index.xml points to are not processed.




[jira] [Updated] (NUTCH-2454) REST API fix for usage of hostdb in generator

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2454?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2454:
---
Fix Version/s: (was: 1.14)
   1.15

> REST API fix for usage of hostdb in generator
> -
>
> Key: NUTCH-2454
> URL: https://issues.apache.org/jira/browse/NUTCH-2454
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.12
>Reporter: Semyon Semyonov
> Fix For: 1.15
>
> Attachments: NUTCH-2368_RESTAPI_Fix.patch
>
>
> NutchNUTCH-2368
> Variable generate.max.count and fetcher.server.delay




[jira] [Updated] (NUTCH-2441) ARG_SEGMENT usage

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2441?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2441:
---
Fix Version/s: (was: 1.14)
   1.15

> ARG_SEGMENT usage
> -
>
> Key: NUTCH-2441
> URL: https://issues.apache.org/jira/browse/NUTCH-2441
> Project: Nutch
>  Issue Type: Improvement
>  Components: metadata
>Affects Versions: 1.13
>Reporter: Semyon Semyonov
> Fix For: 1.15
>
> Attachments: metadataARG_SEGMENT.patch
>
>
> In the class metadata/Nutch.java, public static final String ARG_SEGMENT = 
> "segment" is not used consistently. In some cases (Fetcher and ParseSegment) 
> it is interpreted as a single segment, in others (CrawlDb, LinkDb, 
> IndexingJob) as an array of segments. This mismatch leads to inconsistent 
> usage of the parameter.
> After a discussion with [~wastl-nagel], the proposed solution is to allow 
> both an array and a string in all cases. That avoids introducing breaking 
> changes.
> A patch is proposed.
>  *The remaining question is refactoring: all five components share the same 
> code (two versions of the same code, to be precise). Shouldn't we extract a 
> method and reduce duplication?*
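
A minimal sketch of the kind of shared helper the question above hints at, assuming the tool arguments arrive as a Map<String, Object>. The class SegmentArgs and the method segmentPaths are hypothetical; only the constant value "segment" is taken from the issue text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;

/** Hypothetical shared helper, not the attached patch. */
final class SegmentArgs {

  /** Same value as Nutch.ARG_SEGMENT according to the issue text. */
  static final String ARG_SEGMENT = "segment";

  /**
   * Accepts either a single String or a String[] under ARG_SEGMENT and
   * always returns a list, so callers need only one code path.
   */
  static List<String> segmentPaths(Map<String, Object> args) {
    Object value = args.get(ARG_SEGMENT);
    if (value == null) {
      return Collections.emptyList();
    }
    if (value instanceof String) {
      return Collections.singletonList((String) value);
    }
    if (value instanceof String[]) {
      return new ArrayList<>(Arrays.asList((String[]) value));
    }
    throw new IllegalArgumentException(
        "Unsupported type for '" + ARG_SEGMENT + "': " + value.getClass());
  }
}
```

Normalizing to a list in one place would let all five components accept both forms without duplicating the type checks.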




[jira] [Updated] (NUTCH-2417) Support for variable fetch delay via FreeGenerator

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2417:
---
Fix Version/s: (was: 1.14)
   1.15

> Support for variable fetch delay via FreeGenerator
> --
>
> Key: NUTCH-2417
> URL: https://issues.apache.org/jira/browse/NUTCH-2417
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.15
>
>
> Same as NUTCH-2368



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (NUTCH-2382) indexer-hbase Nutch 1.x branch

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2382?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2382:
---
Fix Version/s: (was: 1.14)
   1.15

> indexer-hbase Nutch 1.x branch
> --
>
> Key: NUTCH-2382
> URL: https://issues.apache.org/jira/browse/NUTCH-2382
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
> Fix For: 1.15
>
> Attachments: NUTCH-2382-indexer-hbase-p1.patch
>
>
> I've ported the indexer-hbase for Nutch 2.x 
> (https://github.com/apache/nutch/pull/184) to 1.x. Did some basic tests. 
> Patch is attached.




[jira] [Updated] (NUTCH-2248) CSS parser plugin

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2248:
---
Fix Version/s: (was: 1.14)
   1.15

> CSS parser plugin
> -
>
> Key: NUTCH-2248
> URL: https://issues.apache.org/jira/browse/NUTCH-2248
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser, plugin
>Reporter: Joseph Naegele
>Assignee: Chris A. Mattmann
> Fix For: 1.15
>
> Attachments: 102.patch
>
>
> This plugin allows for collecting {{uri}} links from CSS (stylesheets). This 
> is useful for collecting parent stylesheets, fonts, and images needed to 
> display web pages as intended.
> Parsed Outlinks do not have associated anchors, and no additional 
> text/content is parsed from the stylesheet.
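
A minimal sketch of link collection from a stylesheet, assuming the links in question are CSS url(...) references. The class CssUrlExtractor is hypothetical and uses a plain regular expression; it ignores CSS comments and escape sequences and is not the attached plugin's parser.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical extractor, not the plugin from the attached patch. */
final class CssUrlExtractor {

  // url( ... ) with an optional matching pair of single or double quotes
  private static final Pattern URL_FN = Pattern.compile(
      "url\\(\\s*(['\"]?)(.*?)\\1\\s*\\)", Pattern.CASE_INSENSITIVE);

  /** Returns the targets of all url(...) references, in document order. */
  static List<String> extract(String css) {
    List<String> links = new ArrayList<>();
    Matcher m = URL_FN.matcher(css);
    while (m.find()) {
      String target = m.group(2).trim();
      if (!target.isEmpty()) {
        links.add(target);
      }
    }
    return links;
  }
}
```

The extracted strings carry no anchor text, which matches the note above that parsed outlinks have no associated anchors; resolving them against the stylesheet's own URL is left out of this sketch.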




[jira] [Updated] (NUTCH-2310) Protocol-Selenium does not support HTTPS protocol

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2310:
---
Fix Version/s: (was: 1.14)
   1.15

> Protocol-Selenium does not support HTTPS protocol
> -
>
> Key: NUTCH-2310
> URL: https://issues.apache.org/jira/browse/NUTCH-2310
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.12
>Reporter: Joey Hong
>  Labels: easyfix
> Fix For: 1.15
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> The protocol-selenium and protocol-interactiveselenium plugins raise errors 
> whenever there is a URL with the HTTPS protocol.
>  From the source code for those plugins, we can see that HTTP is the only 
> scheme currently accepted, which makes Nutch unable to crawl HTTPS sites with 
> JS using Selenium Webdrivers. 




[jira] [Updated] (NUTCH-2295) Nutch master docker container broken

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2295?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2295:
---
Fix Version/s: (was: 1.14)
   1.15

> Nutch master docker container broken
> 
>
> Key: NUTCH-2295
> URL: https://issues.apache.org/jira/browse/NUTCH-2295
> Project: Nutch
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 1.12
>Reporter: Lewis John McGibbney
> Fix For: 1.15
>
>
> Right now the Docker container at 
> https://github.com/apache/nutch/blob/25e879afc9c48981e3daccb055b5389799fae464/docker/Dockerfile
>  is broken. 
> Various links need updated. The base image could be updated to Ubuntu 16. 
> Nutch is no longer held within SVN, etc.
> Needs a bit of time put into resolving these issues.




[jira] [Updated] (NUTCH-2290) Update licenses of bundled libraries

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2290?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2290:
---
Fix Version/s: (was: 1.14)
   1.15

> Update licenses of bundled libraries
> 
>
> Key: NUTCH-2290
> URL: https://issues.apache.org/jira/browse/NUTCH-2290
> Project: Nutch
>  Issue Type: Bug
>  Components: deployment
>Affects Versions: 2.3.1, 1.12
>Reporter: Sebastian Nagel
> Fix For: 1.15
>
>
> The files LICENSE.txt and NOTICE.txt were last edited 5 years ago and should 
> be updated to include all licenses of dependencies (and their dependencies) 
> in accordance with the [Assembling LICENSE and NOTICE 
> HOWTO|http://www.apache.org/dev/licensing-howto.html]:
> # check for missing or obsolete licenses due to added or removed dependencies
> # update year in NOTICE.txt -- should be a range according to the licensing 
> HOWTO
> # bundled libraries are referenced with path and version number, e.g. 
> {{lib/icu4j-4_0_1.jar}}. This would require updating LICENSE.txt with 
> every dependency upgrade. A more generic reference ("ICU4J") would be easier 
> to maintain, but the HOWTO requires us to "specify the version of the dependency 
> as licenses are sometimes changed".
> # try to reduce the size of LICENSE.txt (currently 5800 lines). Mainly, 
> according to the HOWTO, there is no need to repeat the Apache license again 
> and again.




[jira] [Updated] (NUTCH-2461) Generate passes the data to when maxCount == 0

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2461?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2461:
---
Fix Version/s: (was: 1.14)
   1.15

> Generate passes the data to when maxCount  == 0
> ---
>
> Key: NUTCH-2461
> URL: https://issues.apache.org/jira/browse/NUTCH-2461
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.14
>Reporter: Semyon Semyonov
>Priority: Critical
> Fix For: 1.15
>
>
> The generator checks the condition 
> if (maxCount > 0) at line 421 and stops generating for a host once the count 
> per host exceeds maxCount (continue at line 455), 
> but when maxCount == 0 it goes directly to line 465: output.collect(key, 
> entry);
> This is obviously not correct. The correct solution would be to add 
> if (maxCount == 0) {
>   continue;
> }
> at line 380.
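
The proposed guard can be sketched in isolation. This is a simplified stand-in for the per-host counting, not the actual Generator code: the class, the list-based input, and the treatment of a negative maxCount as "no limit" are assumptions made for illustration.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical, simplified model of the per-host limit; not the Nutch Generator. */
final class PerHostLimit {
  /** Returns the URLs that would be emitted if the reporter's guard were in place. */
  static List<String> select(List<String> urls, int maxCount) {
    List<String> selected = new ArrayList<>();
    Map<String, Integer> perHost = new HashMap<>();
    for (String url : urls) {
      if (maxCount == 0) {
        continue; // proposed guard: a limit of zero emits nothing
      }
      if (maxCount > 0) {
        String host = URI.create(url).getHost();
        int count = perHost.merge(host, 1, Integer::sum);
        if (count > maxCount) {
          continue; // this host already reached its limit
        }
      }
      // Assumption: a negative maxCount means "no limit", so everything passes.
      selected.add(url);
    }
    return selected;
  }

  public static void main(String[] args) {
    List<String> urls = Arrays.asList(
        "http://a.example/1", "http://a.example/2", "http://b.example/1");
    System.out.println(select(urls, 0));
    System.out.println(select(urls, 1));
  }
}
```

Without the first check, a limit of zero would skip the maxCount > 0 branch and fall through to the line that emits the entry, which is the behavior the report describes.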




[jira] [Updated] (NUTCH-2161) Interrupted failed and/or killed tasks fail to clean up temp directories in HDFS

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2161:
---
Fix Version/s: (was: 1.14)
   1.15

> Interrupted failed and/or killed tasks fail to clean up temp directories in 
> HDFS
> 
>
> Key: NUTCH-2161
> URL: https://issues.apache.org/jira/browse/NUTCH-2161
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.11
>Reporter: Lewis John McGibbney
> Fix For: 1.15
>
>
> If for example one kills an inject or generate job, Nutch does not clean up 
> 'temporary' directories and I have witnessed them remain within HDFS. This is 
> far from ideal if we have a large team of users all hammering away on Yarn 
> and persisting data into HDFS.
> We should investigate how to clean up these directories such that a cluster 
> admin is not left with all of the dross at the end of the long day ;)




[jira] [Updated] (NUTCH-2447) Work-around SSLProtocolException: handshake alert: unrecognized_name

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2447?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2447:
---
Fix Version/s: (was: 1.14)
   1.15

> Work-around SSLProtocolException: handshake alert: unrecognized_name
> 
>
> Key: NUTCH-2447
> URL: https://issues.apache.org/jira/browse/NUTCH-2447
> Project: Nutch
>  Issue Type: Bug
>  Components: protocol
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Critical
> Fix For: 1.15
>
> Attachments: NUTCH-2447.patch, NUTCH-2447.patch
>
>
> Nutch is unable to crawl some websites, regardless of the protocol plugin you 
> are using. The work-around you frequently find (-Djsse.enableSNIExtension=false) 
> does not work at all, so the internet is clearly lying to us!
> {code}
> 2017-10-23 12:43:52,911 INFO  api.HttpRobotRulesParser - Couldn't get robots.txt for https://www.eidsiva.net/: javax.net.ssl.SSLProtocolException: handshake alert:  unrecognized_name
> 2017-10-23 12:43:53,011 ERROR http.Http - Failed to get protocol output
> javax.net.ssl.SSLProtocolException: handshake alert:  unrecognized_name
> at sun.security.ssl.ClientHandshaker.handshakeAlert(ClientHandshaker.java:1446)
> at sun.security.ssl.SSLSocketImpl.recvAlert(SSLSocketImpl.java:2016)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:1125)
> at sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1375)
> at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1403)
> at sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1387)
> at org.apache.nutch.protocol.http.HttpResponse.<init>(HttpResponse.java:152)
> at org.apache.nutch.protocol.http.Http.getResponse(Http.java:72)
> at org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:271)
> at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:327)
> {code}




[jira] [Commented] (NUTCH-2353) Create seed file with metadata using the REST API

2017-12-18 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16295006#comment-16295006
 ] 

Sebastian Nagel commented on NUTCH-2353:


If no quick fix is available, I would revert the commit and mark this for 1.15, 
so that the webapp works for 1.14.

> Create seed file with metadata using the REST API
> -
>
> Key: NUTCH-2353
> URL: https://issues.apache.org/jira/browse/NUTCH-2353
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector, REST_api
>Affects Versions: 1.12
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: rest_api
> Fix For: 1.14
>
>
> At the moment it's not possible to create a seed file and specify any metadata 
> when using the REST API. The file gets created but there is no option to add 
> any metadata to the seed URLs.
> If we use a payload like this:
> {code}
> {
> "name":"name-of-seedlist", 
> "seedUrls":[
> {
> "url" : "http://example.com",
> "metadata" : {
> "key1" : "value1",
> "key2" : "value2",
> "key3" : "value3"
> }
> }
> ]
> }
> {code}
> It should be easy to specify the desired metadata. This should also remain backward compatible 
> with the previous array syntax if we only want to specify the list of URLs 
> without any metadata at all.
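
One way to keep both syntaxes working is sketched below. It assumes the JSON payload has already been parsed into plain Java collections, and that each seed is written as one line with the URL followed by tab-separated key=value metadata pairs. The class and method names are hypothetical and do not come from the Nutch REST code.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration only; not part of the Nutch REST API. */
final class SeedLines {
  /**
   * Renders one entry of "seedUrls" as a seed file line. The entry is either a
   * plain String (previous array syntax) or a Map with "url" and an optional
   * "metadata" Map (syntax proposed in the issue).
   */
  static String toSeedLine(Object entry) {
    if (entry instanceof String) {
      return (String) entry; // old syntax: URL only, no metadata
    }
    if (!(entry instanceof Map)) {
      throw new IllegalArgumentException("Unsupported seed entry: " + entry);
    }
    Map<?, ?> seed = (Map<?, ?>) entry;
    Object url = seed.get("url");
    if (url == null) {
      throw new IllegalArgumentException("Seed entry has no url: " + entry);
    }
    StringBuilder line = new StringBuilder(url.toString());
    Object metadata = seed.get("metadata");
    if (metadata instanceof Map) {
      for (Map.Entry<?, ?> pair : ((Map<?, ?>) metadata).entrySet()) {
        // Assumed seed file convention: tab-separated key=value pairs after the URL.
        line.append('\t').append(pair.getKey()).append('=').append(pair.getValue());
      }
    }
    return line.toString();
  }

  public static void main(String[] args) {
    Map<String, Object> metadata = new LinkedHashMap<>();
    metadata.put("key1", "value1");
    Map<String, Object> seed = new LinkedHashMap<>();
    seed.put("url", "http://example.com");
    seed.put("metadata", metadata);
    System.out.println(toSeedLine(seed));
    System.out.println(toSeedLine("http://example.org/"));
  }
}
```

Accepting Object and branching on the runtime type is what lets a list of bare URL strings keep working next to the new object form.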



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Reopened] (NUTCH-2353) Create seed file with metadata using the REST API

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2353:


The change causes the webapp to fail with 404 / NOT_FOUND and the following 
exception in hadoop.log:
{noformat}
2017-12-18 14:17:19,322 ERROR mortbay.log - Nested in 
org.springframework.beans.factory.BeanCreationException: Error creating bean 
with name 'crawlServiceImpl': Injection of resource dependencies failed; 
nested exception is org.springframework.beans.factory.BeanCreationException: 
Error creating bean with name 'createCrawlDao' defined in class path resource 
[org/apache/nutch/webui/config/SpringConfiguration.class]: Instantiation of 
bean failed; nested exception is 
org.springframework.beans.factory.BeanDefinitionStoreException: Factory method 
[public com.j256.ormlite.dao.Dao 
org.apache.nutch.webui.config.SpringConfiguration.createCrawlDao() throws 
java.sql.SQLException] threw exception; nested exception is 
java.lang.IllegalArgumentException: ORMLite does not know how to store 
interface java.util.Map for field metadata. Use another class or a custom 
persister.:
java.lang.IllegalArgumentException: ORMLite does not know how to store 
interface java.util.Map for field metadata. Use another class or a custom 
persister.
at com.j256.ormlite.field.FieldType.<init>(FieldType.java:189)
at com.j256.ormlite.field.FieldType.createFieldType(FieldType.java:957)
at com.j256.ormlite.table.DatabaseTableConfig.extractFieldTypes(DatabaseTableConfig.java:208)
at com.j256.ormlite.table.DatabaseTableConfig.fromClass(DatabaseTableConfig.java:146)
at com.j256.ormlite.table.TableInfo.<init>(TableInfo.java:53)
at com.j256.ormlite.dao.BaseDaoImpl.initialize(BaseDaoImpl.java:151)
at com.j256.ormlite.dao.BaseDaoImpl.<init>(BaseDaoImpl.java:128)
at com.j256.ormlite.dao.BaseDaoImpl.<init>(BaseDaoImpl.java:107)
at com.j256.ormlite.dao.BaseDaoImpl$4.<init>(BaseDaoImpl.java:907)
at com.j256.ormlite.dao.BaseDaoImpl.createDao(BaseDaoImpl.java:907)
at com.j256.ormlite.dao.DaoManager.createDao(DaoManager.java:70)
at com.j256.ormlite.field.FieldType.configDaoInformation(FieldType.java:380)
at com.j256.ormlite.dao.BaseDaoImpl.initialize(BaseDaoImpl.java:201)
at com.j256.ormlite.dao.BaseDaoImpl.<init>(BaseDaoImpl.java:128)
at com.j256.ormlite.dao.BaseDaoImpl.<init>(BaseDaoImpl.java:107)
at com.j256.ormlite.dao.BaseDaoImpl$4.<init>(BaseDaoImpl.java:907)
at com.j256.ormlite.dao.BaseDaoImpl.createDao(BaseDaoImpl.java:907)
at com.j256.ormlite.dao.DaoManager.createDao(DaoManager.java:70)
at com.j256.ormlite.spring.DaoFactory.createDao(DaoFactory.java:37)
at org.apache.nutch.webui.config.CustomDaoFactory.createDao(CustomDaoFactory.java:39)
at org.apache.nutch.webui.config.SpringConfiguration.createCrawlDao(SpringConfiguration.java:82)
{noformat}

> Create seed file with metadata using the REST API
> -
>
> Key: NUTCH-2353
> URL: https://issues.apache.org/jira/browse/NUTCH-2353
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector, REST_api
>Affects Versions: 1.12
>Reporter: Jorge Luis Betancourt Gonzalez
>Assignee: Jorge Luis Betancourt Gonzalez
>Priority: Minor
>  Labels: rest_api
> Fix For: 1.14
>
>
> At the moment it's not possible to create a seed file and specify any metadata 
> when using the REST API. The file gets created but there is no option to add 
> any metadata to the seed URLs.
> If we use a payload like this:
> {code}
> {
> "name":"name-of-seedlist", 
> "seedUrls":[
> {
> "url" : "http://example.com",
> "metadata" : {
> "key1" : "value1",
> "key2" : "value2",
> "key3" : "value3"
> }
> }
> ]
> }
> {code}
> It should be easy to specify the desired metadata. This should also remain backward compatible 
> with the previous array syntax if we only want to specify the list of URLs 
> without any metadata at all.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Resolved] (NUTCH-2431) URLFilterchecker to implement Tool-interface

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2431.

Resolution: Duplicate

Thanks, [~jurian]!

> URLFilterchecker to implement Tool-interface
> 
>
> Key: NUTCH-2431
> URL: https://issues.apache.org/jira/browse/NUTCH-2431
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: urlfilter
> Fix For: 1.14
>
> Attachments: NUTCH-2431.patch
>
>
> The current implementation of the URLFilterChecker does not allow for 
> command-line config overrides. It needs to implement the Tool interface for 
> this. 
> Please see the attached patch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Updated] (NUTCH-2431) URLFilterchecker to implement Tool-interface

2017-12-18 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2431:
---
Fix Version/s: 1.14

> URLFilterchecker to implement Tool-interface
> 
>
> Key: NUTCH-2431
> URL: https://issues.apache.org/jira/browse/NUTCH-2431
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: urlfilter
> Fix For: 1.14
>
> Attachments: NUTCH-2431.patch
>
>
> The current implementation of the URLFilterChecker does not allow for 
> command-line config overrides. It needs to implement the Tool interface for 
> this. 
> Please see the attached patch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (NUTCH-2431) URLFilterchecker to implement Tool-interface

2017-12-18 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2431?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294783#comment-16294783
 ] 

Jurian Broertjes commented on NUTCH-2431:
-

Yes, this is indeed resolved by NUTCH-2477.

> URLFilterchecker to implement Tool-interface
> 
>
> Key: NUTCH-2431
> URL: https://issues.apache.org/jira/browse/NUTCH-2431
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
>  Labels: urlfilter
> Attachments: NUTCH-2431.patch
>
>
> The current implementation of the URLFilterChecker does not allow for 
> command-line config overrides. It needs to implement the Tool interface for 
> this. 
> Please see the attached patch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

[jira] [Commented] (NUTCH-2380) indexer-elastic version bump

2017-12-18 Thread Jurian Broertjes (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2380?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16294778#comment-16294778
 ] 

Jurian Broertjes commented on NUTCH-2380:
-

I've tested it a while back, and it's currently also running for a customer. I 
guess it should be fine for 1.14

> indexer-elastic version bump
> 
>
> Key: NUTCH-2380
> URL: https://issues.apache.org/jira/browse/NUTCH-2380
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.14
>
> Attachments: NUTCH-2380-indexer-elastic-p0.patch
>
>
> The current version of the indexer-elastic plugin is not compatible with ES 
> 5.x. The patch bumps the ES lib version to 5.3 but also requires a Nutch 
> classloader fix (NUTCH-2378) due to runtime dependency issues. 
> I didn't test compatibility with ES 2.x, so not sure if that still works.
> Please let me know what you think of the provided patch.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
