[jira] [Commented] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336441#comment-16336441
 ] 

ASF GitHub Bot commented on NUTCH-2503:
---

lewismc closed pull request #281: NUTCH-2503: Add option to run tests for a 
single plugin
URL: https://github.com/apache/nutch/pull/281
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/build.xml b/build.xml
index 85bb923de..db163c620 100644
--- a/build.xml
+++ b/build.xml
@@ -411,7 +411,7 @@
   
   
   
-
   
 
+  
+
+  
+
   
   
 
diff --git a/src/plugin/build.xml b/src/plugin/build.xml
index d035d54b9..3f579e841 100755
--- a/src/plugin/build.xml
+++ b/src/plugin/build.xml
@@ -152,6 +152,13 @@
 
   
 
+  
+  
+  
+  
+
+  
+
   
   
   


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add option to run tests for a single plugin
> ---
>
> Key: NUTCH-2503
> URL: https://issues.apache.org/jira/browse/NUTCH-2503
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> Sometimes it makes sense to just run tests for a single plugin instead of 
> building all plugins and running all tests at once.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336242#comment-16336242
 ] 

Hudson commented on NUTCH-2503:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3498 (See 
[https://builds.apache.org/job/Nutch-trunk/3498/])
NUTCH-2503: Add option to run tests for a single plugin (moreno: 
[https://github.com/apache/nutch/commit/ea6a5f071baae3c55be22858822b251e4c781241])
* (edit) src/plugin/build.xml
* (edit) build.xml


> Add option to run tests for a single plugin
> ---
>
> Key: NUTCH-2503
> URL: https://issues.apache.org/jira/browse/NUTCH-2503
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> Sometimes it makes sense to just run tests for a single plugin instead of 
> building all plugins and running all tests at once.





[jira] [Commented] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336240#comment-16336240
 ] 

Hudson commented on NUTCH-2499:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3498 (See 
[https://builds.apache.org/job/Nutch-trunk/3498/])
fix for NUTCH-2499: Filter duplicated field values when indexing using (moreno: 
[https://github.com/apache/nutch/commit/a51686446d03dd27e04c4cb77f8bf0a60895954c])
* (edit) 
src/plugin/indexer-elastic-rest/src/java/org/apache/nutch/indexwriter/elasticrest/ElasticRestIndexWriter.java


> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> Due to a change in 
> https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a
>  the Elastic REST indexer does not work with HashSets for values anymore but 
> instead saves duplicated values as arrays.
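The behaviour the issue asks for can be illustrated with a minimal sketch. This is not the code from the actual commit; the class and method names below are hypothetical and only show one way to drop duplicated field values (set semantics) while keeping first-seen order before a document is sent to the index.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of Nutch: removes duplicated field values.
public class FieldValueDedup {

  /** Returns the values with duplicates removed, keeping first-seen order. */
  public static List<Object> dedupe(List<?> values) {
    // A LinkedHashSet keeps one copy of each value in insertion order.
    return new ArrayList<Object>(new LinkedHashSet<Object>(values));
  }

  public static void main(String[] args) {
    List<Object> out = dedupe(Arrays.asList("news", "sports", "news"));
    System.out.println(out);
  }
}
```

A LinkedHashSet is used here instead of a plain HashSet only so that the order of the remaining values is predictable; whether order matters for the indexer is an assumption.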





[jira] [Commented] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16336241#comment-16336241
 ] 

Hudson commented on NUTCH-2502:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3498 (See 
[https://builds.apache.org/job/Nutch-trunk/3498/])
NUTCH-2502: Add Content-Type filter option to Any23 plugin (moreno: 
[https://github.com/apache/nutch/commit/856a8abd31ac9a4d9944c1f9b494b8f94ded209f])
* (edit) src/plugin/any23/src/java/org/apache/nutch/any23/Any23ParseFilter.java
* (edit) conf/nutch-default.xml
* (edit) 
src/plugin/any23/src/test/org/apache/nutch/any23/TestAny23ParseFilter.java


> Any23 Plugin: Add Content-Type filtering
> 
>
> Key: NUTCH-2502
> URL: https://issues.apache.org/jira/browse/NUTCH-2502
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> It should be possible to filter based on a document's Content-Type when using 
> Any23 extractors.





[jira] [Resolved] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2502.
-
Resolution: Fixed

Thank you [~mfeltscher]

> Any23 Plugin: Add Content-Type filtering
> 
>
> Key: NUTCH-2502
> URL: https://issues.apache.org/jira/browse/NUTCH-2502
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> It should be possible to filter based on a document's Content-Type when using 
> Any23 extractors.





[jira] [Updated] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2502:

Fix Version/s: 1.15

> Any23 Plugin: Add Content-Type filtering
> 
>
> Key: NUTCH-2502
> URL: https://issues.apache.org/jira/browse/NUTCH-2502
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> It should be possible to filter based on a document's Content-Type when using 
> Any23 extractors.





[jira] [Updated] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2499:

Fix Version/s: 1.15

> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> Due to a change in 
> https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a
>  the Elastic REST indexer does not work with HashSets for values anymore but 
> instead saves duplicated values as arrays.





[jira] [Resolved] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2499.
-
Resolution: Fixed

Thank you [~mfeltscher]

 

> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
> Fix For: 1.15
>
>
> Due to a change in 
> https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a
>  the Elastic REST indexer does not work with HashSets for values anymore but 
> instead saves duplicated values as arrays.





[jira] [Assigned] (NUTCH-2495) Use -deleteGone instead of clean job in crawler script while indexing

2018-01-23 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2495:
---

Assignee: Lewis John McGibbney  (was: Moreno Feltscher)

> Use -deleteGone instead of clean job in crawler script while indexing
> -
>
> Key: NUTCH-2495
> URL: https://issues.apache.org/jira/browse/NUTCH-2495
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>
> Instead of running {{bin/nutch clean}} after indexing the documents, run 
> {{bin/nutch index}} with the {{-deleteGone}} flag, which deletes not only 
> gone and duplicated documents but also redirects from the index.





[jira] [Assigned] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2502:
---

Assignee: Lewis John McGibbney  (was: Moreno Feltscher)

> Any23 Plugin: Add Content-Type filtering
> 
>
> Key: NUTCH-2502
> URL: https://issues.apache.org/jira/browse/NUTCH-2502
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>
> It should be possible to filter based on a document's Content-Type when using 
> Any23 extractors.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-23 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2501:
---

Assignee: Lewis John McGibbney  (was: Moreno Feltscher)

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>






[jira] [Resolved] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-2503.
-
Resolution: Fixed

Thank you [~mfeltscher]

> Add option to run tests for a single plugin
> ---
>
> Key: NUTCH-2503
> URL: https://issues.apache.org/jira/browse/NUTCH-2503
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> Sometimes it makes sense to just run tests for a single plugin instead of 
> building all plugins and running all tests at once.





[jira] [Updated] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-2503:

Fix Version/s: 1.15

> Add option to run tests for a single plugin
> ---
>
> Key: NUTCH-2503
> URL: https://issues.apache.org/jira/browse/NUTCH-2503
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
> Fix For: 1.15
>
>
> Sometimes it makes sense to just run tests for a single plugin instead of 
> building all plugins and running all tests at once.





[jira] [Assigned] (NUTCH-2499) Elastic REST Indexer: Duplicate values

2018-01-23 Thread Moreno Feltscher (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2499?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Moreno Feltscher reassigned NUTCH-2499:
---

Assignee: Lewis John McGibbney  (was: Moreno Feltscher)

> Elastic REST Indexer: Duplicate values
> --
>
> Key: NUTCH-2499
> URL: https://issues.apache.org/jira/browse/NUTCH-2499
> Project: Nutch
>  Issue Type: Bug
>Reporter: Moreno Feltscher
>Assignee: Lewis John McGibbney
>Priority: Major
>
> Due to a change in 
> https://github.com/apache/nutch/commit/160758023e3de83894ae4fe654c17fde62aba50e#diff-408fd2f17bc9791dcbf531ffe6574a6a
>  the Elastic REST indexer does not work with HashSets for values anymore but 
> instead saves duplicated values as arrays.





[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2018-01-23 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335999#comment-16335999
 ] 

Moreno Feltscher commented on NUTCH-2501:
-

Pull request: https://github.com/apache/nutch/pull/279

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>






[jira] [Commented] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335991#comment-16335991
 ] 

Moreno Feltscher commented on NUTCH-2503:
-

Pull request: https://github.com/apache/nutch/pull/281

> Add option to run tests for a single plugin
> ---
>
> Key: NUTCH-2503
> URL: https://issues.apache.org/jira/browse/NUTCH-2503
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>
> Sometimes it makes sense to just run tests for a single plugin instead of 
> building all plugins and running all tests at once.





[jira] [Commented] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Moreno Feltscher (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2502?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335994#comment-16335994
 ] 

Moreno Feltscher commented on NUTCH-2502:
-

Pull request: https://github.com/apache/nutch/pull/280

> Any23 Plugin: Add Content-Type filtering
> 
>
> Key: NUTCH-2502
> URL: https://issues.apache.org/jira/browse/NUTCH-2502
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>
> It should be possible to filter based on a document's Content-Type when using 
> Any23 extractors.





[jira] [Commented] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2503?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335984#comment-16335984
 ] 

Markus Jelsma commented on NUTCH-2503:
--

Hmm, in the past you could run ant -f src/plugin/urlfilter-suffix/build.xml test 
and it ran that specific test. Nowadays I get errors:
{code}
[javac] /home/markus/projects/apache/nutch/svn/trunk/src/plugin/urlfilter-suffix/src/test/org/apache/nutch/urlfilter/suffix/TestSuffixURLFilter.java:118: error: cannot find symbol
[javac]   Assert.assertTrue(urlsModeAcceptAndPathFilter[i] == filter
[javac]   ^
[javac]   symbol:   variable Assert
[javac]   location: class TestSuffixURLFilter
{code}

> Add option to run tests for a single plugin
> ---
>
> Key: NUTCH-2503
> URL: https://issues.apache.org/jira/browse/NUTCH-2503
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Moreno Feltscher
>Assignee: Moreno Feltscher
>Priority: Major
>
> Sometimes it makes sense to just run tests for a single plugin instead of 
> building all plugins and running all tests at once.





[jira] [Created] (NUTCH-2503) Add option to run tests for a single plugin

2018-01-23 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2503:
---

 Summary: Add option to run tests for a single plugin
 Key: NUTCH-2503
 URL: https://issues.apache.org/jira/browse/NUTCH-2503
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


Sometimes it makes sense to just run tests for a single plugin instead of 
building all plugins and running all tests at once.





[jira] [Commented] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-23 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16335949#comment-16335949
 ] 

Markus Jelsma commented on NUTCH-2466:
--

First patch, which makes maxRedir configurable and uses filterNormalize instead 
of just normalize.
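
The idea described in this comment can be sketched roughly as follows. This is not the attached patch; it is a self-contained illustration in which the redirect lookup is a plain map standing in for HTTP fetches, and the names RedirectFollower and resolve are hypothetical. Each hop is passed through a combined filter-and-normalize step, and the chain is abandoned once more than maxRedir redirects have been followed.

```java
import java.util.Map;
import java.util.function.UnaryOperator;

// Hypothetical sketch, not the NUTCH-2466 patch.
public class RedirectFollower {

  /**
   * Follows redirects starting at start, for at most maxRedir hops.
   *
   * redirects maps a URL to its redirect target (a stand-in for fetching).
   * filterNormalize returns the normalized URL, or null if it is rejected.
   * Returns the final URL, or null if a URL was filtered out or the
   * redirect limit was exceeded.
   */
  public static String resolve(String start, Map<String, String> redirects,
      UnaryOperator<String> filterNormalize, int maxRedir) {
    String url = filterNormalize.apply(start);
    int hops = 0;
    while (url != null && redirects.containsKey(url)) {
      if (hops >= maxRedir) {
        return null; // too many redirects, give up
      }
      hops++;
      url = filterNormalize.apply(redirects.get(url));
    }
    return url;
  }
}
```

With a chain such as http to https to sitemap_index.xml, a limit of 1 stops after the first hop, while a limit of 2 or more reaches the final target.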

> Sitemap processor to follow redirects
> -
>
> Key: NUTCH-2466
> URL: https://issues.apache.org/jira/browse/NUTCH-2466
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2466.patch, NUTCH-2466.patch
>
>
> It does follow http > https, but not the following redirect, e.g. 
> sitemap_index.xml that some websites have.





[jira] [Updated] (NUTCH-2466) Sitemap processor to follow redirects

2018-01-23 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2466?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2466:
-
Attachment: NUTCH-2466.patch

> Sitemap processor to follow redirects
> -
>
> Key: NUTCH-2466
> URL: https://issues.apache.org/jira/browse/NUTCH-2466
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.13
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.15
>
> Attachments: NUTCH-2466.patch, NUTCH-2466.patch
>
>
> It does follow http > https, but not the following redirect, e.g. 
> sitemap_index.xml that some websites have.





[jira] [Created] (NUTCH-2502) Any23 Plugin: Add Content-Type filtering

2018-01-23 Thread Moreno Feltscher (JIRA)
Moreno Feltscher created NUTCH-2502:
---

 Summary: Any23 Plugin: Add Content-Type filtering
 Key: NUTCH-2502
 URL: https://issues.apache.org/jira/browse/NUTCH-2502
 Project: Nutch
  Issue Type: Improvement
Reporter: Moreno Feltscher
Assignee: Moreno Feltscher


It should be possible to filter based on a document's Content-Type when using 
Any23 extractors.
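
As a rough illustration of what such a filter might look like (a sketch only, not the code in the pull request; the class ContentTypeFilter and its methods are hypothetical names), a document's Content-Type could be compared against a configured allow list before any extractor runs:

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical sketch, not the Any23ParseFilter implementation.
public class ContentTypeFilter {

  private final Set<String> allowed = new HashSet<String>();

  /** Builds a filter from the list of accepted content types. */
  public ContentTypeFilter(String... contentTypes) {
    for (String type : contentTypes) {
      allowed.add(normalize(type));
    }
  }

  /** Strips parameters such as "; charset=UTF-8" and lower-cases the type. */
  static String normalize(String contentType) {
    int semicolon = contentType.indexOf(';');
    String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
    return base.trim().toLowerCase(Locale.ROOT);
  }

  /** Returns true if the document's Content-Type is in the allow list. */
  public boolean accepts(String contentType) {
    return contentType != null && allowed.contains(normalize(contentType));
  }
}
```

Ignoring the charset parameter and letter case is a design assumption made here so that "Text/HTML; charset=UTF-8" matches a configured "text/html".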


