[jira] [Updated] (NUTCH-2124) redirect following same link again and again , max redirect exceed and went db_gone

2015-09-28 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2124:
---
Priority: Blocker  (was: Major)

> redirect following same link again and again , max redirect exceed and went 
> db_gone
> ---
>
> Key: NUTCH-2124
> URL: https://issues.apache.org/jira/browse/NUTCH-2124
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Yogendra Kumar Soni
>Priority: Blocker
>  Labels: db_gone, fetcher, redirect
>
> Hello, followredirect is not working in trunk. Please see the log below.
> Fetcher: throughput threshold retries: 5
> fetcher.maxNum.threads can't be < than 50 : using 50 instead
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=1
> {color:red}
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
>  - redirect count exceeded http://www.wikipedia.com/wiki/URL_redirection
> {color}
> Thread FetcherThread has no more work available
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> -activeThreads=0
> Fetcher: finished at 2015-09-28 19:32:05, elapsed: 00:00:09
> Parsing : 20150928193153



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-20 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14899934#comment-14899934
 ] 

Sebastian Nagel commented on NUTCH-2110:


Hi Asitang, the Injector is already able to store key-value pairs from the seed 
list in CrawlDb within the CrawlDatum's metadata, see 
[[1|http://nutch.apache.org/apidocs/apidocs-1.10/org/apache/nutch/crawl/Injector.html]].
 If the XPath statements are not too complex, this would be the easiest way: 
the protocol plugin could then read the XPath from the CrawlDatum.
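For illustration, a seed line carries the URL followed by tab-separated 
key=value pairs; the key name {{xpath}} below is hypothetical, not a predefined 
one:
{noformat}
# <url> <tab> <key>=<value> [<tab> <key>=<value> ...]
http://www.example.com/listing	xpath=//div[@id='results']//a/@href
{noformat}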
Regarding the "state of a selenium operation": should the state be passed to 
the outlinks of a page, or is the same page fetched multiple times with varying 
Ajax/JavaScript actions to be performed?

> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter seach terms).selenium" 
> --
>
> Key: NUTCH-2110
> URL: https://issues.apache.org/jira/browse/NUTCH-2110
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher
>Affects Versions: 1.10
>Reporter: Asitang Mishra
>  Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter seach terms).selenium" to be used by selenium 
> protocols/plugins as urls/flow to reach a specific Ajax-based page or save 
> the state of a selenium operation for the next fetching round.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2110) Create the capability to provide seeds in the form of "url+xpath(including option to enter seach terms).selenium"

2015-09-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903524#comment-14903524
 ] 

Sebastian Nagel commented on NUTCH-2110:


Ok, understood. One point to consider: shall all paginated documents be kept 
under the same URL? As a batch crawler, Nutch uses the URL in many places to 
uniquely identify content, metadata, status information, indexed documents, 
etc. Of course, the outlinks generated for page 1 could be modified by adding a 
suffix which makes the URL unique; only inside protocol-selenium would the 
suffix be removed to fetch the right page.
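A minimal sketch of that suffix trick (the class, method names, and the 
{{#!page=}} marker are made up for illustration):
{code:java}
// Hypothetical helper: make paginated outlinks unique in CrawlDb and strip
// the marker again inside protocol-selenium just before the real fetch.
public class PageSuffix {
  private static final String MARKER = "#!page=";

  // used when emitting the outlinks of page 1
  static String addSuffix(String url, int page) {
    return url + MARKER + page;
  }

  // used inside protocol-selenium to recover the fetchable URL
  static String removeSuffix(String url) {
    int i = url.indexOf(MARKER);
    return i < 0 ? url : url.substring(0, i);
  }
}
{code}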

> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter seach terms).selenium" 
> --
>
> Key: NUTCH-2110
> URL: https://issues.apache.org/jira/browse/NUTCH-2110
> Project: Nutch
>  Issue Type: Sub-task
>  Components: fetcher
>Affects Versions: 1.10
>Reporter: Asitang Mishra
>  Labels: memex
>
> Create the capability to provide seeds in the form of "url+xpath(including 
> option to enter seach terms).selenium" to be used by selenium 
> protocols/plugins as urls/flow to reach a specific Ajax-based page or save 
> the state of a selenium operation for the next fetching round.
> At least, this should make Nutch capable of distinguishing whether a URL 
> should be opened using the basic http, httpclient or selenium protocols, and 
> provide 
> the selenium protocol with basic authentication capabilities based on the 
> above ideas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-18 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14847281#comment-14847281
 ] 

Sebastian Nagel commented on NUTCH-2106:


Avoiding conflicting dependencies is the reason for the Nutch plugin system 
[[1|https://wiki.apache.org/nutch/WhatsTheProblemWithPluginsAndClass-loading]]. 
However, if a plugin depends on another plugin and both depend on a library, 
there is no way around it: both plugins must rely on the same version (or on 
two versions with a compatible API).
- protocol-selenium depends on lib-selenium
- both depend on selenium-java (currently the same version)
- when the plugin protocol-selenium is loaded, the lib-selenium.jar is just 
added to the classpath of protocol-selenium's own class loader. The classes 
from lib-selenium.jar do not live in their own class loader! They are used 
directly (and not via the lib-selenium plugin instance) from classes in 
protocol-selenium.
- the same holds for protocol-interactiveselenium

As a consequence, the Selenium version used by lib-selenium dictates the 
version to be used by the two protocol plugins. So, why not bundle Selenium 
jars and dependencies in lib-selenium?

> Runtime to contain Selenium and dependencies only once
> --
>
> Key: NUTCH-2106
> URL: https://issues.apache.org/jira/browse/NUTCH-2106
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
> Fix For: 1.11
>
> Attachments: NUTCH-2106.patch
>
>
> All Selenium-based plugins contain the same dependent jars, which 
> significantly affects the size of the runtime and bin package:
> {noformat}
> % du -hs runtime/local/plugins/*selenium/ runtime/deploy/*.job
> 25M runtime/local/plugins/lib-selenium/
> 25M runtime/local/plugins/protocol-interactiveselenium/
> 25M runtime/local/plugins/protocol-selenium/
> 182M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Since all plugins depend on the same Selenium version we could bundle the 
> dependencies in lib-selenium and let the other plugins load it from there:
> - let lib-selenium export all dependent libs, e.g.:
> {code:xml|title=lib-selenium/plugin.xml}
> <runtime>
>   ...
>   <library name="...">
>     <export name="*"/>
>   </library>
> </runtime>
> {code}
> - both protocol plugins already import lib-selenium: the dependencies in 
> ivy.xml can be removed
> As expected, these changes make the runtime smaller:
> {noformat}
> 25M runtime/local/plugins/lib-selenium/
> 20K runtime/local/plugins/protocol-interactiveselenium/
> 16K runtime/local/plugins/protocol-selenium/
> 138M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Open points:
> - I've tested only protocol-selenium using chromedriver. Should also test 
> protocol-interactiveselenium?
> - What about phantomjsdriver-1.2.1.jar? It was contained in lib-selenium and 
> protocol-selenium but not protocol-interactiveselenium. Is there a reason for 
> this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-21 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2106:
--

Assignee: Sebastian Nagel

> Runtime to contain Selenium and dependencies only once
> --
>
> Key: NUTCH-2106
> URL: https://issues.apache.org/jira/browse/NUTCH-2106
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
> Fix For: 1.11
>
> Attachments: NUTCH-2106.patch
>
>
> All Selenium-based plugins contain the same dependent jars, which 
> significantly affects the size of the runtime and bin package:
> {noformat}
> % du -hs runtime/local/plugins/*selenium/ runtime/deploy/*.job
> 25M runtime/local/plugins/lib-selenium/
> 25M runtime/local/plugins/protocol-interactiveselenium/
> 25M runtime/local/plugins/protocol-selenium/
> 182M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Since all plugins depend on the same Selenium version we could bundle the 
> dependencies in lib-selenium and let the other plugins load it from there:
> - let lib-selenium export all dependent libs, e.g.:
> {code:xml|title=lib-selenium/plugin.xml}
> <runtime>
>   ...
>   <library name="...">
>     <export name="*"/>
>   </library>
> </runtime>
> {code}
> - both protocol plugins already import lib-selenium: the dependencies in 
> ivy.xml can be removed
> As expected, these changes make the runtime smaller:
> {noformat}
> 25M runtime/local/plugins/lib-selenium/
> 20K runtime/local/plugins/protocol-interactiveselenium/
> 16K runtime/local/plugins/protocol-selenium/
> 138M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Open points:
> - I've tested only protocol-selenium using chromedriver. Should also test 
> protocol-interactiveselenium?
> - What about phantomjsdriver-1.2.1.jar? It was contained in lib-selenium and 
> protocol-selenium but not protocol-interactiveselenium. Is there a reason for 
> this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2106) Runtime to contain Selenium and dependencies only once

2015-09-21 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2106?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2106.

Resolution: Fixed

Committed to trunk, r1704425. Thanks, Lewis!

> Runtime to contain Selenium and dependencies only once
> --
>
> Key: NUTCH-2106
> URL: https://issues.apache.org/jira/browse/NUTCH-2106
> Project: Nutch
>  Issue Type: Bug
>  Components: build
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
> Fix For: 1.11
>
> Attachments: NUTCH-2106.patch
>
>
> All Selenium-based plugins contain the same dependent jars, which 
> significantly affects the size of the runtime and bin package:
> {noformat}
> % du -hs runtime/local/plugins/*selenium/ runtime/deploy/*.job
> 25M runtime/local/plugins/lib-selenium/
> 25M runtime/local/plugins/protocol-interactiveselenium/
> 25M runtime/local/plugins/protocol-selenium/
> 182M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Since all plugins depend on the same Selenium version we could bundle the 
> dependencies in lib-selenium and let the other plugins load it from there:
> - let lib-selenium export all dependent libs, e.g.:
> {code:xml|title=lib-selenium/plugin.xml}
> 
>   ...
>   
> 
>   
> {code}
> - both protocol plugins already import lib-selenium: the dependencies in 
> ivy.xml can be removed
> As expected, these changes make the runtime smaller:
> {noformat}
> 25M runtime/local/plugins/lib-selenium/
> 20K runtime/local/plugins/protocol-interactiveselenium/
> 16K runtime/local/plugins/protocol-selenium/
> 138M runtime/deploy/apache-nutch-1.11-SNAPSHOT.job
> {noformat}
> Open points:
> - I've tested only protocol-selenium using chromedriver. Should also test 
> protocol-interactiveselenium?
> - What about phantomjsdriver-1.2.1.jar? It was contained in lib-selenium and 
> protocol-selenium but not protocol-interactiveselenium. Is there a reason for 
> this?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2124) redirect following same link again and again , max redirect exceed and went db_gone

2015-10-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943609#comment-14943609
 ] 

Sebastian Nagel commented on NUTCH-2124:


I've tested the patch with the mentioned URL as the only seed URL and 
http.redirect.max == 5:
{noformat}
...
2015-09-28 19:46:16,183 INFO  crawl.Injector - Injector: Total new urls 
injected: 1
...
2015-09-28 19:46:23,342 INFO  fetcher.FetcherThread - fetching 
http://www.wikipedia.com/wiki/URL_redirection (queue crawl delay=1000ms)
2015-09-28 19:46:23,342 INFO  fetcher.FetcherThread - Using queue mode : byHost
2015-09-28 19:46:23,343 DEBUG fetcher.FetcherThread - redirectCount=0
...
2015-09-28 19:46:24,096 DEBUG fetcher.FetcherThread -  - protocol redirect to 
http://www.wikipedia.org/wiki/URL_redirection (fetching now)
2015-09-28 19:46:24,097 INFO  fetcher.FetcherThread - fetching 
http://www.wikipedia.org/wiki/URL_redirection (queue crawl delay=1000ms)
2015-09-28 19:46:24,097 DEBUG fetcher.FetcherThread - redirectCount=1
2015-09-28 19:46:24,179 DEBUG fetcher.FetcherThread -  - protocol redirect to 
https://www.wikipedia.org/wiki/URL_redirection (fetching now)
2015-09-28 19:46:24,180 INFO  fetcher.FetcherThread - fetching 
https://www.wikipedia.org/wiki/URL_redirection (queue crawl delay=1000ms)
2015-09-28 19:46:24,180 DEBUG fetcher.FetcherThread - redirectCount=2
...
2015-09-28 19:46:25,460 DEBUG fetcher.FetcherThread -  - protocol redirect to 
https://en.wikipedia.org/wiki/URL_redirection (fetching now)
2015-09-28 19:46:25,461 INFO  fetcher.FetcherThread - fetching 
https://en.wikipedia.org/wiki/URL_redirection (queue crawl delay=1000ms)
2015-09-28 19:46:25,461 DEBUG fetcher.FetcherThread - redirectCount=3
...
2015-09-28 19:46:36,441 INFO  crawl.CrawlDbReader - status 1 (db_unfetched):
58
2015-09-28 19:46:36,441 INFO  crawl.CrawlDbReader - status 2 (db_fetched):  
1
2015-09-28 19:46:36,441 INFO  crawl.CrawlDbReader - status 5 (db_redir_perm):   
3
...
{noformat}

Can you verify the solution again with the given URL and http.redirect.max set 
large enough to follow all redirects?
Let's track further problems as separate issues so that this one can get fixed.
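For reference, http.redirect.max lives in the Nutch configuration; a value of 0 
(the default) makes the fetcher record redirects for a later fetch round 
instead of following them immediately:
{code:xml|title=conf/nutch-site.xml}
<property>
  <name>http.redirect.max</name>
  <!-- follow up to 5 redirects within the same fetch; 0 = queue them instead -->
  <value>5</value>
</property>
{code}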

> redirect following same link again and again , max redirect exceed and went 
> db_gone
> ---
>
> Key: NUTCH-2124
> URL: https://issues.apache.org/jira/browse/NUTCH-2124
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Yogendra Kumar Soni
>Priority: Blocker
>  Labels: db_gone, fetcher, redirect
> Fix For: 1.11
>
> Attachments: NUTCH-2124.patch
>
>
> Hello, followredirect is not working in trunk. Please see the log below.
> Fetcher: throughput threshold retries: 5
> fetcher.maxNum.threads can't be < than 50 : using 50 instead
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=1
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=1
> {color:red}
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> fetching http://www.wikipedia.com/wiki/URL_redirection (queue crawl 
> delay=5000ms)
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> -activeThreads=1, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
>  - redirect count exceeded http://www.wikipedia.com/wiki/URL_redirection
> {color}
> Thread FetcherThread has no more work available
> -finishing thread FetcherThread, activeThreads=0
> -activeThreads=0, spinWaiting=0, fetchQueues.totalSize=0, 
> fetchQueues.getQueueCount=2
> -activeThreads=0
> Fetcher: finished at 2015-09-28 19:32:05, elapsed: 00:00:09
> Parsing : 20150928193153



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2132) Publisher/Subscriber model for Nutch to emit events

2015-10-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2132?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14943637#comment-14943637
 ] 

Sebastian Nagel commented on NUTCH-2132:


No question, this is a significant improvement over FetchNodeDb (NUTCH-2011). A 
few comments (without setting up a RabbitMQ consumer):
- (same as for FetchNodeDb) the default should be that nothing is done if no 
publisher is configured. Even constructing a FetcherThreadEvent (it's a huge 
object with a parsing fetcher) would add noticeable overhead
- since it's not trivial to configure the message queue, it's important to 
catch improper or missing configurations early (in setConf(), see the sketch 
after this list), and not while the crawler is running:
{noformat}
fetch of ... failed with: java.lang.NullPointerException
at 
org.apache.nutch.tools.RabbitMQPublisher.publish(RabbitMQPublisher.java:70)
at 
org.apache.nutch.fetcher.FetcherThreadPublisher.publish(FetcherThreadPublisher.java:43)
at org.apache.nutch.fetcher.FetcherThread.run(FetcherThread.java:253)
{noformat}
- is it really necessary to send 2 events (start + end) per fetched URL?  
Fetching is usually quite fast (< 1s).  Granted, it may be useful to track 
documents which take long to fetch, but that happens rarely and there are 
timeouts to prevent the fetcher itself from hanging forever.
- start and end events are on different loop levels (do-while): with redirects 
and http.redirect.max > 0 you'll get unpaired events.
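A rough sketch of such an early check in setConf() (the {{publisher.class}} 
property name and the NutchPublisher type are assumptions, not existing Nutch 
API):
{code:java}
// Sketch only: fail fast on a broken publisher configuration instead of
// hitting a NullPointerException for every fetched URL.
@Override
public void setConf(Configuration conf) {
  this.conf = conf;
  String impl = conf.get("publisher.class");        // hypothetical property
  if (impl == null || impl.trim().isEmpty()) {
    publisher = null;                               // publish() becomes a no-op
    return;
  }
  try {
    publisher = (NutchPublisher) Class.forName(impl).newInstance();
    publisher.setConf(conf);                        // may validate host/port etc.
  } catch (Exception e) {
    throw new RuntimeException("Cannot set up publisher " + impl, e);
  }
}
{code}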


> Publisher/Subscriber model for Nutch to emit events 
> 
>
> Key: NUTCH-2132
> URL: https://issues.apache.org/jira/browse/NUTCH-2132
> Project: Nutch
>  Issue Type: New Feature
>  Components: fetcher, REST_api
>Reporter: Sujen Shah
>  Labels: memex
> Fix For: 1.11
>
> Attachments: NUTCH-2132.patch
>
>
> It would be nice to have a Pub/Sub model in Nutch to emit certain events (ex- 
> Fetcher events like fetch-start, fetch-end, a fetch report which may contain 
> data like outlinks of the current fetched url, score, etc). 
> A consumer of this functionality could use this data to generate real time 
> visualization and generate statics of the crawl without having to wait for 
> the fetch round to finish. 
> The REST API could contain an endpoint which would respond with a url to 
> which a client could subscribe to get the fetcher events. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2179) Cleanup job for SOLR Performance Boost

2015-12-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034511#comment-15034511
 ] 

Sebastian Nagel commented on NUTCH-2179:


+1: SolrIndexWriter should queue the deletions the same way as done for 
additions/updates. It looks like the bulk commit via an UpdateRequest is 
already assumed, because numDeletes is taken into account when checking whether 
the batchSize is reached (SolrIndexWriter, line 125: {{if (inputDocs.size() + 
numDeletes >= batchSize)}}).
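A rough sketch of batched deletions via SolrJ's UpdateRequest (field names 
follow the quoted snippet; the rest is assumed):
{code:java}
// Sketch: collect ids and send additions and deletions in one UpdateRequest.
private final List<String> deleteIds = new ArrayList<String>();
private int numDeletes = 0;

public void delete(String key) throws IOException {
  deleteIds.add(key);
  numDeletes++;
  if (inputDocs.size() + numDeletes >= batchSize) {
    push();                                   // same flush path as for additions
  }
}

private void push() throws IOException {
  UpdateRequest req = new UpdateRequest();
  if (!inputDocs.isEmpty()) req.add(inputDocs);
  if (!deleteIds.isEmpty()) req.deleteById(deleteIds);
  try {
    req.process(solr);                        // one round trip per batch
  } catch (SolrServerException e) {
    throw new IOException(e);
  }
  inputDocs.clear();
  deleteIds.clear();
  numDeletes = 0;
}
{code}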

> Cleanup job for SOLR Performance Boost
> --
>
> Key: NUTCH-2179
> URL: https://issues.apache.org/jira/browse/NUTCH-2179
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.9, 1.10, 1.11
>Reporter: David Johnson
>Priority: Minor
>  Labels: patch
>   Original Estimate: 2h
>  Remaining Estimate: 2h
>
> During a cleanup job, index deletes are scheduled one by one, which can make 
> a large job take days.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-01 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2172:
---
Attachment: NUTCH-2172-1.patch

Patch to add a template for conf/contenttype-mapping.txt (instantiated by ant).
Examples are taken from NUTCH-1262 (the examples suggest that custom target 
types may include white space).

> Parsing whitespace not just tabs in contenttype-mapping.txt
> ---
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Priority: Minor
>  Labels: easyfix, newbie
> Attachments: NUTCH-2172-1.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the 
> magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-2107) plugin.xml to validate against plugin.dtd

2015-12-01 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2107:
--

Assignee: Sebastian Nagel

> plugin.xml to validate against plugin.dtd
> -
>
> Key: NUTCH-2107
> URL: https://issues.apache.org/jira/browse/NUTCH-2107
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 2.3, 1.10, 1.11
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-2107.patch
>
>
> Some of the plugin.xml files do not validate against the plugin.dtd:
> {noformat}
> % xmllint --noout --dtdvalid ./src/plugin/plugin.dtd 
> src/plugin/urlnormalizer-regex/plugin.xml
> src/plugin/urlnormalizer-regex/plugin.xml:30: element requires: validity 
> error : Element requires content does not follow the DTD, expecting 
> (import)+, got (include )
> src/plugin/urlnormalizer-regex/plugin.xml:31: element include: validity error 
> : No declaration for element include
> src/plugin/urlnormalizer-regex/plugin.xml:31: element include: validity error 
> : No declaration for attribute file of element include
> Document src/plugin/urlnormalizer-regex/plugin.xml does not validate against 
> ./src/plugin/plugin.dtd
> % ...
> src/plugin/subcollection/plugin.xml:22: element plugin: validity error : 
> Element plugin content does not follow the DTD, expecting (runtime? , 
> requires? , extension-point* , extension*), got (requires runtime extension )
> % ...
> src/plugin/lib-selenium/plugin.xml:76: element requires: validity error : 
> Element requires content does not follow the DTD, expecting (import)+, got 
> (library library )
> src/plugin/lib-selenium/plugin.xml:80: element library: validity error : 
> Element library content does not follow the DTD, expecting (export)*, got 
> (export exclude )
> src/plugin/lib-selenium/plugin.xml:82: element exclude: validity error : No 
> declaration for element exclude
> src/plugin/lib-selenium/plugin.xml:82: element exclude: validity error : No 
> declaration for attribute name of element exclude
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2107) plugin.xml to validate against plugin.dtd

2015-12-01 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2107?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2107.

   Resolution: Fixed
Fix Version/s: (was: 1.12)
   (was: 2.4)
   2.3.1
   1.11

Committed to trunk r1717536 and 2.x r1717537.

> plugin.xml to validate against plugin.dtd
> -
>
> Key: NUTCH-2107
> URL: https://issues.apache.org/jira/browse/NUTCH-2107
> Project: Nutch
>  Issue Type: Improvement
>  Components: plugin
>Affects Versions: 2.3, 1.10, 1.11
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.11, 2.3.1
>
> Attachments: NUTCH-2107.patch
>
>
> Some of the plugin.xml files do not validate against the plugin.dtd:
> {noformat}
> % xmllint --noout --dtdvalid ./src/plugin/plugin.dtd 
> src/plugin/urlnormalizer-regex/plugin.xml
> src/plugin/urlnormalizer-regex/plugin.xml:30: element requires: validity 
> error : Element requires content does not follow the DTD, expecting 
> (import)+, got (include )
> src/plugin/urlnormalizer-regex/plugin.xml:31: element include: validity error 
> : No declaration for element include
> src/plugin/urlnormalizer-regex/plugin.xml:31: element include: validity error 
> : No declaration for attribute file of element include
> Document src/plugin/urlnormalizer-regex/plugin.xml does not validate against 
> ./src/plugin/plugin.dtd
> % ...
> src/plugin/subcollection/plugin.xml:22: element plugin: validity error : 
> Element plugin content does not follow the DTD, expecting (runtime? , 
> requires? , extension-point* , extension*), got (requires runtime extension )
> % ...
> src/plugin/lib-selenium/plugin.xml:76: element requires: validity error : 
> Element requires content does not follow the DTD, expecting (import)+, got 
> (library library )
> src/plugin/lib-selenium/plugin.xml:80: element library: validity error : 
> Element library content does not follow the DTD, expecting (export)*, got 
> (export exclude )
> src/plugin/lib-selenium/plugin.xml:82: element exclude: validity error : No 
> declaration for element exclude
> src/plugin/lib-selenium/plugin.xml:82: element exclude: validity error : No 
> declaration for attribute name of element exclude
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2172) index-more: document format of contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2172:
---
Component/s: indexer

> index-more: document format of contenttype-mapping.txt
> --
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 1.12
>
> Attachments: NUTCH-2172-1.patch, NUTCH-2172-2.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the 
> magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2172:
--

Assignee: Sebastian Nagel

> Parsing whitespace not just tabs in contenttype-mapping.txt
> ---
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: easyfix, newbie
> Attachments: NUTCH-2172-1.patch, NUTCH-2172-2.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the 
> magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2172:
---
Fix Version/s: 1.12

> Parsing whitespace not just tabs in contenttype-mapping.txt
> ---
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 1.12
>
> Attachments: NUTCH-2172-1.patch, NUTCH-2172-2.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the 
> magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2172:
---
Issue Type: Improvement  (was: Bug)

> Parsing whitespace not just tabs in contenttype-mapping.txt
> ---
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Improvement
>  Components: metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 1.12
>
> Attachments: NUTCH-2172-1.patch, NUTCH-2172-2.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the 
> magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2172) index-more: document format of contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2172:
---
Summary: index-more: document format of contenttype-mapping.txt  (was: 
Parsing whitespace not just tabs in contenttype-mapping.txt)

> index-more: document format of contenttype-mapping.txt
> --
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Improvement
>  Components: metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 1.12
>
> Attachments: NUTCH-2172-1.patch, NUTCH-2172-2.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the 
> magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2172) index-more: document format of contenttype-mapping.txt

2015-12-06 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2172.

Resolution: Fixed

Committed to trunk, r1718223.

Thanks, [~nicola.tonellotto]! Although this solution is not the one you've 
suggested, hopefully it will help users write the content type mappings in the 
right format. 
If a non-comment line does not follow the format, the hadoop.log will now show 
a warning, e.g.:
{noformat}
2015-12-06 22:06:05,169 INFO  more.MoreIndexingFilter - Reading content type 
mappings from file contenttype-mapping.txt
2015-12-06 22:06:05,174 WARN  more.MoreIndexingFilter - Wrong format of line: 
only spaces no tabs
2015-12-06 22:06:05,174 WARN  more.MoreIndexingFilter - Expected format: 
<target type><tab><content type> [<tab><content type> ...]
{noformat}
Btw., multiple tabs as separators are no problem, as the original format allows 
multiple content types to be mapped to one target type. 
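A hypothetical mapping line in that format (target type first, then one or 
more content types, all tab-separated):
{noformat}
# <target type> <tab> <content type> [<tab> <content type> ...]
Portable Document Format	application/pdf	application/x-pdf
{noformat}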

> index-more: document format of contenttype-mapping.txt
> --
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Assignee: Sebastian Nagel
>Priority: Minor
>  Labels: easyfix, newbie
> Fix For: 1.12
>
> Attachments: NUTCH-2172-1.patch, NUTCH-2172-2.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the 
> magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2076) exceptions are not handled when using method waitForCompletion in a try block

2015-12-08 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2076?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15047481#comment-15047481
 ] 

Sebastian Nagel commented on NUTCH-2076:


After a second look: the problem is the return statement in the finally block, 
which causes the exception not to be thrown. In case of an exception (as well 
as an unsuccessful job) we should either mark the failed job in the returned 
results structure or pass the exception to the calling function.
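A sketch of the second option, based on the snippet quoted below: drop the 
early return so that an exception from waitForCompletion() reaches the caller:
{code:java}
try {
  currentJob.waitForCompletion(true);          // may throw IOException etc.
} finally {
  ToolUtil.recordJobStatus(null, currentJob, results);
  if (!currentJob.isSuccessful()) {
    fileSystem.delete(tmpFolder, true);        // clean up, but no return here:
  }                                            // returning from finally would
}                                              // swallow the pending exception
{code}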

> exceptions are not handled when using method waitForCompletion in a try block
> -
>
> Key: NUTCH-2076
> URL: https://issues.apache.org/jira/browse/NUTCH-2076
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 2.2, 2.3
>Reporter: songwanging
>Priority: Minor
>
> Locations: src\java\org\apache\nutch\crawl\WebTableReader.java
> When calling waitForCompletion in a try block, exceptions are not handled.
> waitForCompletion may throw IOException, InterruptedException, or 
> ClassNotFoundException, so when calling this function in a try block we 
> should use a catch block to handle the potential exceptions.
> public Map<String, Object> run(Map<String, Object> args) throws Exception {
> ...
> try {
>   currentJob.waitForCompletion(true);
> } finally {
>   ToolUtil.recordJobStatus(null, currentJob, results);
>   if (!currentJob.isSuccessful()) {
> fileSystem.delete(tmpFolder, true);
> return results;
>   }
> }
> ...
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-03 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2172:
---
Attachment: NUTCH-2172-2.patch

It is about MIME types which are already normalized either by Tika or via 
MimeUtil.cleanMimeType(), not about potential garbage sent by web servers.
Anyway, we should keep the format as is but provide documentation and add 
meaningful warnings. The new patch also does the latter.

> Parsing whitespace not just tabs in contenttype-mapping.txt
> ---
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Priority: Minor
>  Labels: easyfix, newbie
> Attachments: NUTCH-2172-1.patch, NUTCH-2172-2.patch
>
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the 
> magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2172) Parsing whitespace not just tabs in contenttype-mapping.txt

2015-12-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2172?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15034352#comment-15034352
 ] 

Sebastian Nagel commented on NUTCH-2172:


This could be an improvement if we assume that MIME types do not contain white 
space. In fact, there may be space between subtype and optional parameters as 
in {{text/html; charset=UTF-8}} but that's a rather theoretical problem because 
[tika-mimetypes.xml|https://github.com/apache/tika/blob/trunk/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml]
 does not use any spaces before parameters, e.g. 
{{application/x-berkeley-db;format=hash;version=2}}.
More important would be an example contenttype-mapping.txt which explains the 
format, in combination with verbose warnings if the expected format isn't 
matched.
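For illustration, the difference between the two split calls on a 
space-separated line (and why "\\s+" would break target types containing 
spaces):
{code:java}
String line = "text/html  Web Page"; // intended: map text/html to "Web Page"
line.split("\t").length;             // 1 -> the line is silently skipped today
line.split("\\s+").length;           // 3 -> the target type "Web Page" is split too
{code}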


> Parsing whitespace not just tabs in contenttype-mapping.txt
> ---
>
> Key: NUTCH-2172
> URL: https://issues.apache.org/jira/browse/NUTCH-2172
> Project: Nutch
>  Issue Type: Bug
>  Components: metadata
>Affects Versions: 1.10
> Environment: Macosx, Java 8
>Reporter: Nicola Tonellotto
>Priority: Minor
>  Labels: easyfix, newbie
>   Original Estimate: 0.5h
>  Remaining Estimate: 0.5h
>
> The index-more plugin uses the conf/contenttype-mapping.txt file to build up 
> the mimeMap hash table (in the readConfiguration() method).
> The line splitting is performed around "\t", so it silently skips lines 
> separated by simple spaces or more than one tab (see line 325).
> Replacing the single-char string "\t" with the regex "\\s+" should do the 
> magic.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2193) Upgrade feed parser plugin to use rome 1.5

2016-01-04 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2193:
---
Attachment: NUTCH-2193.patch

> Upgrade feed parser plugin to use rome 1.5
> --
>
> Key: NUTCH-2193
> URL: https://issues.apache.org/jira/browse/NUTCH-2193
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.12
>
> Attachments: NUTCH-2193.patch
>
>
> The class loader issue in the rome library (NUTCH-1494, [[rometools 
> #130|https://github.com/rometools/rome/issues/130]]) is fixed with rome 1.5. 
> Time to upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2193) Upgrade feed parser plugin to use rome 1.5

2016-01-04 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2193:
--

 Summary: Upgrade feed parser plugin to use rome 1.5
 Key: NUTCH-2193
 URL: https://issues.apache.org/jira/browse/NUTCH-2193
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.11
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.12


The class loader issue in the rome library (NUTCH-1494, [[rometools 
#130|https://github.com/rometools/rome/issues/130]]) is fixed with rome 1.5. 
Time to upgrade.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2016-01-06 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15085327#comment-15085327
 ] 

Sebastian Nagel commented on NUTCH-2143:


Excellent! Please attach a 
[patch|http://wiki.apache.org/nutch/HowToContribute#Creating_a_patch] 
containing the fix. Thanks!

> GeneratorJob ignores batch id passed as argument
> 
>
> Key: NUTCH-2143
> URL: https://issues.apache.org/jira/browse/NUTCH-2143
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 2.3.1
>
>
> The batch id passed to GeneratorJob by option/argument -batchId <batchId> is 
> ignored and a generated batch id is used to mark the current batch. Log 
> snippets from a run of bin/crawl:
> {noformat}
> bin/nutch generate ... -batchId 1444941073-14208
> ...
> GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
> Fetching : 
> bin/nutch fetch ... 1444941073-14208 ...
> ...
> QueueFeeder finished: total 0 records. Hit by time limit :0
> {noformat}
> The generated URLs are marked with the wrong batch id:
> {noformat}
> hbase(main):010:0> scan 'test_webpage'
> ROWCOLUMN+CELL
>  org.apache.nutch:http/column=f:bid, timestamp=1444941077080, 
> value=1444941074-858443668
>  ...
>  org.apache.nutch:http/column=mk:_gnmrk_, timestamp=1444941077080, 
> value=1444941074-858443668
> {noformat}
> and fetcher will not fetch anything. This problem was reported by Sherban 
> Drulea 
> [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html]], 
> [[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083285#comment-15083285
 ] 

Sebastian Nagel commented on NUTCH-2168:


Hi [~kalanya], looks like the indexed raw content of the JPEGs is causing the 
invalid UTF-8 character. The index-html plugin tries to treat any raw content 
as readable content, converting it to a String based on the platform-dependent 
charset. What happens if the field content is specified as "binary" in 
schema.xml (cf. patch for NUTCH-2130)?
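For reference, a binary field declaration might look like this (attribute 
values assumed; Solr's example schema ships a "binary" field type backed by 
solr.BinaryField):
{code:xml|title=conf/schema.xml}
<field name="rawcontent" type="binary" stored="true" indexed="false"/>
{code}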

Without the patch applied, non-HTML documents simply fail to parse and are 
never indexed. That's probably why the fix for this issue surfaces the problem 
with index-html. I would suggest opening a separate issue to address the 
indexer-solr problem with raw content from index-html.

> Parse-tika fails to retrieve parser
> ---
>
> Key: NUTCH-2168
> URL: https://issues.apache.org/jira/browse/NUTCH-2168
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types 
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing 
> http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser 
> for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse 
> content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2191) Add protocol-htmlunit

2016-01-05 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15083603#comment-15083603
 ] 

Sebastian Nagel commented on NUTCH-2191:


As [~haraldk] mentioned in [this 
discussion|https://mail-archives.apache.org/mod_mbox/nutch-user/201404.mbox/%3c53576563.1030...@raytion.com%3E], 
there is isolation only between plugins (resp. their class loaders), but not 
between a plugin class loader and its parent, which loads classes from 
$NUTCH_HOME/lib/.

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch
>
>
> HtmlUnit is, opposed to other Javascript enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2016-01-07 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2143:
---
Attachment: NUTCH-2143-v3.patch

Ok, with the patch applied the unit test testFetch() fails because the 
generated batch id is not properly returned by generate(...). Patch v3 fixes 
this.

> GeneratorJob ignores batch id passed as argument
> 
>
> Key: NUTCH-2143
> URL: https://issues.apache.org/jira/browse/NUTCH-2143
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 2.3.1
>
> Attachments: NUTCH-2143-v2.patch, NUTCH-2143-v3.patch, patch
>
>
> The batch id passed to GeneratorJob by option/argument -batchId <batchId> is 
> ignored and a generated batch id is used to mark the current batch. Log 
> snippets from a run of bin/crawl:
> {noformat}
> bin/nutch generate ... -batchId 1444941073-14208
> ...
> GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
> Fetching : 
> bin/nutch fetch ... 1444941073-14208 ...
> ...
> QueueFeeder finished: total 0 records. Hit by time limit :0
> {noformat}
> The generated URLs are marked with the wrong batch id:
> {noformat}
> hbase(main):010:0> scan 'test_webpage'
> ROWCOLUMN+CELL
>  org.apache.nutch:http/column=f:bid, timestamp=1444941077080, 
> value=1444941074-858443668
>  ...
>  org.apache.nutch:http/column=mk:_gnmrk_, timestamp=1444941077080, 
> value=1444941074-858443668
> {noformat}
> and fetcher will not fetch anything. This problem was reported by Sherban 
> Drulea 
> [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html]], 
> [[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2168) Parse-tika fails to retrieve parser

2016-01-09 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2168?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2168.

Resolution: Fixed

Committed to 2.x, r1723851. Opened NUTCH-2198 to track the problem when 
indexing the raw binary content using the plugin index-html. Thanks, [~lewismc] 
and [~kalanya], for the review!

> Parse-tika fails to retrieve parser
> ---
>
> Key: NUTCH-2168
> URL: https://issues.apache.org/jira/browse/NUTCH-2168
> Project: Nutch
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2168.patch
>
>
> The plugin parse-tika fails to parse most (all?) kinds of document types 
> (PDF, xlsx, ...) when run via ParserChecker or ParserJob:
> {noformat}
> 2015-11-12 19:14:30,903 INFO  parse.ParserJob - Parsing 
> http://localhost/pdftest.pdf
> 2015-11-12 19:14:30,905 INFO  parse.ParserFactory - ...
> 2015-11-12 19:14:30,907 ERROR tika.TikaParser - Can't retrieve Tika parser 
> for mime-type application/pdf
> 2015-11-12 19:14:30,913 WARN  parse.ParseUtil - Unable to successfully parse 
> content http://localhost/pdftest.pdf of type application/pdf
> {noformat}
> The same document is successfully parsed by TestPdfParser.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception

2016-01-09 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2198:
--

 Summary: Indexing binary content by index-html causes Solr 
Exception
 Key: NUTCH-2198
 URL: https://issues.apache.org/jira/browse/NUTCH-2198
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.3.1
Reporter: Sebastian Nagel
 Fix For: 2.4


(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an 
exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
#137317, byte #139263)
{noformat}

The index-html plugin tries to treat any raw content as readable content 
converting it to a String based on the platform-dependent charset (cf. [Scanner 
API docus|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z"); // to read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}

The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}
<fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
<field name="rawcontent" type="string" stored="true" indexed="true"/>
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception

2016-01-09 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15090625#comment-15090625
 ] 

Sebastian Nagel commented on NUTCH-2198:


Tried to reproduce the Solr exception by indexing one of the JPEGs shown in 
the log snippet (ciencia11.jpg).
* the Solr exception is not caused by this image (or Solr 4.10.4 is safe)
* however, the indexed rawcontent is modified. E.g., the 4 leading bytes are 
stripped:
{noformat}
% od -tcx1 ciencia11.jpg | head -2
000 377 330 377 341  \v   /   E   x   i   f  \0  \0   M   M  \0   *
 ff  d8  ff  e1  0b  2f  45  78  69  66  00  00  4d  4d  00  2a
{noformat}
vs.
{noformat}
% curl -s 
'http://localhost:8983/solr/collection1/select?q=url%3A%22http%3A%2F%2Flocalhost%2Fnutch%2Ftest%2Fciencia11.jpg%22=json=true'
{
  "responseHeader":{
"status":0,
"QTime":0,
"params":{
  "q":"url:\"http://localhost/nutch/test/ciencia11.jpg\";,
  "indent":"true",
  "wt":"json"}},
  "response":{"numFound":1,"start":0,"docs":[
  {
"tstamp":"1970-01-01T00:00:00Z",
"rawcontent":"#11;/Exif#0;#0;MM#0;*#0;#0;#0;#8;#0; ...
{noformat}

We need a different mechanism to index HTML or binary content -- as a binary 
field, converted to Base64, etc. Forcing a String conversion with a 
platform-dependent charset and then stripping some (but not all!) binary 
characters away is surely no proper solution.
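A minimal sketch of the Base64 variant (java.util.Base64 needs Java 8; the 
{{getRawBytes()}} accessor is a placeholder for however the raw content is 
obtained):
{code:java}
import java.util.Base64;

// Encode the raw bytes instead of forcing a platform-charset String:
byte[] raw = getRawBytes();                              // placeholder accessor
String encoded = Base64.getEncoder().encodeToString(raw);
doc.add("rawcontent", encoded);                          // plain ASCII, Solr-safe
{code}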

> Indexing binary content by index-html causes Solr Exception
> ---
>
> Key: NUTCH-2198
> URL: https://issues.apache.org/jira/browse/NUTCH-2198
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.4
>
>
> (reported by [~kalanya] in NUTCH-2168)
> If raw binary is indexed using the plugin index-html this may cause an 
> exception in Solr:
> {noformat}
> 2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
> http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
> 2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
> http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
> 2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
> java.lang.Exception: 
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
> class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
> #137317, byte #139263)
> {noformat}
> The index-html plugin tries to treat any raw content as readable content 
> converting it to a String based on the platform-dependent charset (cf. 
> [Scanner API 
> docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
> {code:title=HtmlIndexingFilter.java}
> Scanner scanner = new Scanner(arrayInputStream);
> scanner.useDelimiter("\\Z"); // to read all scanner content in one String
> String data = "";
> if (scanner.hasNext()) {
> data = scanner.next();
> }
> doc.add("rawcontent", StringUtil.cleanField(data));
> {code}
> The field "rawcontent" is of type "string":
> {code:xml|title=conf/schema.xml}
> <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
> <field name="rawcontent" type="string" stored="true" indexed="true"/>
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2198) Indexing binary content by index-html causes Solr Exception

2016-01-09 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2198:
---
Description: 
(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an 
exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
#137317, byte #139263)
{noformat}

The index-html plugin tries to treat any raw content as readable content 
converting it to a String based on the platform-dependent charset (cf. [Scanner 
API docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z"); // read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
  data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}

The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}


{code}

  was:
(reported by [~kalanya] in NUTCH-2168)
If raw binary is indexed using the plugin index-html this may cause an 
exception in Solr:
{noformat}
2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
java.lang.Exception: 
org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
#137317, byte #139263)
{noformat}

The index-html plugin tries to treat any raw content as readable content 
converting it to a String based on the platform-dependent charset (cf. [Scanner 
API docus|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
{code:title=HtmlIndexingFilter.java}
Scanner scanner = new Scanner(arrayInputStream);
scanner.useDelimiter("\\Z"); // read all scanner content in one String
String data = "";
if (scanner.hasNext()) {
  data = scanner.next();
}
doc.add("rawcontent", StringUtil.cleanField(data));
{code}

The field "rawcontent" is of type "string":
{code:xml|title=conf/schema.xml}


{code}


> Indexing binary content by index-html causes Solr Exception
> ---
>
> Key: NUTCH-2198
> URL: https://issues.apache.org/jira/browse/NUTCH-2198
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
> Fix For: 2.4
>
>
> (reported by [~kalanya] in NUTCH-2168)
> If raw binary is indexed using the plugin index-html this may cause an 
> exception in Solr:
> {noformat}
> 2016-01-05 12:28:00,152 INFO html.HtmlIndexingFilter - Html indexing for: 
> http://ujiapps.uji.es/com/investigacio/img/ciencia11.jpg
> 2016-01-05 12:28:00,163 INFO html.HtmlIndexingFilter - Html indexing for: 
> http://ujiapps.uji.es/serveis/cd/bib/reservori/2015/e-llibres/
> 2016-01-05 12:28:00,164 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,531 INFO solr.SolrIndexWriter - Adding 250 documents
> 2016-01-05 12:28:00,842 WARN mapred.LocalJobRunner - job_local1207147570_0001
> java.lang.Exception: 
> org.apache.solr.client.solrj.impl.HttpSolrServer$RemoteSolrException: [was 
> class java.io.CharConversionException] Invalid UTF-8 character 0xfffe at char 
> #137317, byte #139263)
> {noformat}
> The index-html plugin tries to treat any raw content as readable content 
> converting it to a String based on the platform-dependent charset (cf. 
> [Scanner API 
> docs|http://docs.oracle.com/javase/7/docs/api/java/util/Scanner.html]):
> {code:title=HtmlIndexingFilter.java}
> Scanner scanner = new Scanner(arrayInputStream);
> 

[jira] [Resolved] (NUTCH-2169) Integrate index-html into Nutch build

2016-01-08 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2169?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2169.

Resolution: Fixed
  Assignee: Sebastian Nagel

Committed to 2.x, r1723794.

> Integrate index-html into Nutch build
> -
>
> Key: NUTCH-2169
> URL: https://issues.apache.org/jira/browse/NUTCH-2169
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
> Fix For: 2.3.1
>
> Attachments: NUTCH-2169.patch
>
>
> The plugin index-html (added by NUTCH-1944) is loosely integrated:
> - code is in Nutch version control
> - no build (compile, javadoc generation)
> - src/plugin/index-html/src/java/org/apache/nutch/indexer/html/package.html 
> contains a description how to do the integration
> Well, the plugin should be available just by adding it to plugin.includes 
> without any extra effort.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2143) GeneratorJob ignores batch id passed as argument

2016-01-07 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2143.

Resolution: Fixed

Committed to 2.x, r1723626. Thanks!

> GeneratorJob ignores batch id passed as argument
> 
>
> Key: NUTCH-2143
> URL: https://issues.apache.org/jira/browse/NUTCH-2143
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 2.3.1
>Reporter: Sebastian Nagel
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 2.3.1
>
> Attachments: NUTCH-2143-v2.patch, NUTCH-2143-v3.patch, patch
>
>
> The batch id passed to GeneratorJob by option/argument -batchId  is 
> ignored and a generated batch id is used to mark the current batch. Log 
> snippets from a run of bin/crawl:
> {noformat}
> bin/nutch generate ... -batchId 1444941073-14208
> ...
> GeneratorJob: generated batch id: 1444941074-858443668 containing 1 URLs
> Fetching : 
> bin/nutch fetch ... 1444941073-14208 ...
> ...
> QueueFeeder finished: total 0 records. Hit by time limit :0
> {noformat}
> The generated URLs are marked with the wrong batch id:
> {noformat}
> hbase(main):010:0> scan 'test_webpage'
> ROW                     COLUMN+CELL
>  org.apache.nutch:http/  column=f:bid, timestamp=1444941077080, 
> value=1444941074-858443668
>  ...
>  org.apache.nutch:http/  column=mk:_gnmrk_, timestamp=1444941077080, 
> value=1444941074-858443668
> {noformat}
> and fetcher will not fetch anything. This problem was reported by Sherban 
> Drulea 
> [[1|https://www.mail-archive.com/user@nutch.apache.org/msg13894.html]], 
> [[2|https://www.mail-archive.com/user@nutch.apache.org/msg13912.html]].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068723#comment-15068723
 ] 

Sebastian Nagel commented on NUTCH-2189:


+1 makes the urlfilter-domain more robust, patch looks good

But shouldn't the same be done for urlfilter-domainblacklist? Maybe the patch 
even applies?


> Domain filter must deactivate if no rules are present
> -
>
> Key: NUTCH-2189
> URL: https://issues.apache.org/jira/browse/NUTCH-2189
> Project: Nutch
>  Issue Type: Bug
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-2189.patch
>
>
> We just erased an entire CrawlDB by accident due to a misconfiguration and 
> the nice fact that the domain filter deletes everything if it has no rules. 
> This issue will deactivate the filter if no rules are present, because it 
> makes no sense to configure it without any rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2065) Domain URL filter to support protocols

2015-12-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2065?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068757#comment-15068757
 ] 

Sebastian Nagel commented on NUTCH-2065:


* in general: wouldn't a URL normalizer be preferable? If URLs of one protocol 
are suppressed, links may get lost: some documents of a site which mostly uses 
https may be referenced only from a few http pages (see the sketch after this 
list).
* before, the domain url filter was agnostic regarding the protocol: shouldn't 
this behaviour be kept in all cases, i.e., also for ftp? Almost everything now 
is http or https, but maybe we should keep the interpretation of "no protocol 
specified" -> "any protocol allowed".
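
A minimal sketch of the normalizer idea from the first point -- the host and 
the rule are made up for illustration:
{code:title=protocol normalization sketch}
// Rewrite http to https for a host known to serve both, instead of
// filtering one protocol variant away (which would drop its outlinks):
String url = "http://www.example.org/page.html";
String normalized = url.replaceFirst("^http://www\\.example\\.org/",
    "https://www.example.org/");
{code}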

> Domain URL filter to support protocols
> --
>
> Key: NUTCH-2065
> URL: https://issues.apache.org/jira/browse/NUTCH-2065
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.10
>Reporter: Markus Jelsma
> Attachments: NUTCH-2065.patch, NUTCH-2065.patch
>
>
> The filter allows all protocols for all whitelisted domains, hosts or 
> suffixes but it usually makes little sense to index both http and https URL's 
> of the same domain. This is not unlike the host URL filter, which prevents 
> indexing of duplicate hosts e.g. apache.org and www.apache.org.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2189) Domain filter must deactivate if no rules are present

2015-12-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2189?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15071661#comment-15071661
 ] 

Sebastian Nagel commented on NUTCH-2189:


Yes, you're right!

> Domain filter must deactivate if no rules are present
> -
>
> Key: NUTCH-2189
> URL: https://issues.apache.org/jira/browse/NUTCH-2189
> Project: Nutch
>  Issue Type: Bug
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Attachments: NUTCH-2189.patch
>
>
> We just erased an entire CrawlDB by accident due to a misconfiguration and 
> the nice fact that the domain filter deletes everything if it has no rules. 
> This issue will deactivate the filter if no rules are present, because it 
> makes no sense to configure it without any rules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023131#comment-15023131
 ] 

Sebastian Nagel edited comment on NUTCH-2158 at 11/26/15 7:28 AM:
--

Patch to adjust tests of protocol-http:
- accept text/html as MIME type
- also change jsp documents so that they do not look like .xhtml


was (Author: wastl-nagel):
Patch to adjust tests of protocol-http:
- accept text/html as MIME type
- also change jsp documents so that XHTML they do not look like .xhml

> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158-test-protocol-http.patch, NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-2175) Typos in property descriptions in nutch-default.xml

2015-11-24 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2175:
--

Assignee: Sebastian Nagel

> Typos in property descriptions in nutch-default.xml
> ---
>
> Key: NUTCH-2175
> URL: https://issues.apache.org/jira/browse/NUTCH-2175
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Roannel Fernández Hernández
>Assignee: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.11
>
> Attachments: NUTCH-2175.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-11-30 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032108#comment-15032108
 ] 

Sebastian Nagel commented on NUTCH-2177:


Rely on {{mapred.job.tracker}}, cf. 
[[1|http://hortonworks.com/blog/running-existing-applications-on-hadoop-2-yarn/]]?

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2177) Generator produces only one partition even in distributed mode

2015-12-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15033595#comment-15033595
 ] 

Sebastian Nagel commented on NUTCH-2177:


Yes, of course, I was just unable to copy-paste the right property! 

> Generator produces only one partition even in distributed mode
> --
>
> Key: NUTCH-2177
> URL: https://issues.apache.org/jira/browse/NUTCH-2177
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Reporter: Julien Nioche
>Priority: Blocker
> Fix For: 1.11
>
> Attachments: NUTCH-2177.patch
>
>
> See 
> [https://github.com/apache/nutch/blob/trunk/src/java/org/apache/nutch/crawl/Generator.java#L542]
> 'mapred.job.tracker' is deprecated and has been replaced by 
> 'mapreduce.jobtracker.address', however when running Nutch on EMR 
> mapreduce.jobtracker.address has local as a value. As a result we generate a 
> single partition i.e. have a single map fetching later on (which defeats the 
> object of having a distributed crawler).
> We should probably detect whether we are running on YARN instead, see 
> [http://stackoverflow.com/questions/29680155/why-there-is-a-mapreduce-jobtracker-address-configuration-on-yarn]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-26 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2158.

Resolution: Fixed

Thanks! Committed to trunk, r1716573.

> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158-test-protocol-http.patch, NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-23 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15023123#comment-15023123
 ] 

Sebastian Nagel commented on NUTCH-2158:


We need to pass the rendered HTML, returned by the server (Jetty) for the jsp 
page, to Tika. Done by adding a sleep to the unit test so that the document 
can be fetched manually:
{noformat}
% wget -O basic-http.jsp.html -d http://127.0.0.1:47504/basic-http.jsp
HTTP/1.1 200 OK
Content-Type: text/html; charset=utf-8
...
Server: Jetty(6.1.26)
...
% cat basic-http.jsp.html 
<html xmlns="http://www.w3.org/1999/xhtml">
  <head>
    <base href="http://127.0.0.1:47504/"/>
    <title>HelloWorld</title>
    [... further head elements stripped by the mail archive ...]
  </head>
  <body>
    Hello World!!! 
  </body>
</html>
% java -jar tika-app-1.10.jar -d basic-http.jsp.html 
application/xhtml+xml
% java -jar tika-app-1.11.jar -d basic-http.jsp.html 
text/html
{noformat}

It's definitely a change in Tika, probably by TIKA-1771, which lowers the 
probability of {{application/xhtml+xml}}.

But we can probably live with this changed behavior; it's more an improvement 
than a bug:
- both the HTTP header and the metadata claim {{text/html}}
- the document itself isn't clean XHTML
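
For reference, a programmatic equivalent of the tika-app calls above, assuming 
the Tika facade is on the classpath:
{code:title=Tika detection sketch}
import java.io.File;
import org.apache.tika.Tika;

public class DetectSketch {
  public static void main(String[] args) throws Exception {
    // Same check as "java -jar tika-app-1.11.jar -d basic-http.jsp.html":
    String type = new Tika().detect(new File("basic-http.jsp.html"));
    System.out.println(type);  // text/html with Tika 1.11
  }
}
{code}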

> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2158) Upgrade to Tika 1.11

2015-11-23 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2158?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2158:
---
Attachment: NUTCH-2158-test-protocol-http.patch

Patch to adjust tests of protocol-http:
- accept text/html as MIME type
- also change jsp documents so that they do not look like .xhtml

> Upgrade to Tika 1.11
> 
>
> Key: NUTCH-2158
> URL: https://issues.apache.org/jira/browse/NUTCH-2158
> Project: Nutch
>  Issue Type: Task
>  Components: parser
>Reporter: Chris A. Mattmann
>Assignee: Julien Nioche
> Fix For: 1.11
>
> Attachments: NUTCH-2158-test-protocol-http.patch, NUTCH-2158.patch
>
>
> Upgrade parse-tika to 1.11 release for Tika.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2175) Typos in property descriptions in nutch-default.xml

2015-11-24 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2175:
---
Issue Type: Improvement  (was: Bug)

> Typos in property descriptions in nutch-default.xml
> ---
>
> Key: NUTCH-2175
> URL: https://issues.apache.org/jira/browse/NUTCH-2175
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.11
>
> Attachments: NUTCH-2175.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2175) Typos in property descriptions in nutch-default.xml

2015-11-24 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2175.

Resolution: Fixed

And a spell checker detected some more obvious misspellings...
Committed to trunk, r1716177. Thanks, [~roannel]!

> Typos in property descriptions in nutch-default.xml
> ---
>
> Key: NUTCH-2175
> URL: https://issues.apache.org/jira/browse/NUTCH-2175
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.11
>
> Attachments: NUTCH-2175.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2175) Typos in property descriptions in nutch-default.xml

2015-11-24 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2175?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2175:
---
Summary: Typos in property descriptions in nutch-default.xml  (was: 
Misspelling at word "attempts" in description of http.max.delays property.)

> Typos in property descriptions in nutch-default.xml
> ---
>
> Key: NUTCH-2175
> URL: https://issues.apache.org/jira/browse/NUTCH-2175
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.11
>Reporter: Roannel Fernández Hernández
>Priority: Trivial
> Fix For: 1.11
>
> Attachments: NUTCH-2175.patch
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Work started] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-11 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Work on NUTCH-1712 started by Sebastian Nagel.
--
> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7
>Reporter: Tejas Patil
>Assignee: Sebastian Nagel
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-01-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15092924#comment-15092924
 ] 

Sebastian Nagel commented on NUTCH-1712:


The merging is done together with minor improvements 
(https://github.com/apache/nutch/compare/trunk...sebastian-nagel:NUTCH-1712), 
but I still need to adapt the unit test (TestCrawlDbStates.java).
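
A rough sketch of the single-job wiring -- the mapper class names and path 
variables here are hypothetical, not the ones in the branch:
{code:title=MultipleInputs sketch}
// Read existing CrawlDatum objects and plain-text seed urls in one job;
// a single reducer then merges both record streams.
MultipleInputs.addInputPath(job, crawlDbCurrent,
    SequenceFileInputFormat.class, ExistingCrawlDbMapper.class);
MultipleInputs.addInputPath(job, seedDir,
    TextInputFormat.class, SeedUrlMapper.class);
{code}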


> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7
>Reporter: Tejas Patil
>Assignee: Sebastian Nagel
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New junits with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2272) Index checker server to optionally keep client connection open

2016-06-15 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15331434#comment-15331434
 ] 

Sebastian Nagel commented on NUTCH-2272:


Not included in the [1.12 release 
candidate|https://dist.apache.org/repos/dist/dev/nutch/1.12/]: we possibly need 
to change "Fix Version/s" and CHANGES.txt.

> Index checker server to optionally keep client connection open
> --
>
> Key: NUTCH-2272
> URL: https://issues.apache.org/jira/browse/NUTCH-2272
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.12
>
> Attachments: NUTCH-2272.patch
>
>
> As the title says: for easier testing without having to start up the 
> indexchecker JVM every time.
> {code}
> bin/nutch org.apache.nutch.indexer.IndexingFiltersChecker -normalize 
> -followRedirects -keepClientCnxOpen -listen 5000
> {code}
> Just telnet to it and send URLs with a line feed to get output fast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-827) HTTP POST Authentication

2016-06-13 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15328204#comment-15328204
 ] 

Sebastian Nagel commented on NUTCH-827:
---

Hi [~stevegy], would you mind opening a new Jira for this problem? Thanks!

> HTTP POST Authentication
> 
>
> Key: NUTCH-827
> URL: https://issues.apache.org/jira/browse/NUTCH-827
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.1, nutchgora
>Reporter: Jasper van Veghel
>Assignee: Lewis John McGibbney
>Priority: Minor
>  Labels: authentication, memex
> Fix For: 1.10
>
> Attachments: NUTCH-827-trunk-v3.patch, NUTCH-827-trunk.patch, 
> NUTCH-827-trunkv2.patch, http-client-form-authtication.patch, 
> nutch-http-cookies.patch
>
>
> I've created a patch against the trunk which adds support for very 
> rudimentary POST-based authentication support. It takes a link from 
> nutch-site.xml with a site to POST to and its respective parameters 
> (username, password, etc.). It then checks upon every request whether any 
> cookies have been initialized, and if none have, it fetches them from the 
> given link.
> This isn't perfect but Works For Me (TM) as I generally only need to retrieve 
> results from a single domain and so have no cookie overlap (i.e. if the 
> domain cookies expire, all cookies disappear from the HttpClient and I can 
> simply re-fetch them). A natural improvement would be to be able to specify 
> one particular cookie to check the expiration-date against. If anyone is 
> interested in this beside me I'd be glad to put some more effort into making 
> this more universally applicable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2281) Support non-default FileSystem

2016-06-17 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2281:
--

 Summary: Support non-default FileSystem
 Key: NUTCH-2281
 URL: https://issues.apache.org/jira/browse/NUTCH-2281
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.12
Reporter: Sebastian Nagel
 Fix For: 1.13


If a path (input or output) does not belong to the configured default 
FileSystem, various Nutch tools may raise an exception like
{noformat}
  Exception in ... java.lang.IllegalArgumentException: Wrong FS: s3a://..., 
expected: hdfs://...
{noformat}

This is fixed by getting a reference to the FileSystem from the Path object
{noformat}
  FileSystem fs = path.getFileSystem(getConf());
{noformat}
instead of
{noformat}
  FileSystem fs = FileSystem.get(getConf());
{noformat}
A given path (e.g., {{s3a://...}}) may not belong to the default file system 
({{hdfs://}} or {{file://}} in local mode), and simple checks such as 
{{fs.exists(path)}} will then fail. Cf. 
[FileSystem.checkPath(path)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#checkPath(org.apache.hadoop.fs.Path)],
 and 
[FileSystem.get(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(org.apache.hadoop.conf.Configuration)]
 vs. 
[FileSystem.get(URI,conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI,%20org.apache.hadoop.conf.Configuration)]
 which is called by 
[Path.getFileSystem(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/Path.html#getFileSystem%28org.apache.hadoop.conf.Configuration%29].
  
Note that the FileSystem for input and output may be different, e.g., read from 
HDFS and write to S3.
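
A self-contained sketch of the pattern -- the bucket name and path are made up:
{code:title=Path.getFileSystem sketch}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsResolveSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path out = new Path("s3a://my-bucket/crawl/segments");
    // Resolve the FileSystem from the Path so that the scheme (s3a,
    // hdfs, file) decides, not the configured default FileSystem:
    FileSystem fs = out.getFileSystem(conf);
    System.out.println(fs.exists(out));  // no "Wrong FS" exception
  }
}
{code}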



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2281) Support non-default FileSystem

2016-06-21 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15341680#comment-15341680
 ] 

Sebastian Nagel commented on NUTCH-2281:


I tried to fix all tools but haven't tested all of them yet.  Yes, there may be 
some I've overlooked :(.  I didn't fix unit tests, rarely used tools (Benchmark, 
DmozParser) and some main() methods which are intended for debugging or 
explicitly take the file system as argument (ParseData, ParseText).  I'll 
continue testing over the next few days, but help is welcome!

> Support non-default FileSystem
> --
>
> Key: NUTCH-2281
> URL: https://issues.apache.org/jira/browse/NUTCH-2281
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.12
>Reporter: Sebastian Nagel
> Fix For: 1.13
>
>
> If a path (input or output) does not belong to the configured default 
> FileSystem, various Nutch tools may raise an exception like
> {noformat}
>   Exception in ... java.lang.IllegalArgumentException: Wrong FS: s3a://..., 
> expected: hdfs://...
> {noformat}
> This is fixed by getting a reference to the FileSystem from the Path object
> {noformat}
>   FileSystem fs = path.getFileSystem(getConf());
> {noformat}
> instead of
> {noformat}
>   FileSystem fs = FileSystem.get(getConf());
> {noformat}
> A given path (e.g., {{s3a://...}}) may not belong to the default file system 
> ({{hdfs://}} or {{file://}} in local mode), and simple checks such as 
> {{fs.exists(path)}} will then fail. Cf. 
> [FileSystem.checkPath(path)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#checkPath(org.apache.hadoop.fs.Path)],
>  and 
> [FileSystem.get(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(org.apache.hadoop.conf.Configuration)]
>  vs. 
> [FileSystem.get(URI,conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/FileSystem.html#get(java.net.URI,%20org.apache.hadoop.conf.Configuration)]
>  which is called by 
> [Path.getFileSystem(conf)|https://hadoop.apache.org/docs/r2.7.2/api/org/apache/hadoop/fs/Path.html#getFileSystem%28org.apache.hadoop.conf.Configuration%29].
>   
> Note that the FileSystem for input and output may be different, e.g., read 
> from HDFS and write to S3.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2286) CrawlDbReader -stats fetch time and interval

2016-06-23 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2286:
--

 Summary: CrawlDbReader -stats fetch time and interval
 Key: NUTCH-2286
 URL: https://issues.apache.org/jira/browse/NUTCH-2286
 Project: Nutch
  Issue Type: Improvement
  Components: crawldb
Affects Versions: 1.12
Reporter: Sebastian Nagel
Priority: Minor
 Fix For: 1.13


An overview of fetch times and fetch intervals could be useful when configuring 
a crawl.  CrawlDbReader could easily calculate min, max and average and show 
them as part of the statistics job (command-line option {{-stats}}):
{noformat}
% bin/nutch readdb .../crawldb/ -stats
...
TOTAL urls: 544910
shortest fetch interval:   7 days, 00:00:00
avg fetch interval:7 days, 17:43:58
longest fetch interval:   10 days, 12:00:00
earliest fetch time:   Wed May 25 11:42:00 CEST 2016
avg of fetch times:Sun Jun 05 18:11:00 CEST 2016
latest fetch time: Wed Jun 22 10:25:00 CEST 2016
...
{noformat}
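
A sketch of the aggregation -- the iteration over CrawlDb values ({{datums}}) 
is simplified here; {{CrawlDatum.getFetchInterval()}} returns seconds:
{code:title=interval stats sketch}
// Aggregate min/max/average of the fetch interval over all CrawlDb entries:
long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0, count = 0;
for (CrawlDatum datum : datums) {
  long interval = datum.getFetchInterval();  // seconds
  min = Math.min(min, interval);
  max = Math.max(max, interval);
  sum += interval;
  count++;
}
long avg = (count > 0) ? sum / count : 0;    // average interval in seconds
{code}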



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2272) Index checker server to optionally keep client connection open

2016-06-23 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2272:
---
Fix Version/s: (was: 1.12)
   1.13

> Index checker server to optionally keep client connection open
> --
>
> Key: NUTCH-2272
> URL: https://issues.apache.org/jira/browse/NUTCH-2272
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2272.patch
>
>
> As the title says: for easier testing without having to start up the 
> indexchecker JVM every time.
> {code}
> bin/nutch org.apache.nutch.indexer.IndexingFiltersChecker -normalize 
> -followRedirects -keepClientCnxOpen -listen 5000
> {code}
> Just telnet to it and send URLs with a line feed to get output fast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2286) CrawlDbReader -stats to show fetch time and interval

2016-06-23 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2286:
---
Summary: CrawlDbReader -stats to show fetch time and interval  (was: 
CrawlDbReader -stats fetch time and interval)

> CrawlDbReader -stats to show fetch time and interval
> 
>
> Key: NUTCH-2286
> URL: https://issues.apache.org/jira/browse/NUTCH-2286
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb
>Affects Versions: 1.12
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.13
>
>
> An overview of fetch times and fetch intervals could be useful when 
> configuring a crawl.  CrawlDbReader could easily calculate min, max and 
> average and show them as part of the statistics job (command-line option 
> {{-stats}}):
> {noformat}
> % bin/nutch readdb .../crawldb/ -stats
> ...
> TOTAL urls: 544910
> shortest fetch interval:   7 days, 00:00:00
> avg fetch interval:7 days, 17:43:58
> longest fetch interval:   10 days, 12:00:00
> earliest fetch time:   Wed May 25 11:42:00 CEST 2016
> avg of fetch times:Sun Jun 05 18:11:00 CEST 2016
> latest fetch time: Wed Jun 22 10:25:00 CEST 2016
> ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2272) Index checker server to optionally keep client connection open

2016-06-23 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2272?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15346585#comment-15346585
 ] 

Sebastian Nagel commented on NUTCH-2272:


Not included in the released 1.12: removed it from CHANGES.txt and set the 
correct "Fix Version/s".

> Index checker server to optionally keep client connection open
> --
>
> Key: NUTCH-2272
> URL: https://issues.apache.org/jira/browse/NUTCH-2272
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.13
>
> Attachments: NUTCH-2272.patch
>
>
> As the title says: for easier testing without having to start up the 
> indexchecker JVM every time.
> {code}
> bin/nutch org.apache.nutch.indexer.IndexingFiltersChecker -normalize 
> -followRedirects -keepClientCnxOpen -listen 5000
> {code}
> Just telnet to it and send URLs with a line feed to get output fast.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2269) Clean not working after crawl

2016-06-27 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351824#comment-15351824
 ] 

Sebastian Nagel commented on NUTCH-2269:


Thanks for reporting the problems. Afaics, they can be solved by using "clean" 
the right way in combination with the required Solr version:
# "nutch clean" will not run on the linkdb:
#* the command-line help is clear
{noformat}
% bin/nutch clean
Usage: CleaningJob <crawldb> [-noCommit]
{noformat}
#* and also the error message gives a clear hint:
{noformat}
java.lang.Exception: java.lang.ClassCastException: 
org.apache.nutch.crawl.Inlinks cannot be cast to 
org.apache.nutch.crawl.CrawlDatum
...
2016-06-27 22:00:09,628 ERROR indexer.CleaningJob - CleaningJob: 
java.io.IOException: Job failed!
...
2016-06-27 22:00:52,057 ERROR indexer.CleaningJob - Missing crawldb. Usage: 
CleaningJob <crawldb> [-noCommit]
{noformat}
#* unfortunately, both CrawlDb and LinkDb are formally map files which makes it 
difficult to check the right usage in advance.
# I was able to reproduce the error "IllegalStateException: Connection pool 
shut down" when using Nutch 1.12 in combination with Solr 4.10.4. However, 
Nutch 1.12 is built against Solr 5.4.1 which is probably the reason. Are you 
able to reproduce the problem with the correct Solr version?
# The message
{noformat}
WARN output.FileOutputCommitter - Output Path is null in commitJob()
{noformat}
is only a warning and not a problem: indeed, the cleaning job is a map-reduce 
job without output; deletions are sent directly to the Solr server.  It's 
uncommon for a map-reduce job to have no output, but it is not a problem.
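
Returning to point 1: for reference, a correct invocation runs the CleaningJob 
on the crawldb (the path is an example):
{noformat}
% bin/nutch clean crawl/crawldb
{noformat}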

> Clean not working after crawl
> -
>
> Key: NUTCH-2269
> URL: https://issues.apache.org/jira/browse/NUTCH-2269
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: Vagrant, Ubuntu, Java 8, Solr 4.10
>Reporter: Francesco Capponi
>
> I have been having this problem for a while and I had to roll back to using 
> the old solr clean instead of the newer version. 
> Once it correctly inserts/updates every document in Nutch, when it tries to 
> clean, it returns error 255:
> {quote}
> 2016-05-30 10:13:04,992 WARN  output.FileOutputCommitter - Output Path is 
> null in setupJob()
> 2016-05-30 10:13:07,284 INFO  indexer.IndexWriters - Adding 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: content dest: 
> content
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: title dest: 
> title
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: host dest: host
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: segment dest: 
> segment
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: boost dest: 
> boost
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: digest dest: 
> digest
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: tstamp dest: 
> tstamp
> 2016-05-30 10:13:08,133 INFO  solr.SolrIndexWriter - SolrIndexer: deleting 
> 15/15 documents
> 2016-05-30 10:13:08,919 WARN  output.FileOutputCommitter - Output Path is 
> null in cleanupJob()
> 2016-05-30 10:13:08,937 WARN  mapred.LocalJobRunner - job_local662730477_0001
> java.lang.Exception: java.lang.IllegalStateException: Connection pool shut 
> down
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.IllegalStateException: Connection pool shut down
>   at org.apache.http.util.Asserts.check(Asserts.java:34)
>   at 
> org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
>   at 
> org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
>   at 
> org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
>   at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
>   at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>   at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>   at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>   at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
>   at 
> 

[jira] [Issue Comment Deleted] (NUTCH-2269) Clean not working after crawl

2016-06-27 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2269?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2269:
---
Comment: was deleted

(was: The message
{noformat}
WARN output.FileOutputCommitter - Output Path is null in commitJob()
{noformat}
is only a warning and no problem: Indeed, the cleaning job is a map-reduce job 
without output, deletions are sent to the Solr server.  That's uncommon but not 
a problem.)

> Clean not working after crawl
> -
>
> Key: NUTCH-2269
> URL: https://issues.apache.org/jira/browse/NUTCH-2269
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.12
> Environment: Vagrant, Ubuntu, Java 8, Solr 4.10
>Reporter: Francesco Capponi
>
> I have been having this problem for a while and I had to roll back to using 
> the old solr clean instead of the newer version. 
> Once it correctly inserts/updates every document in Nutch, when it tries to 
> clean, it returns error 255:
> {quote}
> 2016-05-30 10:13:04,992 WARN  output.FileOutputCommitter - Output Path is 
> null in setupJob()
> 2016-05-30 10:13:07,284 INFO  indexer.IndexWriters - Adding 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: content dest: 
> content
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: title dest: 
> title
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: host dest: host
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: segment dest: 
> segment
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: boost dest: 
> boost
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: digest dest: 
> digest
> 2016-05-30 10:13:08,114 INFO  solr.SolrMappingReader - source: tstamp dest: 
> tstamp
> 2016-05-30 10:13:08,133 INFO  solr.SolrIndexWriter - SolrIndexer: deleting 
> 15/15 documents
> 2016-05-30 10:13:08,919 WARN  output.FileOutputCommitter - Output Path is 
> null in cleanupJob()
> 2016-05-30 10:13:08,937 WARN  mapred.LocalJobRunner - job_local662730477_0001
> java.lang.Exception: java.lang.IllegalStateException: Connection pool shut 
> down
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.IllegalStateException: Connection pool shut down
>   at org.apache.http.util.Asserts.check(Asserts.java:34)
>   at 
> org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:169)
>   at 
> org.apache.http.pool.AbstractConnPool.lease(AbstractConnPool.java:202)
>   at 
> org.apache.http.impl.conn.PoolingClientConnectionManager.requestConnection(PoolingClientConnectionManager.java:184)
>   at 
> org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:415)
>   at 
> org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
>   at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
>   at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:106)
>   at 
> org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:480)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:241)
>   at 
> org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:230)
>   at 
> org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:150)
>   at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:483)
>   at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:464)
>   at 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:190)
>   at 
> org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:178)
>   at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:115)
>   at 
> org.apache.nutch.indexer.CleaningJob$DeleterReducer.close(CleaningJob.java:120)
>   at org.apache.hadoop.io.IOUtils.cleanup(IOUtils.java:237)
>   at 
> org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:459)
>   at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>   at 
> org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>   at 
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
>   at java.util.concurrent.FutureTask.run(FutureTask.java:262)
>   at 
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>   at 
> 


[jira] [Updated] (NUTCH-1314) Impose a limit on the length of outlink target urls

2016-02-03 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1314?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1314:
---
Fix Version/s: 1.12

> Impose a limit on the length of outlink target urls
> ---
>
> Key: NUTCH-1314
> URL: https://issues.apache.org/jira/browse/NUTCH-1314
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ferdy Galema
> Fix For: 2.4, 1.12
>
> Attachments: NUTCH-1314-trunk.patch, NUTCH-1314-v2.patch, 
> NUTCH-1314-v3.patch, NUTCH-1314.patch
>
>
> In the past we have encountered situations where crawling specific broken 
> sites resulted in ridiculously long urls that caused tasks to stall. The 
> regex plugins (normalizing/filtering) processed single urls for hours, if 
> not hanging indefinitely.
> My suggestion is to limit the outlink url target length as soon as possible. 
> The limit is configurable; the default is 3000. This should be reasonably 
> long for most uses, but sufficiently strict to make sure regex plugins do 
> not choke on urls that are too long. Please see attached patch for the 
> Nutchgora implementation.
> I'd like to hear what you think about this.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157655#comment-15157655
 ] 

Sebastian Nagel commented on NUTCH-2228:


The name of the failing test "testInvalidPatterns" indicates that it is 
intended to check whether syntax errors in regular expressions are properly 
caught. A solution could be to replace {{\h}} by, e.g., {{\s+**}}, which will 
hardly become valid in a future Java version. The error is properly caught and 
reported in the logs:
{noformat}
2016-02-22 21:35:42,067 ERROR replace.FieldReplacer 
(FieldReplacer.java:(97)) - Pattern this\s+**plugin for field 
metatag.description failed to compile: java.util.regex.PatternSyntaxException: 
Dangling meta character '*' near index 7
this\s+**plugin
   ^
{noformat}
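For illustration, the catch-and-log pattern described above, in a hedged 
minimal form (the actual FieldReplacer code may differ):
{code}
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

Pattern pattern = null;
try {
  // invalid on any Java version: dangling meta character '*' at index 7
  pattern = Pattern.compile("this\\s+**plugin");
} catch (PatternSyntaxException e) {
  // log the broken pattern and skip this replacement instead of failing
  System.err.println("Pattern failed to compile: " + e.getMessage());
}
{code}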


> index-replace unit test fails
> -
>
> Key: NUTCH-2228
> URL: https://issues.apache.org/jira/browse/NUTCH-2228
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Priority: Blocker
> Fix For: 1.12
>
>
> {code}
> - Standard Error -
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/test/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> -  ---
> Testcase: testGlobalAndUrlNotMatchesPattern took 1.052 sec
> Testcase: testGlobalReplacement took 0.149 sec
> Testcase: testReplacementsWithFlags took 0.105 sec
> Testcase: testUrlMatchesPattern took 0.116 sec
> Testcase: testReplacementsDifferentTarget took 0.099 sec
> Testcase: testReplacementsRunInSpecifedOrder took 0.1 sec
> Testcase: testInvalidPatterns took 0.078 sec
> FAILED
> expected: but was: ]plugin, I control th...>
> junit.framework.AssertionFailedError: expected: th...> but was:
> at 
> org.apache.nutch.indexer.replace.TestIndexReplace.testInvalidPatterns(TestIndexReplace.java:203)
> Testcase: testGlobalAndUrlMatchesPattern took 0.079 sec
> Testcase: testUrlNotMatchesPattern took 0.06 sec
> Testcase: testPropertyParse took 0.03 sec
> {code}
> Does the initial committer know what the outcome of the test should be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2228:
---
Attachment: NUTCH-2228.patch

> index-replace unit test fails
> -
>
> Key: NUTCH-2228
> URL: https://issues.apache.org/jira/browse/NUTCH-2228
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Priority: Blocker
> Fix For: 1.12
>
> Attachments: NUTCH-2228.patch
>
>
> {code}
> - Standard Error -
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/test/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> -  ---
> Testcase: testGlobalAndUrlNotMatchesPattern took 1.052 sec
> Testcase: testGlobalReplacement took 0.149 sec
> Testcase: testReplacementsWithFlags took 0.105 sec
> Testcase: testUrlMatchesPattern took 0.116 sec
> Testcase: testReplacementsDifferentTarget took 0.099 sec
> Testcase: testReplacementsRunInSpecifedOrder took 0.1 sec
> Testcase: testInvalidPatterns took 0.078 sec
> FAILED
> expected: but was: ]plugin, I control th...>
> junit.framework.AssertionFailedError: expected: th...> but was:
> at 
> org.apache.nutch.indexer.replace.TestIndexReplace.testInvalidPatterns(TestIndexReplace.java:203)
> Testcase: testGlobalAndUrlMatchesPattern took 0.079 sec
> Testcase: testUrlNotMatchesPattern took 0.06 sec
> Testcase: testPropertyParse took 0.03 sec
> {code}
> Does the initial committer know what the outcome of the test should be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157655#comment-15157655
 ] 

Sebastian Nagel edited comment on NUTCH-2228 at 2/22/16 8:38 PM:
-

The name of the failing test "testInvalidPatterns" indicates that it is 
intended to check whether syntax errors in regular expressions are properly 
caught. A solution could be to replace {{\h}} by, e.g., {{\s+**}}, which will 
hardly become valid in a future Java version. The error is properly caught and 
reported in the logs:
{noformat}
2016-02-22 21:35:42,067 ERROR replace.FieldReplacer 
(FieldReplacer.java:(97)) - Pattern this\s+**plugin for field 
metatag.description failed to compile: java.util.regex.PatternSyntaxException: 
Dangling meta character '*' near index 7
this\s+**plugin
   ^
{noformat}



was (Author: wastl-nagel):
The name of the failing test "testInvalidPatterns" indicates that it is 
intended to check whether syntax errors in regular expressions are properly 
caught. A solution could be to replace {\h} by, e.g., {\s+**}, which will 
hardly become valid in a future Java version. The error is properly caught and 
reported in the logs:
{noformat}
2016-02-22 21:35:42,067 ERROR replace.FieldReplacer 
(FieldReplacer.java:(97)) - Pattern this\s+**plugin for field 
metatag.description failed to compile: java.util.regex.PatternSyntaxException: 
Dangling meta character '*' near index 7
this\s+**plugin
   ^
{noformat}


> index-replace unit test fails
> -
>
> Key: NUTCH-2228
> URL: https://issues.apache.org/jira/browse/NUTCH-2228
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Priority: Blocker
> Fix For: 1.12
>
>
> {code}
> - Standard Error -
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/test/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> -  ---
> Testcase: testGlobalAndUrlNotMatchesPattern took 1.052 sec
> Testcase: testGlobalReplacement took 0.149 sec
> Testcase: testReplacementsWithFlags took 0.105 sec
> Testcase: testUrlMatchesPattern took 0.116 sec
> Testcase: testReplacementsDifferentTarget took 0.099 sec
> Testcase: testReplacementsRunInSpecifedOrder took 0.1 sec
> Testcase: testInvalidPatterns took 0.078 sec
> FAILED
> expected: but was: ]plugin, I control th...>
> junit.framework.AssertionFailedError: expected: th...> but was:
> at 
> org.apache.nutch.indexer.replace.TestIndexReplace.testInvalidPatterns(TestIndexReplace.java:203)
> Testcase: testGlobalAndUrlMatchesPattern took 0.079 sec
> Testcase: testUrlNotMatchesPattern took 0.06 sec
> Testcase: testPropertyParse took 0.03 sec
> {code}
> Does the initial committer know what the outcome of the test should be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2228:
---
Patch Info: Patch Available

> index-replace unit test fails
> -
>
> Key: NUTCH-2228
> URL: https://issues.apache.org/jira/browse/NUTCH-2228
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Priority: Blocker
> Fix For: 1.12
>
> Attachments: NUTCH-2228.patch
>
>
> {code}
> - Standard Error -
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/test/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> -  ---
> Testcase: testGlobalAndUrlNotMatchesPattern took 1.052 sec
> Testcase: testGlobalReplacement took 0.149 sec
> Testcase: testReplacementsWithFlags took 0.105 sec
> Testcase: testUrlMatchesPattern took 0.116 sec
> Testcase: testReplacementsDifferentTarget took 0.099 sec
> Testcase: testReplacementsRunInSpecifedOrder took 0.1 sec
> Testcase: testInvalidPatterns took 0.078 sec
> FAILED
> expected: but was: ]plugin, I control th...>
> junit.framework.AssertionFailedError: expected: th...> but was:
> at 
> org.apache.nutch.indexer.replace.TestIndexReplace.testInvalidPatterns(TestIndexReplace.java:203)
> Testcase: testGlobalAndUrlMatchesPattern took 0.079 sec
> Testcase: testUrlNotMatchesPattern took 0.06 sec
> Testcase: testPropertyParse took 0.03 sec
> {code}
> Does the initial committer know what the outcome of the test should be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2228) index-replace unit test fails

2016-02-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157632#comment-15157632
 ] 

Sebastian Nagel commented on NUTCH-2228:


That's only a problem if Nutch is built with Java 8. {{\h}} is invalid in [Java 
7 regex Pattern 
syntax|https://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html#jcc]
 but became a valid character class (horizontal whitespace) in [Java 
8|https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#sum]. 
Should be fixed as part of NUTCH-2171, or earlier.
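A quick way to verify the difference (a hedged sketch; run with JDK 7 vs. JDK 8):
{code}
// Compiles and matches horizontal whitespace on Java 8;
// throws java.util.regex.PatternSyntaxException on Java 7.
boolean ok = java.util.regex.Pattern.compile("a\\hb").matcher("a b").matches();
System.out.println(ok); // prints "true" on Java 8
{code}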


> index-replace unit test fails
> -
>
> Key: NUTCH-2228
> URL: https://issues.apache.org/jira/browse/NUTCH-2228
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Priority: Blocker
> Fix For: 1.12
>
>
> {code}
> - Standard Error -
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/test/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in 
> [jar:file:/home/markus/projects/apache/nutch/trunk/build/lib/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an 
> explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> -  ---
> Testcase: testGlobalAndUrlNotMatchesPattern took 1.052 sec
> Testcase: testGlobalReplacement took 0.149 sec
> Testcase: testReplacementsWithFlags took 0.105 sec
> Testcase: testUrlMatchesPattern took 0.116 sec
> Testcase: testReplacementsDifferentTarget took 0.099 sec
> Testcase: testReplacementsRunInSpecifedOrder took 0.1 sec
> Testcase: testInvalidPatterns took 0.078 sec
> FAILED
> expected: but was: ]plugin, I control th...>
> junit.framework.AssertionFailedError: expected: th...> but was:
> at 
> org.apache.nutch.indexer.replace.TestIndexReplace.testInvalidPatterns(TestIndexReplace.java:203)
> Testcase: testGlobalAndUrlMatchesPattern took 0.079 sec
> Testcase: testUrlNotMatchesPattern took 0.06 sec
> Testcase: testPropertyParse took 0.03 sec
> {code}
> Does the initial committer know what the outcome of the test should be?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2220) Rename db.* options used only by the linkdb to linkdb.*

2016-02-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157831#comment-15157831
 ] 

Sebastian Nagel commented on NUTCH-2220:


0 / +1
Since this breaks existing crawl configurations, a note (section 
"incompatibility") in CHANGES.txt or the release report could be quite useful; 
changing the meaning of the property "db.ignore.internal.links" to nearly the 
opposite may otherwise cause harm.

> Rename db.* options used only by the linkdb to linkdb.*
> ---
>
> Key: NUTCH-2220
> URL: https://issues.apache.org/jira/browse/NUTCH-2220
> Project: Nutch
>  Issue Type: Task
>  Components: linkdb
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2220.patch
>
>
> We need an option db.ignore.internal.links that operates in FetcherThread, 
> just like db.ignore.external.links. It already exists, but it is only used by 
> the LinkDB and defaults to true, which is not a good default for FetcherThread.
> I propose to make a clear distinction between the options that are used by the 
> LinkDB and those that are not. Most options used by the LinkDB already use the 
> right prefix, but db.ignore.*.links, db.max.inlinks and db.max.anchor.length do 
> not yet.
> This patch will rename those options to linkdb.* prefixes so that afterwards we 
> can implement db.ignore.internal.links that operates in FetcherThread, just 
> like db.ignore.external.links.
> This will introduce a change in default parameters. Please comment.
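> For users upgrading, the rename would amount to a mapping like the following 
> (hypothetical final names derived from the description above; the committed 
> patch may differ):
> {noformat}
> db.ignore.internal.links  ->  linkdb.ignore.internal.links
> db.ignore.external.links  ->  linkdb.ignore.external.links
> db.max.inlinks            ->  linkdb.max.inlinks
> db.max.anchor.length      ->  linkdb.max.anchor.length
> {noformat}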



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2221) Introduce db.ignore.internal.links to FetcherThread

2016-02-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2221?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15157816#comment-15157816
 ] 

Sebastian Nagel commented on NUTCH-2221:


+1
Just to consider: the additional argument to 
ParseOutputFormat.filterNormalize(...) may conflict with changes for NUTCH-2144.

> Introduce db.ignore.internal.links to FetcherThread
> ---
>
> Key: NUTCH-2221
> URL: https://issues.apache.org/jira/browse/NUTCH-2221
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Markus Jelsma
> Fix For: 1.12
>
> Attachments: NUTCH-2216-NUTCH-2220-NUTCH-2221.patch, NUTCH-2221.patch
>
>
> FetcherThread has support for db.ignore.external.links. In the config you can 
> find db.ignore.internal.links as well, but it only operates on the LinkDB, which 
> is confusing. This patch will introduce db.ignore.internal.links to 
> FetcherThread, similar to db.ignore.external.links. With both parameters set 
> to true you can limit the crawl to the injected seed list, as shown below.
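> For example, in nutch-site.xml (a hedged sketch of the configuration this 
> change enables):
> {noformat}
> <property>
>   <name>db.ignore.external.links</name>
>   <value>true</value>
> </property>
> <property>
>   <name>db.ignore.internal.links</name>
>   <value>true</value>
> </property>
> {noformat}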



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2216) db.ignore.*.links to optionally follow internal redirects

2016-02-22 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1515#comment-1515
 ] 

Sebastian Nagel commented on NUTCH-2216:


* this was the case before, but shouldn't {{db.ignore.external.links.mode}} 
also apply to redirects? Esp. domain-internal redirects are frequently used 
(e.g., helpdesk.xyz.com -> www.xyz.com/helpdesk/)
* (another possible improvement) the description of 
{{db.ignore.(in|ex)ternal.links}} could be:
bq. "This is an effective way to limit the crawl to include only initially 
injected hosts +or domains+, without creating complex URLFilters. See 
'db.ignore.external.links.mode'."
* "db.ignore.treat.redirects.as.links" / "ignoreTreatRedirectsAsLinks": sounds 
complex when reading it for the first time; maybe "db.ignore.also.redirects" or, 
as an antonym, "db.follow.redirects"?

> db.ignore.*.links to optionally follow internal redirects
> -
>
> Key: NUTCH-2216
> URL: https://issues.apache.org/jira/browse/NUTCH-2216
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.12
>
> Attachments: NUTCH-2216.patch
>
>
> db.ignore.internal.links doesn't follow any internal hyperlinks or redirects. 
> Together with db.ignore.external.links it helps to restrict the crawl to a 
> predefined set of URLs, for example provided by a customer.
> In many cases, a few of those URLs are redirects, which are not followed. 
> This issue adds an option to optionally allow internal redirects despite 
> db.ignore.internal.links being enabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-1712) Use MultipleInputs in Injector to make it a single mapreduce job

2016-02-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1712?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-1712.

   Resolution: Fixed
Fix Version/s: 1.12

Committed to trunk (f5e430e).

> Use MultipleInputs in Injector to make it a single mapreduce job
> 
>
> Key: NUTCH-1712
> URL: https://issues.apache.org/jira/browse/NUTCH-1712
> Project: Nutch
>  Issue Type: Improvement
>  Components: injector
>Affects Versions: 1.7
>Reporter: Tejas Patil
>Assignee: Sebastian Nagel
> Fix For: 1.12
>
> Attachments: NUTCH-1712-trunk.v1.patch
>
>
> Currently Injector creates two mapreduce jobs:
> 1. sort job: get the urls from seeds file, emit CrawlDatum objects.
> 2. merge job: read CrawlDatum objects from both crawldb and output of sort 
> job. Merge and emit final CrawlDatum objects.
> Using MultipleInputs, we can read CrawlDatum objects from crawldb and urls 
> from seeds file simultaneously and perform inject in a single map-reduce job.
> Also, here are additional things covered with this jira:
> 1. Pushed filtering and normalization above metadata extraction so that the 
> unwanted records are ruled out quickly.
> 2. Migrated to new mapreduce API
> 3. Improved documentation 
> 4. New JUnit tests with better coverage
> Relevant discussion over nutch-dev can be found here:
> http://mail-archives.apache.org/mod_mbox/nutch-dev/201401.mbox/%3ccafkhtfyxo6wl7gyuv+a5y1pzntdcoqpz4jz_up_bkp9cje8...@mail.gmail.com%3E
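> A rough sketch of the MultipleInputs wiring (mapper/reducer names and the 
> seed directory variable are illustrative, not necessarily those of the patch):
> {code}
> Job job = Job.getInstance(conf, "inject " + crawlDb);
> // existing CrawlDb entries, stored as a sequence file
> MultipleInputs.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME),
>     SequenceFileInputFormat.class, CrawlDbMapper.class);
> // plain-text seed list
> MultipleInputs.addInputPath(job, urlDir,
>     TextInputFormat.class, UrlMapper.class);
> job.setReducerClass(InjectReducer.class);
> {code}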



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2204) remove junit lib from runtime

2016-01-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2204:
---
Attachment: NUTCH-2204.patch

> remove junit lib from runtime
> -
>
> Key: NUTCH-2204
> URL: https://issues.apache.org/jira/browse/NUTCH-2204
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.12
>
> Attachments: NUTCH-2204.patch
>
>
> The junit library is shipped in the Nutch bin package as an unnecessary 
> dependency (apache-nutch-1.11/lib/junit-3.8.1.jar). Unit tests use a 
> different library version:
> {noformat}
> % ls build/lib/junit* build/test/lib/junit*
> build/lib/junit-3.8.1.jar  build/test/lib/junit-4.11.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2204) remove junit lib from runtime

2016-01-22 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2204:
--

 Summary: remove junit lib from runtime
 Key: NUTCH-2204
 URL: https://issues.apache.org/jira/browse/NUTCH-2204
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.11
Reporter: Sebastian Nagel
Priority: Trivial
 Fix For: 1.12


The junit library is shipped in the Nutch bin package as an unnecessary 
dependency (apache-nutch-1.11/lib/junit-3.8.1.jar). Unit tests use a different 
library version:
{noformat}
% ls build/lib/junit* build/test/lib/junit*
build/lib/junit-3.8.1.jar  build/test/lib/junit-4.11.jar
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2204) Remove junit lib from runtime

2016-01-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2204:
---
Summary: Remove junit lib from runtime  (was: remove junit lib from runtime)

> Remove junit lib from runtime
> -
>
> Key: NUTCH-2204
> URL: https://issues.apache.org/jira/browse/NUTCH-2204
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.12
>
> Attachments: NUTCH-2204.patch
>
>
> The junit library is shipped in the Nutch bin package as an unnecessary 
> dependency (apache-nutch-1.11/lib/junit-3.8.1.jar). Unit tests use a 
> different library version:
> {noformat}
> % ls build/lib/junit* build/test/lib/junit*
> build/lib/junit-3.8.1.jar  build/test/lib/junit-4.11.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2204) remove junit lib from runtime

2016-01-22 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2204?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2204.

Resolution: Fixed

Committed to trunk, r1726318.

> remove junit lib from runtime
> -
>
> Key: NUTCH-2204
> URL: https://issues.apache.org/jira/browse/NUTCH-2204
> Project: Nutch
>  Issue Type: Improvement
>  Components: build
>Affects Versions: 1.11
>Reporter: Sebastian Nagel
>Priority: Trivial
> Fix For: 1.12
>
> Attachments: NUTCH-2204.patch
>
>
> The junit library is shipped in the Nutch bin package as an unnecessary 
> dependency (apache-nutch-1.11/lib/junit-3.8.1.jar). Unit tests use a 
> different library version:
> {noformat}
> % ls build/lib/junit* build/test/lib/junit*
> build/lib/junit-3.8.1.jar  build/test/lib/junit-4.11.jar
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2144) Plugin to override db.ignore.external to exempt interesting external domain URLs

2016-02-14 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15146685#comment-15146685
 ] 

Sebastian Nagel commented on NUTCH-2144:


Hi [~thammegowda],
thanks! Everything looks good with the changes. It's definitely a good idea to 
reuse the code from urlfilter-regex, and users will appreciate it if 
rules/regexes work the same way. The ant build files are ok, afaics, but I'll 
try to test the plugin tomorrow.

Two points I would like to bring up for discussion now, since this plugin will 
introduce a new interface, and interfaces aren't easily changed later:
# currently the filter(...) method takes fromUrl and toUrl as arguments. The 
interface could be more powerful and adaptable to further use cases if we add
## the tag name where the link comes from ("a", "img", "form", etc.). Currently 
the tag name is not available in ParseOutputFormat; we would have to pass it 
via Outlink from the parser, where tag names are already used to filter links, 
cf. property "parser.html.outlinks.ignore_tags". Tag names would be the easier 
way to distinguish between page resources and real outlinks.
## similarly, whether it's a link or a redirect: this could be used to follow 
redirects when a site has moved to a different host and is now redirected, 
while still ignoring external outlinks
# the naming could be more explicit: neither "URLExemptionFilter" nor 
"urlfilter-ignoreexempt" makes clear that it's about an exemption from the 
"db.ignore.external.links" property. Only the config file 
"conf/db-ignore-external-exemptions.txt" is sufficiently precise. To avoid 
overlong names (e.g., "IgnoreExternalLinksExemptionUrlFilter"), maybe resolve 
the double negation to something like "AcceptExternalUrlFilter" or 
"urlfilter-externallink".

As said, both points are just for discussion, or for later improvements; a 
rough sketch of the extended interface follows below.
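For reference, the extended interface could look roughly like this (a sketch 
of the discussion points above, not the committed interface):
{code}
public interface URLExemptionFilter {

  /**
   * @param fromUrl  URL of the page the link was found on
   * @param toUrl    target URL of the outlink
   * @param tagName  HTML tag the link comes from ("a", "img", "form", ...)
   * @param redirect true if this is a protocol-level redirect, not a link
   * @return true if toUrl is exempted from db.ignore.external.links
   */
  boolean filter(String fromUrl, String toUrl, String tagName,
      boolean redirect);
}
{code}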

> Plugin to override db.ignore.external to exempt interesting external domain 
> URLs
> 
>
> Key: NUTCH-2144
> URL: https://issues.apache.org/jira/browse/NUTCH-2144
> Project: Nutch
>  Issue Type: New Feature
>  Components: crawldb, fetcher
>Reporter: Thamme Gowda N
>Assignee: Chris A. Mattmann
>Priority: Minor
> Fix For: 1.12
>
> Attachments: ignore-exempt.patch, ignore-exempt.patch
>
>
> Create a rule based urlfilter plugin that allows focused crawler 
> (db.ignore.external.links=true) to fetch static resources from external 
> domains.
> The generalized version of this: This plugin should permit interesting URLs 
> from external domains (by overriding db.ignore.external). The interesting 
> urls are decided from a combination of regex and mime-type rules.
> Concrete use case:
>   When using Nutch to crawl images from a set of domains, the crawler needs 
> to fetch all images which may be linked from CDNs and other domains. In this 
> scenario, allowing all external links and then writing hundreds of regular 
> expressions is not feasible for large number of domains.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2060) dedup is removing entries with status db_gone

2016-03-01 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15174628#comment-15174628
 ] 

Sebastian Nagel commented on NUTCH-2060:


Afaics from the mentioned thread on the user mailing list: the problem is 
caused by running the dedup job with db.update.purge.404 == true. [~rupam_01], 
could you confirm that this is also the reason for the problem you observed? 
The quick fix would be to disable the 404 purging.
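For example, in nutch-site.xml:
{noformat}
<property>
  <name>db.update.purge.404</name>
  <value>false</value>
</property>
{noformat}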

> dedup is removing entries with status db_gone
> -
>
> Key: NUTCH-2060
> URL: https://issues.apache.org/jira/browse/NUTCH-2060
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.9
>Reporter: Steven Hayles
>Priority: Minor
>
> Using the standard bin/crawl script, Solr is never informed when a previously 
> indexed document has been deleted.
> "bin/nutch update" sets db_gone status in the crawl db for requests returning 
> HTTP 404 status.
> "bin/nutch dedup" remove entries with status db_gone from the crawl db .
> As a result "bin/nutch clean" never sees the db_gone status, so does not 
> inform Solr.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2242) lastModified not always set

2016-03-24 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15210136#comment-15210136
 ] 

Sebastian Nagel commented on NUTCH-2242:


Hi Jurian, thanks for reporting this problem. This is part of a problem 
reported in November last year, see NUTCH-2164 and the referenced thread on the 
user mailing list.
Thanks for the patch. Just 3 days ago I prepared one for NUTCH-2164 on 
[github|https://github.com/apache/nutch/compare/master...sebastian-nagel:NUTCH-2164].
 I'll merge both and resolve both issues within the next few days. One question: 
why should the modified time also be set in CrawlDbReducer.reduce()? Shouldn't 
it be sufficient to do this once in the FetchSchedule implementation, which is 
customizable and pluggable? A sketch follows below.
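A hedged sketch of what that single place could look like in a FetchSchedule 
implementation (illustrative, not the NUTCH-2164 patch; assumes a class 
extending AbstractFetchSchedule, with org.apache.hadoop.io.Text imported):
{code}
@Override
public CrawlDatum setFetchSchedule(Text url, CrawlDatum datum,
    long prevFetchTime, long prevModifiedTime,
    long fetchTime, long modifiedTime, int state) {
  datum = super.setFetchSchedule(url, datum, prevFetchTime,
      prevModifiedTime, fetchTime, modifiedTime, state);
  // record the modified time on the first successful fetch as well
  if (datum.getModifiedTime() <= 0 && modifiedTime > 0) {
    datum.setModifiedTime(modifiedTime);
  }
  return datum;
}
{code}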

> lastModified not always set
> ---
>
> Key: NUTCH-2242
> URL: https://issues.apache.org/jira/browse/NUTCH-2242
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Jurian Broertjes
>Priority: Minor
> Attachments: NUTCH-2242.patch
>
>
> I observed two issues:
> - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not 
> updated on the first successful fetch. 
> - When a document modification is detected (protocol- or signature-wise), the 
> modifiedTime isn't updated
> I can provide a patch later today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug

2016-03-03 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178587#comment-15178587
 ] 

Sebastian Nagel commented on NUTCH-2237:


Good idea! Nice patch, including unit tests. A few comments for possible 
improvements:
* maybe URLUtil.java would be the better place for the slug functions, next to 
chooseRepr(...) which provides similar functionality
* URLs are now always decoded, even if the decision which URL/document to keep 
is made solely by comparison of score or fetch time. Since decoding URLs isn't 
a cheap computation,
*# it should be done lazily, and
*# the result could be cached for later comparisons if there are more than 2 
duplicates. This would be an improvement over the current state, but should be 
done for both the decoded URL string and the slug length.
* Is it safe to first decode the URL string and then parse the resulting string 
as a URL? After decoding there may be forbidden or reserved characters, so that 
the URL path and query fail to get properly parsed.
* no branch of this if clause is reachable, given that compareUrlSlug(...) 
returns -1, 0, or 1 (see the corrected sketch after the snippet):
{code}
if (compareUrlSlug(urlExisting, urlnewDoc) > 1) {
  // mark new one as duplicate
  ...
} else if (compareUrlSlug(urlnewDoc, urlExisting) > 1) {
{code}
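Presumably the intended comparison is against 0; a corrected sketch that also 
caches the (expensive) comparison result instead of computing it twice:
{code}
int cmp = compareUrlSlug(urlExisting, urlnewDoc);
if (cmp > 0) {
  // mark the new document as duplicate
} else if (cmp < 0) {
  // mark the existing document as duplicate
}
{code}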


> DeduplicationJob: Add extra order criteria based on slug
> 
>
> Key: NUTCH-2237
> URL: https://issues.apache.org/jira/browse/NUTCH-2237
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ron van der Vegt
> Fix For: 1.12
>
> Attachments: NUTCH-2237.patch
>
>
> Currently the user can elect the main document, when signatures are the same, 
> based on score, URL length and fetch time. The quality of the slug, based mainly 
> on the number of meaningful characters, could give users more flexibility to 
> distinguish between slugified URLs and URLs based on a page id.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2237) DeduplicationJob: Add extra order criteria based on slug

2016-03-03 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2237:
---
Fix Version/s: 1.12

> DeduplicationJob: Add extra order criteria based on slug
> 
>
> Key: NUTCH-2237
> URL: https://issues.apache.org/jira/browse/NUTCH-2237
> Project: Nutch
>  Issue Type: Improvement
>Reporter: Ron van der Vegt
> Fix For: 1.12
>
> Attachments: NUTCH-2237.patch
>
>
> Currently the user can elect the main document, when signatures are the same, 
> based on score, URL length and fetch time. The quality of the slug, based mainly 
> on the number of meaningful characters, could give users more flexibility to 
> distinguish between slugified URLs and URLs based on a page id.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2256:
--

Assignee: Sebastian Nagel

> Inconsistent log level practice
> ---
>
> Key: NUTCH-2256
> URL: https://issues.apache.org/jira/browse/NUTCH-2256
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.2, 2.3, 1.11, 2.3.1
>Reporter: songwanging
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.12, 2.3.2
>
>
> In method "run()" of class: apache-nutch 
> 2.3.1\src\java\org\apache\nutch\fetcher\FetcherReducer.java
> The log level is not correct: after checking "LOG.isDebugEnabled()" we 
> should use "LOG.debug(msg, e);", but currently "LOG.info(msg, e);" is used. In 
> this case the log level is inconsistent, and developers may lose debug 
> messages because of this.
> The related source code is as follows:
>  if (LOG.isDebugEnabled()) {
>   LOG.info("Crawl delay for queue: " + fit.queueID
>   + " is set to " + fiq.crawlDelay
>   + " as per robots.txt. url: " + fit.url);
> }
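> A straightforward fix is to log at debug level inside the guard (a sketch of 
> exactly what the description asks for):
> {code}
>  if (LOG.isDebugEnabled()) {
>    LOG.debug("Crawl delay for queue: " + fit.queueID
>        + " is set to " + fiq.crawlDelay
>        + " as per robots.txt. url: " + fit.url);
>  }
> {code}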



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264274#comment-15264274
 ] 

Sebastian Nagel commented on NUTCH-2256:


Good catch, will fix right now. Thanks, [~songwang]!

> Inconsistent log level practice
> ---
>
> Key: NUTCH-2256
> URL: https://issues.apache.org/jira/browse/NUTCH-2256
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.2, 2.3, 1.11, 2.3.1
>Reporter: songwanging
>Priority: Minor
> Fix For: 2.4, 1.12, 2.3.2
>
>
> In method "run()" of class: apache-nutch 
> 2.3.1\src\java\org\apache\nutch\fetcher\FetcherReducer.java
> The log level is not correct: after checking "LOG.isDebugEnabled()" we 
> should use "LOG.debug(msg, e);", but currently "LOG.info(msg, e);" is used. In 
> this case the log level is inconsistent, and developers may lose debug 
> messages because of this.
> The related source code is as follows:
>  if (LOG.isDebugEnabled()) {
>   LOG.info("Crawl delay for queue: " + fit.queueID
>   + " is set to " + fiq.crawlDelay
>   + " as per robots.txt. url: " + fit.url);
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2256:
---
Fix Version/s: 2.3.2
   1.12
   2.4

> Inconsistent log level practice
> ---
>
> Key: NUTCH-2256
> URL: https://issues.apache.org/jira/browse/NUTCH-2256
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.2, 2.3, 1.11, 2.3.1
>Reporter: songwanging
>Priority: Minor
> Fix For: 2.4, 1.12, 2.3.2
>
>
> In method "run()" of class: apache-nutch 
> 2.3.1\src\java\org\apache\nutch\fetcher\FetcherReducer.java
> The log level is not correct: after checking "LOG.isDebugEnabled()" we 
> should use "LOG.debug(msg, e);", but currently "LOG.info(msg, e);" is used. In 
> this case the log level is inconsistent, and developers may lose debug 
> messages because of this.
> The related source code is as follows:
>  if (LOG.isDebugEnabled()) {
>   LOG.info("Crawl delay for queue: " + fit.queueID
>   + " is set to " + fiq.crawlDelay
>   + " as per robots.txt. url: " + fit.url);
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2256:
---
Affects Version/s: 1.11

> Inconsistent log level practice
> ---
>
> Key: NUTCH-2256
> URL: https://issues.apache.org/jira/browse/NUTCH-2256
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.2, 2.3, 1.11, 2.3.1
>Reporter: songwanging
>Priority: Minor
> Fix For: 2.4, 1.12, 2.3.2
>
>
> In method "run()" of class: apache-nutch 
> 2.3.1\src\java\org\apache\nutch\fetcher\FetcherReducer.java
> The log level is not correct: after checking "LOG.isDebugEnabled()" we 
> should use "LOG.debug(msg, e);", but currently "LOG.info(msg, e);" is used. In 
> this case the log level is inconsistent, and developers may lose debug 
> messages because of this.
> The related source code is as follows:
>  if (LOG.isDebugEnabled()) {
>   LOG.info("Crawl delay for queue: " + fit.queueID
>   + " is set to " + fiq.crawlDelay
>   + " as per robots.txt. url: " + fit.url);
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2254) Charset issues when using -addBinaryContent and -base64 options

2016-04-27 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2254.

Resolution: Fixed

Committed, r6d2bfa9. Thanks, [~fedechicco]!

> Charset issues when using -addBinaryContent and -base64 options
> ---
>
> Key: NUTCH-2254
> URL: https://issues.apache.org/jira/browse/NUTCH-2254
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Federico Bonelli
>Assignee: Sebastian Nagel
>Priority: Minor
> Attachments: base64-nutch.patch
>
>
> The bug is reproducible with these steps:
> # find a site with cp1252 encoded pages like "http://www.ilsole24ore.com/" 
> and characters with accents (byte representation >127, like [àèéìòù])
> # start a crawl on that site indexing on Solr with options -addBinaryContent 
> -base64
> # find a document inside the newly indexed Solr collection with those 
> accented characters
> # get the base64 binary representation for said html page and decode it back 
> to raw binary, save it
> The file obtained will have invalid characters, which are neither UTF-8 nor 
> cp1252.
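> A hedged sketch of the charset-safe direction: base64-encode the raw fetched 
> bytes directly (here assumed to come from o.a.n.protocol.Content), instead of 
> round-tripping them through a String with an assumed encoding:
> {code}
> byte[] raw = content.getContent(); // bytes exactly as fetched (cp1252 here)
> String b64 = java.util.Base64.getEncoder().encodeToString(raw);
> // decoding b64 later yields the original cp1252 bytes unchanged
> {code}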



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2254) Charset issues when using -addBinaryContent and -base64 options

2016-04-25 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15256225#comment-15256225
 ] 

Sebastian Nagel commented on NUTCH-2254:


Hi [~fedechicco], the patch should work. Thanks!
I'll add a JUnit test, and maybe add a comment.

> Charset issues when using -addBinaryContent and -base64 options
> ---
>
> Key: NUTCH-2254
> URL: https://issues.apache.org/jira/browse/NUTCH-2254
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Federico Bonelli
>Priority: Minor
> Attachments: base64-nutch.patch
>
>
> The bug is reproducible with these steps:
> # find a site with cp1252 encoded pages like "http://www.ilsole24ore.com/" 
> and characters with accents (byte representation >127, like [àèéìòù])
> # start a crawl on that site indexing on Solr with options -addBinaryContent 
> -base64
> # find a document inside the newly indexed Solr collection with those 
> accented characters
> # get the base64 binary representation for said html page and decode it back 
> to raw binary, save it
> The file obtained will have invalid characters, which are neither UTF-8 nor 
> cp1252.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (NUTCH-2254) Charset issues when using -addBinaryContent and -base64 options

2016-04-25 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2254?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2254:
--

Assignee: Sebastian Nagel

> Charset issues when using -addBinaryContent and -base64 options
> ---
>
> Key: NUTCH-2254
> URL: https://issues.apache.org/jira/browse/NUTCH-2254
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.11
>Reporter: Federico Bonelli
>Assignee: Sebastian Nagel
>Priority: Minor
> Attachments: base64-nutch.patch
>
>
> The bug is reproducible with these steps:
> # find a site with cp1252 encoded pages like "http://www.ilsole24ore.com/" 
> and characters with accents (byte representation >127, like [àèéìòù])
> # start a crawl on that site indexing on Solr with options -addBinaryContent 
> -base64
> # find a document inside the newly indexed Solr collection with those 
> accented characters
> # get the base64 binary representation for said html page and decode it back 
> to raw binary, save it
> The file obtained will have invalid characters, which are neither UTF-8 nor 
> cp1252.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2256.

   Resolution: Fixed
Fix Version/s: (was: 2.3.2)

Fixed and committed to 1.x (r0e03daf) and 2.x (r1fc254e).

> Inconsistent log level practice
> ---
>
> Key: NUTCH-2256
> URL: https://issues.apache.org/jira/browse/NUTCH-2256
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.2, 2.3, 1.11, 2.3.1
>Reporter: songwanging
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.12
>
>
> In method "run()" of class: apache-nutch 
> 2.3.1\src\java\org\apache\nutch\fetcher\FetcherReducer.java
> The log level is not correct: after checking "LOG.isDebugEnabled()" we 
> should use "LOG.debug(msg, e);", but currently "LOG.info(msg, e);" is used. In 
> this case the log level is inconsistent, and developers may lose debug 
> messages because of this.
> The related source code is as follows:
>  if (LOG.isDebugEnabled()) {
>   LOG.info("Crawl delay for queue: " + fit.queueID
>   + " is set to " + fiq.crawlDelay
>   + " as per robots.txt. url: " + fit.url);
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Closed] (NUTCH-2256) Inconsistent log level practice

2016-04-29 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel closed NUTCH-2256.
--

Also did a grep on all Java files for errors of the same kind - nothing found. 
Thanks, [~songwang]!

> Inconsistent log level practice
> ---
>
> Key: NUTCH-2256
> URL: https://issues.apache.org/jira/browse/NUTCH-2256
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 2.2, 2.3, 1.11, 2.3.1
>Reporter: songwanging
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 2.4, 1.12
>
>
> In method "run()" of class: apache-nutch 
> 2.3.1\src\java\org\apache\nutch\fetcher\FetcherReducer.java
> The log level is not correct: after checking "LOG.isDebugEnabled()" we 
> should use "LOG.debug(msg, e);", but currently "LOG.info(msg, e);" is used. In 
> this case the log level is inconsistent, and developers may lose debug 
> messages because of this.
> The related source code is as follows:
>  if (LOG.isDebugEnabled()) {
>   LOG.info("Crawl delay for queue: " + fit.queueID
>   + " is set to " + fiq.crawlDelay
>   + " as per robots.txt. url: " + fit.url);
> }



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (NUTCH-2164) Inconsistent 'Modified Time' in crawl db

2016-05-19 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2164?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2164:
---
Fix Version/s: 1.13

> Inconsistent 'Modified Time' in crawl db
> 
>
> Key: NUTCH-2164
> URL: https://issues.apache.org/jira/browse/NUTCH-2164
> Project: Nutch
>  Issue Type: Improvement
>  Components: crawldb, fetcher
>Affects Versions: 1.11
>Reporter: Thamme Gowda N
>Priority: Minor
> Fix For: 1.13
>
>
> The 'Modified time' in crawldb is invalid. It is set to (0-Timezone 
> Difference)
> *How to verify/reproduce:*
>   Run 'nutch readdb /path/to/crawldb -dump yy' and then inspect content of 
> 'yy'
> The following improvements can be done:
> 1. Set modified time by DefaultFetchSchedule
> 2. Set ProtocolStatus.lastModified if modified time is available in protocol 
> response headers
> This issue is also discussed in dev mailing lists: 
> http://www.mail-archive.com/dev@nutch.apache.org/msg19803.html#



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1858) Migrate Nutch documentation from Moin Moin to Confluence

2016-05-19 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15291591#comment-15291591
 ] 

Sebastian Nagel commented on NUTCH-1858:


It's hardly work for a single person. First steps could be
- an inventory and classification of the old wiki to decide what to keep / 
what needs revision / what to skip
- defining the new structure so that everything to be kept gets the right place

If this is discussed and agreed on, the real work starts and should not take 
too long, or we end up with two wikis (one outdated, the other incomplete).

Or does it make sense to automatically convert the content and do the clean-up 
afterwards?

> Migrate Nutch documentation from Moin Moin to Confluence
> 
>
> Key: NUTCH-1858
> URL: https://issues.apache.org/jira/browse/NUTCH-1858
> Project: Nutch
>  Issue Type: Task
>  Components: documentation
>Reporter: Lewis John McGibbney
>Priority: Critical
>
> We've had initial support for moving the documentation out of MoinMoin and on 
> to Confluence.
> If we manage confluence correctly then this 'could' pave a path for us having 
> much more structured, meaningful and useful documentation over MoinMoin.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (NUTCH-2252) Allow phantomjs as a browser for selenium options

2016-05-08 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2252:


Tests fail to compile 
[[1|https://builds.apache.org/job/Nutch-trunk/3365/console]]:
{noformat}
   [javac] 
/home/jenkins/jenkins-slave/workspace/Nutch-trunk/src/test/org/apache/nutch/tools/TestCommonCrawlDataDumper.java:104:
 error: method dump in class CommonCrawlDataDumper cannot be applied to given 
types;
   [javac]  dumper.dump(tempDir, sampleSegmentDir, false, null, 
false, "", false);
   [javac]^
   [javac]   required: File,File,File,boolean,String[],boolean,String,boolean
   [javac]   found: File,File,boolean,,boolean,String,boolean

{noformat}

> Allow phantomjs as a browser for selenium options
> -
>
> Key: NUTCH-2252
> URL: https://issues.apache.org/jira/browse/NUTCH-2252
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.12
>Reporter: Kim Whitehall
>Assignee: Chris A. Mattmann
>Priority: Trivial
> Fix For: 1.12
>
>
> Adding phantomjs libraries to lib-selenium so you can choose this as a 
> browser with the selenium option



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2242) lastModified not always set

2016-05-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15280076#comment-15280076
 ] 

Sebastian Nagel commented on NUTCH-2242:


Opened pull request [#108|https://github.com/apache/nutch/pull/108] to fix this 
issue (and NUTCH-2164).

> lastModified not always set
> ---
>
> Key: NUTCH-2242
> URL: https://issues.apache.org/jira/browse/NUTCH-2242
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.12
>
> Attachments: NUTCH-2242.patch
>
>
> I observed two issues:
> - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not 
> updated on the first successful fetch. 
> - When a document modification is detected (protocol- or signature-wise), the 
> modifiedTime isn't updated
> I can provide a patch later today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2242) lastModified not always set

2016-05-11 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2242?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15279942#comment-15279942
 ] 

Sebastian Nagel commented on NUTCH-2242:


[~markus17]: Sorry, I didn't upload a final patch, simply because the solution 
on github (see 
[diff|https://github.com/apache/nutch/compare/master...sebastian-nagel:NUTCH-2164])
 was not fully tested yet. I'll prepare a final patch / pull request.
[~jurian]: Setting the modified time in the CrawlDb is done by 
AdaptiveFetchSchedule and (now) by DefaultFetchSchedule. It does not really 
make sense to do this twice. Also, (if done at this place) it would overwrite 
the modified time, e.g., one detected by a signature comparison.

> lastModified not always set
> ---
>
> Key: NUTCH-2242
> URL: https://issues.apache.org/jira/browse/NUTCH-2242
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.11
>Reporter: Jurian Broertjes
>Priority: Minor
> Fix For: 1.12
>
> Attachments: NUTCH-2242.patch
>
>
> I observed two issues:
> - When using the DefaultFetchSchedule, CrawlDatum's modifiedTime field is not 
> updated on the first successful fetch. 
> - When a document modification is detected (protocol- or signature-wise), the 
> modifiedTime isn't updated
> I can provide a patch later today.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-1785) Ability to index raw content

2016-04-20 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1785?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15250812#comment-15250812
 ] 

Sebastian Nagel commented on NUTCH-1785:


The class o.a.n.indexer.NutchField supports only a couple of classes as 
document field values: String, Boolean, Integer, Long, Float, Date.  But also 
IndexWriter implementations (indexer plugins) must support all used data types, 
or the data must provide a meaningful toString() method. In the case of byte[], 
toString() does not return a meaningful String (you hardly want to index 
{{[B@13afed55}}).  The conversion via {{new String(bytes)}} isn't stable, cf. 
NUTCH-1807.  However, it is a clean, readable string, though it may not 
preserve bytes/characters from the original.  That's probably the intention.

Maybe it's better anyway to preserve the original encoding, esp. for base64 
where a String representation is defined.  Please open a new issue for your 
problem.  Can you give an example of the charset issue?
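To illustrate why {{new String(bytes)}} is unstable (a hedged example):
{code}
byte[] bytes = { (byte) 0xE0 };          // 'à' in cp1252
String viaDefault = new String(bytes);   // result depends on file.encoding
String stable = java.util.Base64.getEncoder().encodeToString(bytes); // "4A=="
{code}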

> Ability to index raw content
> 
>
> Key: NUTCH-1785
> URL: https://issues.apache.org/jira/browse/NUTCH-1785
> Project: Nutch
>  Issue Type: New Feature
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.11
>
> Attachments: NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, 
> NUTCH-1785-trunk.patch, NUTCH-1785-trunk.patch, NUTCH-1785-trunkv2.patch
>
>
> Some use-cases require Nutch to actually write the raw content a configured 
> indexing back-end. Since Content is never read, a plugin is out of the 
> question and therefore we need to force IndexJob to process Content as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Resolved] (NUTCH-2191) Add protocol-htmlunit

2016-04-18 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2191.

Resolution: Fixed

Merged pull request #105. Build should succeed now. Thanks, [~karanjeets]!

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch, 
> NUTCH-2191.patch
>
>
> HtmlUnit is, opposed to other Javascript enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Reopened] (NUTCH-2191) Add protocol-htmlunit

2016-04-18 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2191?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reopened NUTCH-2191:


Build fails because protocol-htmlunit's build.xml claims to have unit tests but 
there aren't any:
{noformat}
/home/jenkins/.../src/plugin/protocol-htmlunit/build.xml:41: 
.../src/plugin/protocol-htmlunit/src/test does not exist.
{noformat}
[~karanjeets], are the tests accidentally not included in [pull request 
100|https://github.com/apache/nutch/pull/100]?

> Add protocol-htmlunit
> -
>
> Key: NUTCH-2191
> URL: https://issues.apache.org/jira/browse/NUTCH-2191
> Project: Nutch
>  Issue Type: New Feature
>  Components: protocol
>Affects Versions: 1.11
>Reporter: Markus Jelsma
>Assignee: Chris A. Mattmann
>  Labels: memex
> Fix For: 1.12
>
> Attachments: NUTCH-2191.patch, NUTCH-2191.patch, NUTCH-2191.patch, 
> NUTCH-2191.patch
>
>
> HtmlUnit is, opposed to other Javascript enabled headless browsers, a 
> portable library and should therefore be better suited for very large scale 
> crawls. This issue is an attempt to implement protocol-htmlunit.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (NUTCH-2297) CrawlDbReader -stats wrong values for earliest fetch time and shortest interval

2016-08-08 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15411716#comment-15411716
 ] 

Sebastian Nagel commented on NUTCH-2297:


The wrong values are already in the temporary output of the stats job:
# comment out {{fileSystem.delete(tmpFolder, true);}} in 
CrawlDbReader.processStatJobHelper(...)
# dump data in {{crawldb/stat_tmp}} via {{hadoop fs -text ...}}
While there is only one value for each of the minima (scn, fin, ftn), there are 
multiple values for the totals and maxima:
{noformat}
retry 1 148125397
retry 2 82761892
retry 3 41645830
scn 0
scx 7369
sct 14807601
scx 7110
sct 20791107
scx 8390
sct 13135199
... (scx and sct repeating)
scx 7010
sct 17505486
fin 15120000
fix 1360800
fit 1336710211200
fix 1360800
fit 1180199008800
...
fix 1360800
fit 1319982048000
ftn 597986250
ftx 26821441
ftt 35611037001815
ftx 26821441
...
{noformat}
The values for "fin" and "ftn" are already wrong at this point:
{noformat}
# 15120000 sec. = 175 days
% echo $((15120000/(60*60*24)))
175
# 597986250 as "epoche minutes":
% date -u --date=@$((597986250*60))
Thu Dec 20 05:30:00 UTC 3106
{noformat}
Need to trace what's going wrong in the CrawlDbStatMapper / CrawlDbStatCombiner 
/ CrawlDbStatReducer.
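For reference, a combiner that is safe for the minima must emit exactly one 
value per key and produce the same result whether it runs zero, one, or 
several times; a hedged sketch of such a reduce()-style body for the "fin" and 
"ftn" keys (shared by combiner and reducer):
{code}
// key: e.g. "fin"; values: candidate minima emitted by the map tasks
long min = Long.MAX_VALUE;
for (LongWritable v : values) {
  min = Math.min(min, v.get());
}
context.write(key, new LongWritable(min));
{code}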

> CrawlDbReader -stats wrong values for earliest fetch time and shortest 
> interval
> ---
>
> Key: NUTCH-2297
> URL: https://issues.apache.org/jira/browse/NUTCH-2297
> Project: Nutch
>  Issue Type: Bug
>  Components: crawldb
>Affects Versions: 1.13
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.13
>
>
> NUTCH-2286 added min, max and average for fetch interval and fetch time.
> When running in distributed mode (not reproducible in local mode), the values 
> for the minimum (earliest fetch time and shortest fetch interval) may be 
> wrong with implausible values:
> {noformat}
> TOTAL urls: 7180518032
>  shortest fetch interval:175 days, 00:00:00 << 
>  avg fetch interval: 10 days, 08:01:36
>  longest fetch interval: 15 days, 18:00:00
>  earliest fetch time:Thu Dec 20 05:30:00 UTC 3106   << 
>  avg of fetch times: Fri Feb 19 00:07:00 UTC 2016
>  latest fetch time:  Mon Jul 18 05:22:00 UTC 2016
>  retry 0:6907984913
>  retry 1:148125397
>  retry 2:82761892
>  retry 3:41645830
>  min score:  0.0
>  avg score:  0.014360981
>  max score:  9.25
>  ...
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2297) CrawlDbReader -stats wrong values for earliest fetch time and shortest interval

2016-08-08 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2297:
--

 Summary: CrawlDbReader -stats wrong values for earliest fetch time 
and shortest interval
 Key: NUTCH-2297
 URL: https://issues.apache.org/jira/browse/NUTCH-2297
 Project: Nutch
  Issue Type: Bug
  Components: crawldb
Affects Versions: 1.13
Reporter: Sebastian Nagel
Assignee: Sebastian Nagel
Priority: Minor
 Fix For: 1.13


NUTCH-2286 added min, max and average for fetch interval and fetch time.
When running in distributed mode (not reproducible in local mode), the values 
for the minimum (earliest fetch time and shortest fetch interval) may be wrong 
with implausible values:
{noformat}
TOTAL urls: 7180518032
 shortest fetch interval:175 days, 00:00:00 << 
 avg fetch interval: 10 days, 08:01:36
 longest fetch interval: 15 days, 18:00:00
 earliest fetch time:Thu Dec 20 05:30:00 UTC 3106   << 
 avg of fetch times: Fri Feb 19 00:07:00 UTC 2016
 latest fetch time:  Mon Jul 18 05:22:00 UTC 2016
 retry 0:6907984913
 retry 1:148125397
 retry 2:82761892
 retry 3:41645830
 min score:  0.0
 avg score:  0.014360981
 max score:  9.25
 ...
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (NUTCH-2291) Fix mrunit dependencies

2016-06-30 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2291:
--

 Summary: Fix mrunit dependencies
 Key: NUTCH-2291
 URL: https://issues.apache.org/jira/browse/NUTCH-2291
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.13
Reporter: Sebastian Nagel
Priority: Blocker
 Fix For: 1.13


The Jenkins builds fail with a NoClassDefFoundError, see [build #3376 
log|https://builds.apache.org/job/Nutch-trunk/3376/testReport/org.apache.nutch.crawl/TestCrawlDbStates/testCrawlDbStatTransitionInject/].
 The missing class org/mockito/stubbing/Answer is part of 
build/test/lib/mockito-core-1.9.5.jar which was a dependency of mrunit 
(screenshot mrunit-deps-cached.png). After removing mrunit from my local ivy 
cache ({{rm -rf ~/.ivy2/cache/org.apache.mrunit/}}), mrunit lost mockito as a 
dependency (screenshot mrunit-deps-new.png) and the build failure became 
reproducible.
I don't understand what triggered the loss of the transitive dependency: the 
upgrade to Hadoop 2.7.2 (NUTCH-2236) or the addition of 
{{maven:classifier="hadoop2"}} in commit 
[7956daee|https://git-wip-us.apache.org/repos/asf?p=nutch.git;a=commitdiff;h=7956daee8ac91180070f92949ecf99deae9b5ef0#patch2].




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

