[jira] [Created] (NUTCH-2782) protocol-http / lib-http: support TLSv1.3

2020-04-23 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2782:
--

 Summary: protocol-http / lib-http: support TLSv1.3
 Key: NUTCH-2782
 URL: https://issues.apache.org/jira/browse/NUTCH-2782
 Project: Nutch
  Issue Type: Improvement
  Components: plugin, protocol
Affects Versions: 1.16
Reporter: Sebastian Nagel
 Fix For: 1.18


[TLSv1.3|https://en.wikipedia.org/wiki/Transport_Layer_Security#TLS_1.3] 
(published in 2018) is not included in the list of supported protocols in lib-http 
([HttpBase.java, line 311|https://github.com/apache/nutch/blob/dcbb0f2bf450c6bec6f45125c68f5c7a0f061474/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L311]).
It should be added. The list of supported ciphers also needs to be updated 
accordingly.
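
A minimal sketch of how a protocol list could be checked against what the running JVM offers. The class name, the helper method and the preference list are assumptions made for illustration; they are not taken from HttpBase.java. Only the standard javax.net.ssl API is used. TLSv1.3 is typically reported by JDK 11 and later.

```java
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.net.ssl.SSLContext;

// Hypothetical helper, not part of Nutch.
final class TlsProtocolCheck {

  /** Keeps only the wanted protocols that the JVM reports as supported, in the wanted order. */
  static String[] usableProtocols(List<String> wanted, String[] supportedByJvm) {
    Set<String> supported = new LinkedHashSet<>(Arrays.asList(supportedByJvm));
    return wanted.stream().filter(supported::contains).toArray(String[]::new);
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    // Assumed preference list; the actual list in HttpBase.java may differ.
    List<String> wanted = Arrays.asList("TLSv1.3", "TLSv1.2", "TLSv1.1", "TLSv1");
    // Protocols the default SSLContext of this JVM can negotiate.
    String[] supported = SSLContext.getDefault().getSupportedSSLParameters().getProtocols();
    System.out.println("JVM supports: " + Arrays.toString(supported));
    System.out.println("Usable:       " + Arrays.toString(usableProtocols(wanted, supported)));
  }
}
```

Filtering against the JVM's own list avoids requesting a protocol that an older runtime does not know about.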



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Assigned] (NUTCH-1103) Port protocol-sftp to 1.4

2020-04-23 Thread Shashanka Balakuntala Srinivasa (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shashanka Balakuntala Srinivasa reassigned NUTCH-1103:
--

Assignee: Shashanka Balakuntala Srinivasa

> Port protocol-sftp to 1.4
> -
>
> Key: NUTCH-1103
> URL: https://issues.apache.org/jira/browse/NUTCH-1103
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Shashanka Balakuntala Srinivasa
>Priority: Minor
>
> Port protocol-sftp from trunk back to 1.4





[jira] [Commented] (NUTCH-1194) Generator: CrawlDB lock should be released earlier

2020-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090639#comment-17090639
 ] 

ASF GitHub Bot commented on NUTCH-1194:
---

sebastian-nagel opened a new pull request #514:
URL: https://github.com/apache/nutch/pull/514


   - release the CrawlDb lock after the select step if generated items are not 
marked in the CrawlDb (generate.update.crawldb is false)
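
The idea can be sketched roughly as follows. This is an illustration only: the Lock class and the phase names (select, partition, update) are hypothetical stand-ins and not the Nutch Generator or LockUtil API. The sketch records whether the lock is still held in each phase.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not Nutch code.
final class EarlyLockReleaseSketch {

  /** Stand-in for the CrawlDb lock (assumption, not the Nutch API). */
  static final class Lock {
    private boolean held;
    void acquire() { held = true; }
    void release() { held = false; }
    boolean isHeld() { return held; }
  }

  /** Runs the phases and records whether the lock was held during each one. */
  static List<String> generate(boolean updateCrawlDb) {
    List<String> trace = new ArrayList<>();
    Lock lock = new Lock();
    lock.acquire();
    try {
      trace.add("select:" + lock.isHeld());
      if (!updateCrawlDb) {
        // The CrawlDb is not touched again, so the lock can go right after selection.
        lock.release();
      }
      trace.add("partition:" + lock.isHeld());
      if (updateCrawlDb) {
        trace.add("update:" + lock.isHeld());
      }
    } finally {
      // Release here if the lock was kept for the update phase or an error occurred.
      if (lock.isHeld()) {
        lock.release();
      }
    }
    return trace;
  }

  public static void main(String[] args) {
    System.out.println(generate(false));
    System.out.println(generate(true));
  }
}
```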
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Generator: CrawlDB lock should be released earlier
> --
>
> Key: NUTCH-1194
> URL: https://issues.apache.org/jira/browse/NUTCH-1194
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.17
>
>
> Lock on the CrawlDB is released when everything is finished. But when 
> generating many segments, the lock remains in place while it's not necessary 
> anymore. If GENERATE_UPDATE_DB is false we can release the lock immediately 
> after the selector has finished.





[GitHub] [nutch] sebastian-nagel opened a new pull request #514: NUTCH-1194 Generator: CrawlDB lock should be released earlier

2020-04-23 Thread GitBox


sebastian-nagel opened a new pull request #514:
URL: https://github.com/apache/nutch/pull/514


   - release the CrawlDb lock after the select step if generated items are not 
marked in the CrawlDb (generate.update.crawldb is false)
   







[jira] [Updated] (NUTCH-1194) Generator: CrawlDB lock should be released earlier

2020-04-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1194:
---
Summary: Generator: CrawlDB lock should be released earlier  (was: CrawlDB 
lock should be released earlier)

> Generator: CrawlDB lock should be released earlier
> --
>
> Key: NUTCH-1194
> URL: https://issues.apache.org/jira/browse/NUTCH-1194
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Minor
> Fix For: 1.17
>
>
> Lock on the CrawlDB is released when everything is finished. But when 
> generating many segments, the lock remains in place while it's not necessary 
> anymore. If GENERATE_UPDATE_DB is false we can release the lock immediately 
> after the selector has finished.





[jira] [Resolved] (NUTCH-2274) InteractiveSelenium Plugin's DefaultHandler Returns Null

2020-04-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2274.

Fix Version/s: (was: 1.17)
   Resolution: Abandoned

Nutch now uses Selenium 3.141.5 (after NUTCH-2676). Closing this issue as it 
likely does not apply to the recent Nutch version. Thanks anyway, [~bmzhao]!

> InteractiveSelenium Plugin's DefaultHandler Returns Null
> 
>
> Key: NUTCH-2274
> URL: https://issues.apache.org/jira/browse/NUTCH-2274
> Project: Nutch
>  Issue Type: Bug
>  Components: plugin
>Affects Versions: 1.11
>Reporter: Brian Zhao
>Assignee: Lewis John McGibbney
>Priority: Major
>
> The Interactive Selenium plugin's DefaultHandler.java always returns null for 
> its "processDriver(WebDriver driver)" method. 
> It should (probably?) instead return the body of the html:
>   public String processDriver(WebDriver driver) {
>     return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
>   }





[jira] [Resolved] (NUTCH-2385) 1.x Elasticsearch Indexer - path.home is not configured

2020-04-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2385.

Fix Version/s: (was: 1.17)
   Resolution: Abandoned

Nutch now uses the Elasticsearch REST client v7.3.0, this shouldn't be a 
problem anymore. Thanks for reporting, [~sjwoodard]!

> 1.x Elasticsearch Indexer - path.home is not configured
> ---
>
> Key: NUTCH-2385
> URL: https://issues.apache.org/jira/browse/NUTCH-2385
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer
>Affects Versions: 1.13
> Environment: Ubuntu 16.04, Nutch 1.13 binaries, Amazon ElasticSearch 
> 2.3
>Reporter: Steven W
>Priority: Major
>
> Running the Nutch 1.13 binaries configured to use indexer-elastic throws 
> this error when indexing:
> java.lang.Exception: java.lang.IllegalStateException: path.home is not configured
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.IllegalStateException: path.home is not configured
>     at org.elasticsearch.env.Environment.<init>(Environment.java:101)
>     at org.elasticsearch.node.internal.InternalSettingsPreparer.prepareEnvironment(InternalSettingsPreparer.java:81)
>     at org.elasticsearch.node.Node.<init>(Node.java:140)
>     at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
>     at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:150)
>     at org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.makeClient(ElasticIndexWriter.java:141)
>     at org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.open(ElasticIndexWriter.java:91)
>     at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:77)
>     at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
>     at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)





[jira] [Comment Edited] (NUTCH-2681) ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0

2020-04-23 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090510#comment-17090510
 ] 

Sebastian Nagel edited comment on NUTCH-2681 at 4/23/20, 11:01 AM:
---

Well, Nutch now uses Selenium 3.141.5 (after NUTCH-2676) and Firefox is on 
version 75. Closing. Thanks, [~venkata...@hcl.com]!


was (Author: wastl-nagel):
Well, Nutch now uses Selenium 3.141.5 (after NUTCH-2716) and Firefox is on 
version 75. Closing. Thanks, [~venkata...@hcl.com]!

> ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0
> ---
>
> Key: NUTCH-2681
> URL: https://issues.apache.org/jira/browse/NUTCH-2681
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.15
> Environment: * Apache nutch 1.x 
> (https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium)
>  * Selenium v2.48.2
>  * Firefox 31.4.0
>  * Environment: CentOS-7
>Reporter: Venkata Madhusudhana Rao
>Priority: Major
> Fix For: 1.17
>
>
> Fetching Ajax content using _*protocol-selenium*_ with the specified 
> Selenium and Firefox versions fails: executing _*bin/nutch fetch*_ throws 
> the ClassCastException below:
> {quote}Caused by: org.openqa.selenium.WebDriverException: 
> java.lang.ClassCastException: 
> org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to 
> javax.xml.parsers.DocumentBuilderFactory
> Build info: version: '2.48.2', revision: 
> '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
> System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 
> 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
> Driver info: driver.version: FirefoxDriver
>  at 
> org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
>  at 
> org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
>  at 
> org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
>  at 
> org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
>  at 
> org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
>  at 
> org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
> ... 12 more
> Caused by: java.lang.ClassCastException: 
> org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to 
> javax.xml.parsers.DocumentBuilderFactory
>  at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>  at 
> org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
> {quote}
> I also tried the following versions (Firefox: 60.3 oesr (64 bit), Selenium: 
> v3.4.0, Geckodriver: 0.23.0 (2018-10-04)); this ended with the same 
> ClassCastException.





[jira] [Updated] (NUTCH-2681) ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0

2020-04-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2681:
---
Fix Version/s: (was: 1.17)

> ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0
> ---
>
> Key: NUTCH-2681
> URL: https://issues.apache.org/jira/browse/NUTCH-2681
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.15
> Environment: * Apache nutch 1.x 
> (https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium)
>  * Selenium v2.48.2
>  * Firefox 31.4.0
>  * Environment: CentOS-7
>Reporter: Venkata Madhusudhana Rao
>Priority: Major
>
> Fetching Ajax content using _*protocol-selenium*_ with the specified 
> Selenium and Firefox versions fails: executing _*bin/nutch fetch*_ throws 
> the ClassCastException below:
> {quote}Caused by: org.openqa.selenium.WebDriverException: 
> java.lang.ClassCastException: 
> org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to 
> javax.xml.parsers.DocumentBuilderFactory
> Build info: version: '2.48.2', revision: 
> '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
> System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 
> 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
> Driver info: driver.version: FirefoxDriver
>  at 
> org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
>  at 
> org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
>  at 
> org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
>  at 
> org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
>  at 
> org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
>  at 
> org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
> ... 12 more
> Caused by: java.lang.ClassCastException: 
> org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to 
> javax.xml.parsers.DocumentBuilderFactory
>  at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>  at 
> org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
> {quote}
> I also tried the following versions (Firefox: 60.3 oesr (64 bit), Selenium: 
> v3.4.0, Geckodriver: 0.23.0 (2018-10-04)); this ended with the same 
> ClassCastException.





[jira] [Resolved] (NUTCH-2681) ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0

2020-04-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2681.

Resolution: Abandoned

Well, Nutch now uses Selenium 3.141.5 (after NUTCH-2716) and Firefox is on 
version 75. Closing. Thanks, [~venkata...@hcl.com]!

> ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0
> ---
>
> Key: NUTCH-2681
> URL: https://issues.apache.org/jira/browse/NUTCH-2681
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.15
> Environment: * Apache nutch 1.x 
> (https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium)
>  * Selenium v2.48.2
>  * Firefox 31.4.0
>  * Environment: CentOS-7
>Reporter: Venkata Madhusudhana Rao
>Priority: Major
> Fix For: 1.17
>
>
> Fetching Ajax content using _*protocol-selenium*_ with the specified 
> Selenium and Firefox versions fails: executing _*bin/nutch fetch*_ throws 
> the ClassCastException below:
> {quote}Caused by: org.openqa.selenium.WebDriverException: 
> java.lang.ClassCastException: 
> org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to 
> javax.xml.parsers.DocumentBuilderFactory
> Build info: version: '2.48.2', revision: 
> '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
> System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 
> 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
> Driver info: driver.version: FirefoxDriver
>  at 
> org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
>  at 
> org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
>  at 
> org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
>  at 
> org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
>  at 
> org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
>  at 
> org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
> ... 12 more
> Caused by: java.lang.ClassCastException: 
> org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to 
> javax.xml.parsers.DocumentBuilderFactory
>  at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>  at 
> org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
> {quote}
> I also tried the following versions (Firefox: 60.3 oesr (64 bit), Selenium: 
> v3.4.0, Geckodriver: 0.23.0 (2018-10-04)); this ended with the same 
> ClassCastException.





[jira] [Updated] (NUTCH-2780) Upgrade index-solr to use Solr 8.5.1

2020-04-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2780:
---
Labels: help-wanted  (was: )

> Upgrade index-solr to use Solr 8.5.1
> 
>
> Key: NUTCH-2780
> URL: https://issues.apache.org/jira/browse/NUTCH-2780
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Priority: Major
>  Labels: help-wanted
> Fix For: 1.17
>
>
> The indexer-solr plugin should be upgraded to be based on the latest Solr 
> version (currently, 8.5.1)





[jira] [Commented] (NUTCH-2379) crawl script dedup's crawldb update is slow

2020-04-23 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090496#comment-17090496
 ] 

Sebastian Nagel commented on NUTCH-2379:


This is addressed in [PR #513|https://github.com/apache/nutch/pull/513].

> crawl script dedup's crawldb update is slow 
> 
>
> Key: NUTCH-2379
> URL: https://issues.apache.org/jira/browse/NUTCH-2379
> Project: Nutch
>  Issue Type: Bug
>  Components: bin
>Affects Versions: 1.11
> Environment: shell
>Reporter: Michael Coffey
>Priority: Minor
> Fix For: 1.17
>
>
>  In the standard crawl script, there is a _bin_nutch updatedb command and, 
> soon after that, a _bin_nutch dedup command. Both of them launch hadoop jobs 
> with "crawldb /path/to/crawl/db" in their names (in addition to the actual 
> deduplication job).
> In my situation, the "crawldb" job launched by dedup takes twice as long as 
> the one launched by updatedb.
> I notice that the script passes $commonOptions to updatedb but not to dedup. 
> I suspect that the crawldb update launched by dedup may not be compressing 
> its output.





[jira] [Updated] (NUTCH-2342) Inlinks are not being indexed as part of index-links plugin

2020-04-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2342:
---
Fix Version/s: 1.17

> Inlinks are not being indexed as part of index-links plugin
> ---
>
> Key: NUTCH-2342
> URL: https://issues.apache.org/jira/browse/NUTCH-2342
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, linkdb
>Affects Versions: 1.12
> Environment: We are using linux machines for DEV and UAT.
>Reporter: Manish Bassi
>Priority: Major
> Fix For: 1.17
>
>
> I have used index-links plugin along with other plugins to index both the 
> inlinks and outlinks for a given page. But only the outlinks are getting 
> indexed and not the inlinks.
> Due to this issue, even the anchor plugin is not working as expected.





[jira] [Commented] (NUTCH-2342) Inlinks are not being indexed as part of index-links plugin

2020-04-23 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090491#comment-17090491
 ] 

Sebastian Nagel commented on NUTCH-2342:


Sounds like a documentation problem: to index inlinks and anchors from 
incoming links, the LinkDb must have been created and passed to the indexer job. 

> Inlinks are not being indexed as part of index-links plugin
> ---
>
> Key: NUTCH-2342
> URL: https://issues.apache.org/jira/browse/NUTCH-2342
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, linkdb
>Affects Versions: 1.12
> Environment: We are using linux machines for DEV and UAT.
>Reporter: Manish Bassi
>Priority: Major
> Fix For: 1.17
>
>
> I have used index-links plugin along with other plugins to index both the 
> inlinks and outlinks for a given page. But only the outlinks are getting 
> indexed and not the inlinks.
> Due to this issue, even the anchor plugin is not working as expected.





[jira] [Updated] (NUTCH-2501) allow to set Java heap size when using crawl script in distributed mode

2020-04-23 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2501:
---
Summary: allow to set Java heap size when using crawl script in distributed 
mode  (was: Take into account $NUTCH_HEAPSIZE when crawling using crawl script)

> allow to set Java heap size when using crawl script in distributed mode
> ---
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Moreno Feltscher
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>






[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2020-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090480#comment-17090480
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel commented on a change in pull request #279:
URL: https://github.com/apache/nutch/pull/279#discussion_r413699673



##
File path: src/bin/crawl
##
@@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
   Hi @mfeltscher, this PR is now superseded by #513 - I've decided not to 
add any new environment variables but to document how the task memory can be 
set using the existing command-line flags:
   ```
   $> bin/crawl -D mapreduce.map.memory.mb=4608 -D mapreduce.map.java.opts=-Xmx4096m \
     -Dmapreduce.reduce.memory.mb=4608 -Dmapreduce.reduce.java.opts=-Xmx4096m ...
   ```
   Thanks for the contribution and the discussion!
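
In the command above, the task container size (4608 MB) is larger than the Java heap (-Xmx4096m) by 512 MB, which leaves room for the JVM's non-heap memory. A small sketch of that arithmetic follows. The class and method names are hypothetical, and the 512 MB headroom is simply the difference seen in the example, not a recommended value.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Nutch or Hadoop.
final class TaskMemoryOptions {

  /** Builds -D options for map and reduce tasks: container size = heap + headroom (all in MB). */
  static List<String> options(int heapMb, int headroomMb) {
    int containerMb = heapMb + headroomMb;
    List<String> opts = new ArrayList<>();
    for (String task : new String[] {"map", "reduce"}) {
      opts.add("-D");
      opts.add("mapreduce." + task + ".memory.mb=" + containerMb);
      opts.add("-D");
      opts.add("mapreduce." + task + ".java.opts=-Xmx" + heapMb + "m");
    }
    return opts;
  }

  public static void main(String[] args) {
    // Prints a command line equivalent to the one quoted in the review comment.
    System.out.println("bin/crawl " + String.join(" ", options(4096, 512)) + " ...");
  }
}
```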







> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Moreno Feltscher
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>






[GitHub] [nutch] sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script

2020-04-23 Thread GitBox


sebastian-nagel commented on a change in pull request #279:
URL: https://github.com/apache/nutch/pull/279#discussion_r413699673



##
File path: src/bin/crawl
##
@@ -171,6 +175,8 @@ fi
 
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
   Hi @mfeltscher, this PR is now superseded by #513 - I've decided not to 
add any new environment variables but to document how the task memory can be 
set using the existing command-line flags:
   ```
   $> bin/crawl -D mapreduce.map.memory.mb=4608 -D mapreduce.map.java.opts=-Xmx4096m \
     -Dmapreduce.reduce.memory.mb=4608 -Dmapreduce.reduce.java.opts=-Xmx4096m ...
   ```
   Thanks for the contribution and the discussion!









[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script

2020-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090475#comment-17090475
 ] 

ASF GitHub Bot commented on NUTCH-2501:
---

sebastian-nagel opened a new pull request #513:
URL: https://github.com/apache/nutch/pull/513


   - bin/crawl
     - add hint how to set map and reduce task memory via -D ... options
     - use -D options for all steps (Nutch tools)
     - fix quoting of -D options, e.g. -D plugin.includes='protocol-xyz|parse-xyz'
   - bin/nutch
     - document that environment variables are only used in local mode
   
   (includes #512 / NUTCH-2781)





> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> --
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
>  Issue Type: Improvement
>Affects Versions: 1.14
>Reporter: Moreno Feltscher
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>






[GitHub] [nutch] sebastian-nagel opened a new pull request #513: NUTCH-2501 allow to set Java heap size when using crawl script in distributed mode

2020-04-23 Thread GitBox


sebastian-nagel opened a new pull request #513:
URL: https://github.com/apache/nutch/pull/513


   - bin/crawl
     - add hint how to set map and reduce task memory via -D ... options
     - use -D options for all steps (Nutch tools)
     - fix quoting of -D options, e.g. -D plugin.includes='protocol-xyz|parse-xyz'
   - bin/nutch
     - document that environment variables are only used in local mode
   
   (includes #512 / NUTCH-2781)







[jira] [Commented] (NUTCH-2781) Increase default Java heap size

2020-04-23 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090384#comment-17090384
 ] 

ASF GitHub Bot commented on NUTCH-2781:
---

sebastian-nagel opened a new pull request #512:
URL: https://github.com/apache/nutch/pull/512


   - increase default value for NUTCH_HEAPSIZE to 4096 MB (from 1000 MB)
   - remove -Dmapred.child.java.opts=-Xmx1000m from default options in bin/crawl





> Increase default Java heap size
> ---
>
> Key: NUTCH-2781
> URL: https://issues.apache.org/jira/browse/NUTCH-2781
> Project: Nutch
>  Issue Type: Improvement
>  Components: runtime
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.17
>
>
> The Nutch run script (bin/nutch) sets a "conservative" Java heap size of 1000 
> MB. This default was defined [15 years 
> ago|https://github.com/apache/nutch/blame/dcbb0f2bf450c6bec6f45125c68f5c7a0f061474/src/bin/nutch#L24].
>  It's probably safe to increase the heap size to a value suitable to process 
> more pages or larger documents. What about 4096 MB?
> Note this overlaps with NUTCH-2501 (Java heap size defined via 
> mapred.child.java.opts in distributed mode).





[GitHub] [nutch] sebastian-nagel opened a new pull request #512: NUTCH-2781 Increase default Java heap size

2020-04-23 Thread GitBox


sebastian-nagel opened a new pull request #512:
URL: https://github.com/apache/nutch/pull/512


   - increase default value for NUTCH_HEAPSIZE to 4096 MB (from 1000 MB)
   - remove -Dmapred.child.java.opts=-Xmx1000m from default options in bin/crawl







[jira] [Commented] (NUTCH-2779) Upgrade to Tika 1.24.1

2020-04-23 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090345#comment-17090345
 ] 

Sebastian Nagel commented on NUTCH-2779:


[Tika 1.24.1 is released|https://tika.apache.org/1.24.1/index.html], I'll merge 
the PR if there are no objections.

> Upgrade to Tika 1.24.1
> --
>
> Key: NUTCH-2779
> URL: https://issues.apache.org/jira/browse/NUTCH-2779
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser, plugin
>Affects Versions: 1.16
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.17
>
>
> Tika 1.24.1 should be released soon. I've upgraded Nutch to use the release 
> candidate: all unit tests pass and processing PDFs, MP3s, etc. works. I'll 
> open a PR but we need to wait for the final release of 1.24.1





[jira] [Created] (NUTCH-2781) Increase default Java heap size

2020-04-23 Thread Sebastian Nagel (Jira)
Sebastian Nagel created NUTCH-2781:
--

 Summary: Increase default Java heap size
 Key: NUTCH-2781
 URL: https://issues.apache.org/jira/browse/NUTCH-2781
 Project: Nutch
  Issue Type: Improvement
  Components: runtime
Affects Versions: 1.16
Reporter: Sebastian Nagel
 Fix For: 1.17


The Nutch run script (bin/nutch) sets a "conservative" Java heap size of 1000 
MB. This default was defined [15 years 
ago|https://github.com/apache/nutch/blame/dcbb0f2bf450c6bec6f45125c68f5c7a0f061474/src/bin/nutch#L24].
 It's probably safe to increase the heap size to a value suitable to process 
more pages or larger documents. What about 4096 MB?

Note this overlaps with NUTCH-2501 (Java heap size defined via 
mapred.child.java.opts in distributed mode).





[jira] [Comment Edited] (NUTCH-1103) Port protocol-sftp to 1.4

2020-04-23 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090321#comment-17090321
 ] 

Sebastian Nagel edited comment on NUTCH-1103 at 4/23/20, 6:50 AM:
--

Well, looking at the work log - obviously not :(
- there is a protocol-sftp in the 2.x branch - plugins shouldn't be difficult 
to port to 1.x/master
- however, it could be a challenge to get the URL stream handler registered, 
see NUTCH-2429

If you plan to tackle this issue (or these issues), that would be great!


was (Author: wastl-nagel):
Well, looking at the history obviously not :(
- there is a protocol-sftp in the 2.x branch - plugins shouldn't be difficult 
to port to 1.x/master
- however, it could be a challenge to get the URL stream handler registered, 
see NUTCH-2429
If you plan to tackle this issue (or these issues), that would be great!

> Port protocol-sftp to 1.4
> -
>
> Key: NUTCH-1103
> URL: https://issues.apache.org/jira/browse/NUTCH-1103
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Priority: Minor
>
> Port protocol-sftp from trunk back to 1.4





[jira] [Commented] (NUTCH-1103) Port protocol-sftp to 1.4

2020-04-23 Thread Sebastian Nagel (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090321#comment-17090321
 ] 

Sebastian Nagel commented on NUTCH-1103:


Well, looking at the history obviously not :(
- there is a protocol-sftp in the 2.x branch - plugins shouldn't be difficult 
to port to 1.x/master
- however, it could be a challenge to get the URL stream handler registered, 
see NUTCH-2429
If you plan to tackle this issue (or these issues), that would be great!

> Port protocol-sftp to 1.4
> -
>
> Key: NUTCH-1103
> URL: https://issues.apache.org/jira/browse/NUTCH-1103
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Priority: Minor
>
> Port protocol-sftp from trunk back to 1.4





[DISCUSS] Release 1.17 ?

2020-04-23 Thread Sebastian Nagel
Hi all,

30 issues are done now
  https://issues.apache.org/jira/browse/NUTCH/fixforversion/12346090

including a number of important dependency upgrades:
- Hadoop 3.1 (NUTCH-2777)
- Elasticsearch 7.3.0 REST client (NUTCH-2739)
Thanks to Shashanka Balakuntala Srinivasa for both!

Dependency upgrades to be included (but still open right now):
- Tika 1.24.1
- Solr 8.5.1

The last release (1.16) was in October, so it's definitely not too early to
release 1.17.  As usual, we'll check for each remaining issue whether it should
be fixed now or can be postponed to 1.18.

I would be ready to push a release candidate within the next few weeks and have
already started to work through the remaining issues. Please comment on
issues you want to see fixed in 1.17!

Thanks,
Sebastian