[jira] [Created] (NUTCH-2782) protocol-http / lib-http: support TLSv1.3
Sebastian Nagel created NUTCH-2782: -- Summary: protocol-http / lib-http: support TLSv1.3 Key: NUTCH-2782 URL: https://issues.apache.org/jira/browse/NUTCH-2782 Project: Nutch Issue Type: Improvement Components: plugin, protocol Affects Versions: 1.16 Reporter: Sebastian Nagel Fix For: 1.18 [TLSv1.3| https://en.wikipedia.org/wiki/Transport_Layer_Security#TLS_1.3] (since 2018) is not included in the list of supported protocols in lib-http ([HttpBase.java, line 311|https://github.com/apache/nutch/blob/dcbb0f2bf450c6bec6f45125c68f5c7a0f061474/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java#L311]). It should be added. Also the list of supported ciphers needs to be updated accordingly. -- This message was sent by Atlassian Jira (v8.3.4#803005)
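For illustration only, here is a minimal sketch of how a Java client can enable TLSv1.3 where the JVM supports it. This is not the code in HttpBase.java; the class name `TlsProtocolSelection` and the helper `selectProtocols` are hypothetical, and only standard `javax.net.ssl` calls are used.

```java
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import javax.net.ssl.SSLContext;

public class TlsProtocolSelection {

  /**
   * Hypothetical helper: keeps only those wanted protocol names that the JVM
   * reports as supported, preserving the order of the wanted list.
   */
  public static String[] selectProtocols(List<String> wanted, String[] supportedByJvm) {
    Set<String> supported = new LinkedHashSet<>(Arrays.asList(supportedByJvm));
    List<String> enabled = new ArrayList<>();
    for (String protocol : wanted) {
      if (supported.contains(protocol)) {
        enabled.add(protocol);
      }
    }
    return enabled.toArray(new String[0]);
  }

  public static void main(String[] args) throws NoSuchAlgorithmException {
    // Protocols we would like to offer; TLSv1.3 is only kept if the JVM knows it.
    List<String> wanted = Arrays.asList("TLSv1.3", "TLSv1.2");
    String[] supported =
        SSLContext.getDefault().getSupportedSSLParameters().getProtocols();
    String[] enabled = selectProtocols(wanted, supported);
    // The result could be passed to SSLSocket#setEnabledProtocols(String[]).
    System.out.println(Arrays.toString(enabled));
  }
}
```

The same filtering could be applied to cipher suite names (from `SSLParameters#getCipherSuites()`), so that suites unknown to an older JVM are dropped rather than causing an error.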
[jira] [Assigned] (NUTCH-1103) Port protocol-sftp to 1.4
[ https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shashanka Balakuntala Srinivasa reassigned NUTCH-1103: -- Assignee: Shashanka Balakuntala Srinivasa > Port protocol-sftp to 1.4 > - > > Key: NUTCH-1103 > URL: https://issues.apache.org/jira/browse/NUTCH-1103 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Assignee: Shashanka Balakuntala Srinivasa >Priority: Minor > > Port protocol-sftp from trunk back to 1.4 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-1194) Generator: CrawlDB lock should be released earlier
[ https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090639#comment-17090639 ]

ASF GitHub Bot commented on NUTCH-1194:
---------------------------------------

sebastian-nagel opened a new pull request #514:
URL: https://github.com/apache/nutch/pull/514

- release the CrawlDb lock after the select step if generated items are not marked in the CrawlDb (generate.update.crawldb is false)

This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Generator: CrawlDB lock should be released earlier
> --------------------------------------------------
>
> Key: NUTCH-1194
> URL: https://issues.apache.org/jira/browse/NUTCH-1194
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.17
>
> Lock on the CrawlDB is released when everything is finished. But when
> generating many segments, the lock remains in place while it's not necessary
> anymore. If GENERATE_UPDATE_DB is false we can release the lock immediately
> after the selector has finished.
[GitHub] [nutch] sebastian-nagel opened a new pull request #514: NUTCH-1194 Generator: CrawlDB lock should be released earlier
sebastian-nagel opened a new pull request #514:
URL: https://github.com/apache/nutch/pull/514

- release the CrawlDb lock after the select step if generated items are not marked in the CrawlDb (generate.update.crawldb is false)
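The idea in this pull request can be sketched roughly as follows. This is not the Generator code from the PR: it uses a plain local lock file via `java.nio.file` instead of Nutch's own lock handling, and the class `EarlyLockRelease`, the `Step` interface and the three step arguments are hypothetical stand-ins for the select, partition and CrawlDb update phases.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

public class EarlyLockRelease {

  /** Hypothetical stand-in for one phase of a generator run. */
  public interface Step {
    void run() throws IOException;
  }

  /**
   * Holds the lock during the select step. If the CrawlDb is not updated
   * afterwards, the lock is released right after select; otherwise it is
   * held until the update step has finished.
   */
  public static void generate(Path lock, boolean updateCrawlDb,
      Step select, Step partition, Step update) throws IOException {
    Files.createFile(lock); // throws if another run already holds the lock
    boolean locked = true;
    try {
      select.run();
      if (!updateCrawlDb) {
        Files.delete(lock); // early release: the CrawlDb is not touched anymore
        locked = false;
      }
      partition.run();
      if (updateCrawlDb) {
        update.run();
      }
    } finally {
      if (locked) {
        Files.deleteIfExists(lock);
      }
    }
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("crawldb-demo");
    Path lock = dir.resolve(".locked");
    generate(lock, false,
        () -> System.out.println("select, lock held: " + Files.exists(lock)),
        () -> System.out.println("partition, lock held: " + Files.exists(lock)),
        () -> System.out.println("update"));
  }
}
```

With the update flag set to false, the partition step (which may take long when many segments are generated) runs without blocking other jobs that need the lock.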
[jira] [Updated] (NUTCH-1194) Generator: CrawlDB lock should be released earlier
[ https://issues.apache.org/jira/browse/NUTCH-1194?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-1194:
-----------------------------------
Summary: Generator: CrawlDB lock should be released earlier (was: CrawlDB lock should be released earlier)

> Generator: CrawlDB lock should be released earlier
> --------------------------------------------------
>
> Key: NUTCH-1194
> URL: https://issues.apache.org/jira/browse/NUTCH-1194
> Project: Nutch
> Issue Type: Improvement
> Components: generator
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.17
>
> Lock on the CrawlDB is released when everything is finished. But when
> generating many segments, the lock remains in place while it's not necessary
> anymore. If GENERATE_UPDATE_DB is false we can release the lock immediately
> after the selector has finished.
[jira] [Resolved] (NUTCH-2274) InteractiveSelenium Plugin's DefaultHandler Returns Null
[ https://issues.apache.org/jira/browse/NUTCH-2274?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2274.
------------------------------------
Fix Version/s: (was: 1.17)
Resolution: Abandoned

Nutch now uses Selenium 3.141.5 (after NUTCH-2676). Closing this issue as it likely does not apply to recent Nutch versions. Thanks anyway, [~bmzhao]!

> InteractiveSelenium Plugin's DefaultHandler Returns Null
> --------------------------------------------------------
>
> Key: NUTCH-2274
> URL: https://issues.apache.org/jira/browse/NUTCH-2274
> Project: Nutch
> Issue Type: Bug
> Components: plugin
> Affects Versions: 1.11
> Reporter: Brian Zhao
> Assignee: Lewis John McGibbney
> Priority: Major
>
> The Interactive Selenium plugin's DefaultHandler.java always returns null for
> its "processDriver(WebDriver driver)" method.
> It should (probably?) instead return the body of the html:
> public String processDriver(WebDriver driver) {
>   return driver.findElement(By.tagName("body")).getAttribute("innerHTML");
> }
[jira] [Resolved] (NUTCH-2385) 1.x Elasticsearch Indexer - path.home is not configured
[ https://issues.apache.org/jira/browse/NUTCH-2385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2385.
------------------------------------
Fix Version/s: (was: 1.17)
Resolution: Abandoned

Nutch now uses the Elasticsearch REST client v7.3.0, this shouldn't be a problem anymore. Thanks for reporting, [~sjwoodard]!

> 1.x Elasticsearch Indexer - path.home is not configured
> -------------------------------------------------------
>
> Key: NUTCH-2385
> URL: https://issues.apache.org/jira/browse/NUTCH-2385
> Project: Nutch
> Issue Type: Bug
> Components: indexer
> Affects Versions: 1.13
> Environment: Ubuntu 16.04, Nutch 1.13 binaries, Amazon ElasticSearch 2.3
> Reporter: Steven W
> Priority: Major
>
> Running Nutch 1.13 binaries, and configured to use indexer-elastic throws
> this error when indexing:
> java.lang.Exception: java.lang.IllegalStateException: path.home is not configured
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
> Caused by: java.lang.IllegalStateException: path.home is not configured
>     at org.elasticsearch.env.Environment.<init>(Environment.java:101)
>     at org.elasticsearch.node.internal.InternalSettingsPreparer.prepareEnvironment(InternalSettingsPreparer.java:81)
>     at org.elasticsearch.node.Node.<init>(Node.java:140)
>     at org.elasticsearch.node.NodeBuilder.build(NodeBuilder.java:143)
>     at org.elasticsearch.node.NodeBuilder.node(NodeBuilder.java:150)
>     at org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.makeClient(ElasticIndexWriter.java:141)
>     at org.apache.nutch.indexwriter.elastic.ElasticIndexWriter.open(ElasticIndexWriter.java:91)
>     at org.apache.nutch.indexer.IndexWriters.open(IndexWriters.java:77)
>     at org.apache.nutch.indexer.IndexerOutputFormat.getRecordWriter(IndexerOutputFormat.java:39)
>     at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:484)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:414)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
>     at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:745)
[jira] [Comment Edited] (NUTCH-2681) ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0
[ https://issues.apache.org/jira/browse/NUTCH-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090510#comment-17090510 ]

Sebastian Nagel edited comment on NUTCH-2681 at 4/23/20, 11:01 AM:
-------------------------------------------------------------------

Well, Nutch now uses Selenium 3.141.5 (after NUTCH-2676) and Firefox is on version 75. Closing.

Thanks, [~venkata...@hcl.com]!

was (Author: wastl-nagel):
Well, Nutch now uses Selenium 3.141.5 (after NUTCH-2716) and Firefox is on version 75. Closing.

Thanks, [~venkata...@hcl.com]!

> ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0
> -----------------------------------------------------------------------
>
> Key: NUTCH-2681
> URL: https://issues.apache.org/jira/browse/NUTCH-2681
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.15
> Environment:
> * Apache nutch 1.x (https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium)
> * Selenium v2.48.2
> * Firefox 31.4.0
> * Environment: CentOS-7
> Reporter: Venkata Madhusudhana Rao
> Priority: Major
> Fix For: 1.17
>
> Fetching of Ajax content using _*protocol-selenium*_, with the specified
> selenium and firefox versions, while executing _*bin/nutch fetch,*_ below
> ClassCastException thrown
> {quote}Caused by: org.openqa.selenium.WebDriverException: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
> Build info: version: '2.48.2', revision: '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
> System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
> Driver info: driver.version: FirefoxDriver
>     at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
>     at org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
>     at org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
>     at org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
>     at org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
>     at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
>     ... 12 more
> Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
>     at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>     at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
> {quote}
> Also tried with below firefox versions (Firefox: 60.3 oesr (64 bit), Selenium
> : v3.4.0, Geckodriver: 0.23.0 ( 2018-10-04)), ended with same casting
> exception.
[jira] [Updated] (NUTCH-2681) ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0
[ https://issues.apache.org/jira/browse/NUTCH-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2681:
-----------------------------------
Fix Version/s: (was: 1.17)

> ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0
> -----------------------------------------------------------------------
>
> Key: NUTCH-2681
> URL: https://issues.apache.org/jira/browse/NUTCH-2681
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.15
> Environment:
> * Apache nutch 1.x (https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium)
> * Selenium v2.48.2
> * Firefox 31.4.0
> * Environment: CentOS-7
> Reporter: Venkata Madhusudhana Rao
> Priority: Major
>
> Fetching of Ajax content using _*protocol-selenium*_, with the specified
> selenium and firefox versions, while executing _*bin/nutch fetch,*_ below
> ClassCastException thrown
> {quote}Caused by: org.openqa.selenium.WebDriverException: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
> Build info: version: '2.48.2', revision: '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
> System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
> Driver info: driver.version: FirefoxDriver
>     at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
>     at org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
>     at org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
>     at org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
>     at org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
>     at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
>     ... 12 more
> Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
>     at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>     at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
> {quote}
> Also tried with below firefox versions (Firefox: 60.3 oesr (64 bit), Selenium
> : v3.4.0, Geckodriver: 0.23.0 ( 2018-10-04)), ended with same casting
> exception.
[jira] [Resolved] (NUTCH-2681) ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0
[ https://issues.apache.org/jira/browse/NUTCH-2681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel resolved NUTCH-2681.
------------------------------------
Resolution: Abandoned

Well, Nutch now uses Selenium 3.141.5 (after NUTCH-2716) and Firefox is on version 75. Closing.

Thanks, [~venkata...@hcl.com]!

> ClassCastException - Apache Nutch 1.x, Selenium v2.48.2, firefox 31.4.0
> -----------------------------------------------------------------------
>
> Key: NUTCH-2681
> URL: https://issues.apache.org/jira/browse/NUTCH-2681
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.15
> Environment:
> * Apache nutch 1.x (https://github.com/apache/nutch/tree/master/src/plugin/protocol-selenium)
> * Selenium v2.48.2
> * Firefox 31.4.0
> * Environment: CentOS-7
> Reporter: Venkata Madhusudhana Rao
> Priority: Major
> Fix For: 1.17
>
> Fetching of Ajax content using _*protocol-selenium*_, with the specified
> selenium and firefox versions, while executing _*bin/nutch fetch,*_ below
> ClassCastException thrown
> {quote}Caused by: org.openqa.selenium.WebDriverException: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
> Build info: version: '2.48.2', revision: '41bccdd10cf2c0560f637404c2d96164b67d9d67', time: '2015-10-09 13:08:06'
> System info: host: '24labs', ip: '10.0.10.24', os.name: 'Linux', os.arch: 'amd64', os.version: '3.10.0-327.13.1.el7.x86_64', java.version: '1.8.0_191'
> Driver info: driver.version: FirefoxDriver
>     at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:142)
>     at org.openqa.selenium.firefox.internal.FileExtension.writeTo(FileExtension.java:61)
>     at org.openqa.selenium.firefox.internal.ClasspathExtension.writeTo(ClasspathExtension.java:64)
>     at org.openqa.selenium.firefox.FirefoxProfile.installExtensions(FirefoxProfile.java:443)
>     at org.openqa.selenium.firefox.FirefoxProfile.layoutOnDisk(FirefoxProfile.java:421)
>     at org.openqa.selenium.firefox.internal.NewProfileExtensionConnection.start(NewProfileExtensionConnection.java:95)
>     ... 12 more
> Caused by: java.lang.ClassCastException: org.apache.xerces.jaxp.DocumentBuilderFactoryImpl cannot be cast to javax.xml.parsers.DocumentBuilderFactory
>     at javax.xml.parsers.DocumentBuilderFactory.newInstance(Unknown Source)
>     at org.openqa.selenium.firefox.internal.FileExtension.readIdFromInstallRdf(FileExtension.java:95)
> {quote}
> Also tried with below firefox versions (Firefox: 60.3 oesr (64 bit), Selenium
> : v3.4.0, Geckodriver: 0.23.0 ( 2018-10-04)), ended with same casting
> exception.
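A ClassCastException of the form "X cannot be cast to javax.xml.parsers.DocumentBuilderFactory" is commonly a sign that the JAXP API class and the factory implementation were loaded by different class loaders. Whether that was the cause in this report is not confirmed in the thread. The small probe below, with the hypothetical class name `XmlFactoryProbe`, only prints which class loader provides each class in the running JVM.

```java
import javax.xml.parsers.DocumentBuilderFactory;

public class XmlFactoryProbe {

  /** Returns the class name together with the class loader that defined it. */
  public static String describe(Class<?> c) {
    ClassLoader loader = c.getClassLoader();
    // getClassLoader() returns null for classes defined by the bootstrap loader
    String loaderName = (loader == null) ? "bootstrap class loader" : loader.toString();
    return c.getName() + " loaded by " + loaderName;
  }

  public static void main(String[] args) {
    // The API class and the concrete factory the JVM picks at runtime.
    System.out.println(describe(DocumentBuilderFactory.class));
    System.out.println(describe(DocumentBuilderFactory.newInstance().getClass()));
  }
}
```

If the two lines name different loaders, a second copy of the XML API or parser on the classpath would be a reasonable place to start looking.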
[jira] [Updated] (NUTCH-2780) Upgrade index-solr to use Solr 8.5.1
[ https://issues.apache.org/jira/browse/NUTCH-2780?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2780: --- Labels: help-wanted (was: ) > Upgrade index-solr to use Solr 8.5.1 > > > Key: NUTCH-2780 > URL: https://issues.apache.org/jira/browse/NUTCH-2780 > Project: Nutch > Issue Type: Improvement > Components: indexer, plugin >Affects Versions: 1.16 >Reporter: Sebastian Nagel >Priority: Major > Labels: help-wanted > Fix For: 1.17 > > > The indexer-solr plugin should be upgraded to be based on the latest Solr > version (currently, 8.5.1) -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2379) crawl script dedup's crawldb update is slow
[ https://issues.apache.org/jira/browse/NUTCH-2379?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090496#comment-17090496 ] Sebastian Nagel commented on NUTCH-2379: This is addressed in [PR #513|https://github.com/apache/nutch/pull/513]. > crawl script dedup's crawldb update is slow > > > Key: NUTCH-2379 > URL: https://issues.apache.org/jira/browse/NUTCH-2379 > Project: Nutch > Issue Type: Bug > Components: bin >Affects Versions: 1.11 > Environment: shell >Reporter: Michael Coffey >Priority: Minor > Fix For: 1.17 > > > In the standard crawl script, there is a _bin_nutch updatedb command and, > soon after that, a _bin_nutch dedup command. Both of them launch hadoop jobs > with "crawldb /path/to/crawl/db" in their names (in addition to the actual > deduplication job). > In my situation, the "crawldb" job launched by dedup takes twice as long as > the one launched by updatedb. > I notice that the script passes $commonOptions to updatedb but not to dedup. > I suspect that the crawldb update launched by dedup may not be compressing > its output. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-2342) Inlinks are not being indexed as part of index-links plugin
[ https://issues.apache.org/jira/browse/NUTCH-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2342: --- Fix Version/s: 1.17 > Inlinks are not being indexed as part of index-links plugin > --- > > Key: NUTCH-2342 > URL: https://issues.apache.org/jira/browse/NUTCH-2342 > Project: Nutch > Issue Type: Bug > Components: indexer, linkdb >Affects Versions: 1.12 > Environment: We are using linux machines for DEV and UAT. >Reporter: Manish Bassi >Priority: Major > Fix For: 1.17 > > > I have used index-links plugin along with other plugins to index both the > inlinks and outlinks for a given page. But only the outlinks are getting > indexed and not the inlinks. > Due to this issue, even the anchor plugin is not working as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2342) Inlinks are not being indexed as part of index-links plugin
[ https://issues.apache.org/jira/browse/NUTCH-2342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090491#comment-17090491 ] Sebastian Nagel commented on NUTCH-2342: Sounds like a documentation problem: in order to index inlinks and anchors from incoming links the LinkDb must have been created and passed to the indexer job. > Inlinks are not being indexed as part of index-links plugin > --- > > Key: NUTCH-2342 > URL: https://issues.apache.org/jira/browse/NUTCH-2342 > Project: Nutch > Issue Type: Bug > Components: indexer, linkdb >Affects Versions: 1.12 > Environment: We are using linux machines for DEV and UAT. >Reporter: Manish Bassi >Priority: Major > Fix For: 1.17 > > > I have used index-links plugin along with other plugins to index both the > inlinks and outlinks for a given page. But only the outlinks are getting > indexed and not the inlinks. > Due to this issue, even the anchor plugin is not working as expected. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Updated] (NUTCH-2501) allow to set Java heap size when using crawl script in distributed mode
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sebastian Nagel updated NUTCH-2501: --- Summary: allow to set Java heap size when using crawl script in distributed mode (was: Take into account $NUTCH_HEAPSIZE when crawling using crawl script) > allow to set Java heap size when using crawl script in distributed mode > --- > > Key: NUTCH-2501 > URL: https://issues.apache.org/jira/browse/NUTCH-2501 > Project: Nutch > Issue Type: Improvement >Affects Versions: 1.14 >Reporter: Moreno Feltscher >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.17 > > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090480#comment-17090480 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel commented on a change in pull request #279:
URL: https://github.com/apache/nutch/pull/279#discussion_r413699673

## File path: src/bin/crawl ##
@@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
Hi @mfeltscher, this PR is now superseded by #513. I've decided not to add any new environment variables but to document how the task memory can be set using the existing command-line flags:

```
$> bin/crawl -D mapreduce.map.memory.mb=4608 -D mapreduce.map.java.opts=-Xmx4096m \
     -Dmapreduce.reduce.memory.mb=4608 -Dmapreduce.reduce.java.opts=-Xmx4096m ...
```

Thanks for the contribution and the discussion!

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> ------------------------------------------------------------------
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.14
> Reporter: Moreno Feltscher
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.17
[GitHub] [nutch] sebastian-nagel commented on a change in pull request #279: NUTCH-2501: Take NUTCH_HEAPSIZE into account when crawling using crawl script
sebastian-nagel commented on a change in pull request #279:
URL: https://github.com/apache/nutch/pull/279#discussion_r413699673

## File path: src/bin/crawl ##
@@ -171,6 +175,8 @@ fi
 CRAWL_PATH="$1"
 LIMIT="$2"
+JAVA_CHILD_HEAP_MB=`expr "$NUTCH_HEAP_MB" / "$NUM_TASKS"`

Review comment:
Hi @mfeltscher, this PR is now superseded by #513. I've decided not to add any new environment variables but to document how the task memory can be set using the existing command-line flags:

```
$> bin/crawl -D mapreduce.map.memory.mb=4608 -D mapreduce.map.java.opts=-Xmx4096m \
     -Dmapreduce.reduce.memory.mb=4608 -Dmapreduce.reduce.java.opts=-Xmx4096m ...
```

Thanks for the contribution and the discussion!
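In the flags above, the container memory (4608 MB) is set 512 MB higher than the JVM heap (4096 MB), which leaves room for non-heap memory. As a rough sketch of that arithmetic, the hypothetical helper below derives the four options from a heap size and a headroom value. The class `TaskMemoryOptions` and the 512 MB default are assumptions taken from the numbers in the comment, not part of bin/crawl.

```java
import java.util.ArrayList;
import java.util.List;

public class TaskMemoryOptions {

  /**
   * Builds -D options for map and reduce tasks: the container size is the
   * JVM heap plus some headroom for non-heap memory.
   */
  public static List<String> memoryOptions(int heapMb, int headroomMb) {
    int containerMb = heapMb + headroomMb;
    List<String> options = new ArrayList<>();
    for (String task : new String[] {"map", "reduce"}) {
      options.add("-D");
      options.add("mapreduce." + task + ".memory.mb=" + containerMb);
      options.add("-D");
      options.add("mapreduce." + task + ".java.opts=-Xmx" + heapMb + "m");
    }
    return options;
  }

  public static void main(String[] args) {
    // 4096 MB heap + 512 MB headroom = 4608 MB per task container
    System.out.println(String.join(" ", memoryOptions(4096, 512)));
  }
}
```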
[jira] [Commented] (NUTCH-2501) Take into account $NUTCH_HEAPSIZE when crawling using crawl script
[ https://issues.apache.org/jira/browse/NUTCH-2501?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090475#comment-17090475 ]

ASF GitHub Bot commented on NUTCH-2501:
---------------------------------------

sebastian-nagel opened a new pull request #513:
URL: https://github.com/apache/nutch/pull/513

- bin/crawl
  - add hint how to set map and reduce task memory via -D ... options
  - use -D options for all steps (Nutch tools)
  - fix quoting of -D options, e.g. -D plugin.includes='protocol-xyz|parse-xyz'
- bin/nutch
  - document that environment variables are only used in local mode (includes #512 / NUTCH-2781)

> Take into account $NUTCH_HEAPSIZE when crawling using crawl script
> ------------------------------------------------------------------
>
> Key: NUTCH-2501
> URL: https://issues.apache.org/jira/browse/NUTCH-2501
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.14
> Reporter: Moreno Feltscher
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.17
[GitHub] [nutch] sebastian-nagel opened a new pull request #513: NUTCH-2501 allow to set Java heap size when using crawl script in distributed mode
sebastian-nagel opened a new pull request #513:
URL: https://github.com/apache/nutch/pull/513

- bin/crawl
  - add hint how to set map and reduce task memory via -D ... options
  - use -D options for all steps (Nutch tools)
  - fix quoting of -D options, e.g. -D plugin.includes='protocol-xyz|parse-xyz'
- bin/nutch
  - document that environment variables are only used in local mode (includes #512 / NUTCH-2781)
[jira] [Commented] (NUTCH-2781) Increase default Java heap size
[ https://issues.apache.org/jira/browse/NUTCH-2781?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090384#comment-17090384 ] ASF GitHub Bot commented on NUTCH-2781: --- sebastian-nagel opened a new pull request #512: URL: https://github.com/apache/nutch/pull/512 - increase default value for NUTCH_HEAPSIZE to 4096 MB (from 1000 MB) - remove -Dmapred.child.java.opts=-Xmx1000m from default options in bin/crawl This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Increase default Java heap size > --- > > Key: NUTCH-2781 > URL: https://issues.apache.org/jira/browse/NUTCH-2781 > Project: Nutch > Issue Type: Improvement > Components: runtime >Affects Versions: 1.16 >Reporter: Sebastian Nagel >Priority: Minor > Fix For: 1.17 > > > The Nutch run script (bin/nutch) sets a "conservative" Java heap size of 1000 > MB. This default was defined [15 years > ago|https://github.com/apache/nutch/blame/dcbb0f2bf450c6bec6f45125c68f5c7a0f061474/src/bin/nutch#L24]. > It's probably safe to increase the heap size to a value suitable to process > more pages or larger documents. What about 4096 MB? > Note this overlaps with NUTCH-2501 (Java heap size defined via > mapred.child.java.opts in distributed mode). -- This message was sent by Atlassian Jira (v8.3.4#803005)
[GitHub] [nutch] sebastian-nagel opened a new pull request #512: NUTCH-2781 Increase default Java heap size
sebastian-nagel opened a new pull request #512: URL: https://github.com/apache/nutch/pull/512 - increase default value for NUTCH_HEAPSIZE to 4096 MB (from 1000 MB) - remove -Dmapred.child.java.opts=-Xmx1000m from default options in bin/crawl This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Commented] (NUTCH-2779) Upgrade to Tika 1.24.1
[ https://issues.apache.org/jira/browse/NUTCH-2779?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090345#comment-17090345 ] Sebastian Nagel commented on NUTCH-2779: [Tika 1.24.1 is released|https://tika.apache.org/1.24.1/index.html], I'll merge the PR if there are no objections. > Upgrade to Tika 1.24.1 > -- > > Key: NUTCH-2779 > URL: https://issues.apache.org/jira/browse/NUTCH-2779 > Project: Nutch > Issue Type: Improvement > Components: parser, plugin >Affects Versions: 1.16 >Reporter: Sebastian Nagel >Assignee: Sebastian Nagel >Priority: Major > Fix For: 1.17 > > > Tika 1.24.1 should be released soon. I've upgraded Nutch to use the release > candidate: all unit tests pass and processing PDFs, MP3s, etc. works. I'll > open a PR but we need to wait for the final release of 1.24.1 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (NUTCH-2781) Increase default Java heap size
Sebastian Nagel created NUTCH-2781: -- Summary: Increase default Java heap size Key: NUTCH-2781 URL: https://issues.apache.org/jira/browse/NUTCH-2781 Project: Nutch Issue Type: Improvement Components: runtime Affects Versions: 1.16 Reporter: Sebastian Nagel Fix For: 1.17 The Nutch run script (bin/nutch) sets a "conservative" Java heap size of 1000 MB. This default was defined [15 years ago|https://github.com/apache/nutch/blame/dcbb0f2bf450c6bec6f45125c68f5c7a0f061474/src/bin/nutch#L24]. It's probably safe to increase the heap size to a value suitable to process more pages or larger documents. What about 4096 MB? Note this overlaps with NUTCH-2501 (Java heap size defined via mapred.child.java.opts in distributed mode). -- This message was sent by Atlassian Jira (v8.3.4#803005)
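To see which heap limit a JVM actually runs with, a tiny check like the one below can help. It assumes the run script passes the configured size to the JVM as an -Xmx option; the class name `HeapSizeCheck` is hypothetical and not part of Nutch.

```java
public class HeapSizeCheck {

  /** Converts a byte count to whole megabytes (1 MB = 1024 * 1024 bytes). */
  public static long toMegabytes(long bytes) {
    return bytes / (1024L * 1024L);
  }

  public static void main(String[] args) {
    // maxMemory() is the upper bound the JVM will try to use for the heap;
    // depending on the garbage collector it can be somewhat below -Xmx.
    long maxMb = toMegabytes(Runtime.getRuntime().maxMemory());
    System.out.println("Max heap available to this JVM: " + maxMb + " MB");
  }
}
```

Running it once with `java -Xmx1000m HeapSizeCheck` and once with `java -Xmx4096m HeapSizeCheck` shows the difference between the old and the proposed default.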
[jira] [Comment Edited] (NUTCH-1103) Port protocol-sftp to 1.4
[ https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090321#comment-17090321 ] Sebastian Nagel edited comment on NUTCH-1103 at 4/23/20, 6:50 AM: -- Well, looking at the work log - obviously not :( - there is a protocol-sftp in the 2.x branch - plugins shouldn't be difficult to port to 1.x/master - however, it could be a challenge to get the URL stream handler registered, see NUTCH-2429 If you plan to tackle this/these issues, would be great! was (Author: wastl-nagel): Well, looking at the history obviously not :( - there is a protocol-sftp in the 2.x branch - plugins shouldn't be difficult to port to 1.x/master - however, it could be a challenge to get the URL stream handler registered, see NUTCH-2429 If you plan to tackle this/these issues, would be great! > Port protocol-sftp to 1.4 > - > > Key: NUTCH-1103 > URL: https://issues.apache.org/jira/browse/NUTCH-1103 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Priority: Minor > > Port protocol-sftp from trunk back to 1.4 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (NUTCH-1103) Port protocol-sftp to 1.4
[ https://issues.apache.org/jira/browse/NUTCH-1103?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17090321#comment-17090321 ] Sebastian Nagel commented on NUTCH-1103: Well, looking at the history obviously not :( - there is a protocol-sftp in the 2.x branch - plugins shouldn't be difficult to port to 1.x/master - however, it could be a challenge to get the URL stream handler registered, see NUTCH-2429 If you plan to tackle this/these issues, would be great! > Port protocol-sftp to 1.4 > - > > Key: NUTCH-1103 > URL: https://issues.apache.org/jira/browse/NUTCH-1103 > Project: Nutch > Issue Type: New Feature >Reporter: Markus Jelsma >Priority: Minor > > Port protocol-sftp from trunk back to 1.4 -- This message was sent by Atlassian Jira (v8.3.4#803005)
[DISCUSS] Release 1.17 ?
Hi all,

30 issues are done now

  https://issues.apache.org/jira/browse/NUTCH/fixforversion/12346090

including a number of important dependency upgrades:
- Hadoop 3.1 (NUTCH-2777)
- Elasticsearch 7.3.0 REST client (NUTCH-2739)

Thanks to Shashanka Balakuntala Srinivasa for both!

Dependency upgrades to be included (but still open right now):
- Tika 1.24.1
- Solr 8.5.1

The last release (1.16) was in October, so it's definitely not too early to release 1.17.

As usual, we'll check all remaining issues to decide whether they should be fixed now or can be done later in 1.18.

I would be ready to push a release candidate during the next weeks and have already started to work through the remaining issues. Please comment on issues you want to get fixed in 1.17!

Thanks,
Sebastian