[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
[ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744215#comment-16744215 ]

Stas Batururimi commented on NUTCH-2676:
----------------------------------------

https://github.com/apache/nutch/pull/430

I will probably add more updates during the month. I have also added an option to ignore the robots.txt file. Let me know if I should push it.

> Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
> --------------------------------------------------------------------------------------------------------------
>
> Key: NUTCH-2676
> URL: https://issues.apache.org/jira/browse/NUTCH-2676
> Project: Nutch
> Issue Type: New Feature
> Components: protocol
> Affects Versions: 1.15
> Reporter: Stas Batururimi
> Priority: Major
> Fix For: 1.16
>
> Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
> * Selenium needs to be updated
> * missing remote web driver for chrome
> * necessity to add headless mode for both remote WebDriverBase Firefox & Chrome
> * use case with Selenium grid using docker (1 hub docker container, several nodes in different docker containers, Nutch in another docker container, streaming to Apache Solr in docker container, that is at least 4 different docker containers)

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver
[ https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744203#comment-16744203 ]

Stas Batururimi commented on NUTCH-2676:
----------------------------------------

[~wastl-nagel] I'm ready to make a pull request but don't see any NUTCH-2676 branch in the https://github.com/apache/nutch repository. Should I open the pull request against master?
[jira] [Commented] (NUTCH-2685) Add README.md file to all exchange plugins
[ https://issues.apache.org/jira/browse/NUTCH-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744178#comment-16744178 ]

ASF GitHub Bot commented on NUTCH-2685:
---------------------------------------

r0ann3l commented on pull request #429: NUTCH-2685: README.md file for exchange-jexl plugin.
URL: https://github.com/apache/nutch/pull/429

A README.md file explaining the exchange-jexl plugin usage.

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Add README.md file to all exchange plugins
> ------------------------------------------
>
> Key: NUTCH-2685
> URL: https://issues.apache.org/jira/browse/NUTCH-2685
> Project: Nutch
> Issue Type: Sub-task
> Components: documentation, indexer
> Affects Versions: 1.15
> Reporter: Roannel Fernández Hernández
> Assignee: Roannel Fernández Hernández
> Priority: Trivial
> Fix For: 1.16
>
> Adding the README.md file with plugin-specific documentation to all exchange plugins.
[jira] [Created] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong
Markus Jelsma created NUTCH-2687:
---------------------------------

Summary: Regex for reading title from Content-Disposition is wrong
Key: NUTCH-2687
URL: https://issues.apache.org/jira/browse/NUTCH-2687
Project: Nutch
Issue Type: Bug
Reporter: Markus Jelsma
Assignee: Markus Jelsma
Fix For: 1.16
[jira] [Updated] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong
[ https://issues.apache.org/jira/browse/NUTCH-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2687:
---------------------------------

Attachment: NUTCH-2687.patch

> Regex for reading title from Content-Disposition is wrong
> ---------------------------------------------------------
>
> Key: NUTCH-2687
> URL: https://issues.apache.org/jira/browse/NUTCH-2687
> Project: Nutch
> Issue Type: Bug
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.16
>
> Attachments: NUTCH-2687.patch
>
> Given URL:
> https://www.amuse-project.org/file/download/default/E6D0537647AF1204656076943F4729B0/Koopstra2016_5fOntologically%20classifying%20ERP%20feature,%20the%20NEXT%20method_5fFinal.pdf
>
> And regex: \\bfilename=['\"](.+)['\"]
>
> We get the following title:
> Koopstra2016_Ontologically classifying ERP feature, the NEXT method_Final.pdf"; filename*=utf-8'
>
> Changing the regex to \\bfilename=['\"]([^\"]+) fixes it.
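The greedy match reported in NUTCH-2687 can be reproduced with a minimal Java sketch. The two patterns are copied from the issue. The header value, the class name and the helper method are hypothetical and chosen only for illustration, since the issue does not quote the actual Content-Disposition header or the Nutch code that applies the regex.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; not part of Nutch.
class ContentDispositionTitle {
    // Pattern from the issue: the greedy (.+) runs up to the LAST quote character in the header.
    static final Pattern GREEDY = Pattern.compile("\\bfilename=['\"](.+)['\"]");
    // Pattern proposed in the issue: the capture stops before the next double quote.
    static final Pattern FIXED = Pattern.compile("\\bfilename=['\"]([^\"]+)");

    // Returns the first captured file name, or null when the pattern does not match.
    static String extract(Pattern pattern, String header) {
        Matcher m = pattern.matcher(header);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // Assumed header shape: a quoted filename followed by an RFC 5987 style filename* parameter.
        String header = "attachment; filename=\"report_Final.pdf\"; filename*=utf-8''report_Final.pdf";
        System.out.println(extract(GREEDY, header)); // report_Final.pdf"; filename*=utf-8'
        System.out.println(extract(FIXED, header));  // report_Final.pdf
    }
}
```

Note that the proposed pattern still accepts a single quote as the opening delimiter but only stops at a double quote, so a value wrapped in single quotes would not be cut at its closing quote.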
[jira] [Updated] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong
[ https://issues.apache.org/jira/browse/NUTCH-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-2687:
---------------------------------

Description:
Given URL:
https://www.amuse-project.org/file/download/default/E6D0537647AF1204656076943F4729B0/Koopstra2016_5fOntologically%20classifying%20ERP%20feature,%20the%20NEXT%20method_5fFinal.pdf

And regex: \\bfilename=['\"](.+)['\"]

We get the following title:
Koopstra2016_Ontologically classifying ERP feature, the NEXT method_Final.pdf"; filename*=utf-8'

Changing the regex to \\bfilename=['\"]([^\"]+) fixes it.
[jira] [Commented] (NUTCH-2686) Separate field for mime types mapped by index-more plugin
[ https://issues.apache.org/jira/browse/NUTCH-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744053#comment-16744053 ]

ASF GitHub Bot commented on NUTCH-2686:
---------------------------------------

r0ann3l commented on pull request #428: NUTCH-2686 New property: "moreIndexingFilter.mapMimeTypes.field"
URL: https://github.com/apache/nutch/pull/428

Includes a new property, "moreIndexingFilter.mapMimeTypes.field", which names the field the mapped mime type is written to. If this property is NULL (the default), the field "type" is replaced by the mapped mime type (the current behavior).

> Separate field for mime types mapped by index-more plugin
> ---------------------------------------------------------
>
> Key: NUTCH-2686
> URL: https://issues.apache.org/jira/browse/NUTCH-2686
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.15
> Reporter: Roannel Fernández Hernández
> Assignee: Roannel Fernández Hernández
> Priority: Minor
> Fix For: 1.16
>
> Since [NUTCH-1262|https://issues.apache.org/jira/browse/NUTCH-1262], several mime types can be mapped to a different value. By default, the original value is replaced with the new one. But what if we want to keep the original mime type too? This issue aims to meet that requirement.
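The behavior described for "moreIndexingFilter.mapMimeTypes.field" can be pictured with a small Java sketch. This is not the index-more plugin code: the class, the method, the plain Map standing in for an indexed document, and the sample values ("application/pdf", "pdf", "mappedType") are all hypothetical. Only the rule itself comes from the pull request text: with no field configured the mapped value replaces "type", otherwise it goes into the named field and "type" keeps the original.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; the real plugin works on Nutch documents, not on a Map.
class MimeTypeFieldSketch {
    // targetField plays the role of the property value; null models the default.
    static Map<String, String> index(String originalType, String mappedType, String targetField) {
        Map<String, String> doc = new HashMap<>();
        doc.put("type", originalType);
        if (targetField == null) {
            // Default (current behavior): the mapped value replaces the original in "type".
            doc.put("type", mappedType);
        } else {
            // Property set: "type" keeps the original, the mapped value goes to the named field.
            doc.put(targetField, mappedType);
        }
        return doc;
    }

    public static void main(String[] args) {
        System.out.println(index("application/pdf", "pdf", null));
        System.out.println(index("application/pdf", "pdf", "mappedType"));
    }
}
```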