[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2019-01-16 Thread Stas Batururimi (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744215#comment-16744215
 ] 

Stas Batururimi commented on NUTCH-2676:


https://github.com/apache/nutch/pull/430

I will probably add more updates during the month.
I have also added an option to ignore the robots.txt file. Let me know if I 
should push it.

> Update to the latest selenium and add code to use chrome and firefox headless 
> mode with the remote web driver
> -
>
>              Key: NUTCH-2676
>              URL: https://issues.apache.org/jira/browse/NUTCH-2676
>          Project: Nutch
>       Issue Type: New Feature
>       Components: protocol
> Affects Versions: 1.15
>         Reporter: Stas Batururimi
>         Priority: Major
>          Fix For: 1.16
>
>      Attachments: Screenshot 2018-11-16 at 18.15.52.png
>
>
>  * Selenium needs to be updated
>  * the remote web driver for Chrome is missing
>  * headless mode needs to be added for both remote Firefox and Chrome in 
> WebDriverBase
>  * use case: Selenium Grid with Docker (1 hub container, several nodes in 
> separate containers, Nutch in another container, streaming to Apache Solr in 
> its own container, i.e. at least 4 different Docker containers)



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2019-01-16 Thread Stas Batururimi (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2676?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744203#comment-16744203
 ] 

Stas Batururimi commented on NUTCH-2676:


[~wastl-nagel] I'm ready to open a pull request but don't see a NUTCH-2676 
branch in the https://github.com/apache/nutch repository. Should I open the 
pull request against master?






[jira] [Commented] (NUTCH-2685) Add README.md file to all exchange plugins

2019-01-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744178#comment-16744178
 ] 

ASF GitHub Bot commented on NUTCH-2685:
---

r0ann3l commented on pull request #429: NUTCH-2685: README.md file for 
exchange-jexl plugin.
URL: https://github.com/apache/nutch/pull/429
 
 
   A README.md file explaining the exchange-jexl plugin usage.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add README.md file to all exchange plugins
> --
>
>              Key: NUTCH-2685
>              URL: https://issues.apache.org/jira/browse/NUTCH-2685
>          Project: Nutch
>       Issue Type: Sub-task
>       Components: documentation, indexer
> Affects Versions: 1.15
>         Reporter: Roannel Fernández Hernández
>         Assignee: Roannel Fernández Hernández
>         Priority: Trivial
>          Fix For: 1.16
>
>
> Adding the README.md file with plugin-specific documentation to all exchange 
> plugins.





[jira] [Created] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong

2019-01-16 Thread Markus Jelsma (JIRA)
Markus Jelsma created NUTCH-2687:


   Summary: Regex for reading title from Content-Disposition is wrong
       Key: NUTCH-2687
       URL: https://issues.apache.org/jira/browse/NUTCH-2687
   Project: Nutch
Issue Type: Bug
  Reporter: Markus Jelsma
  Assignee: Markus Jelsma
   Fix For: 1.16








[jira] [Updated] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong

2019-01-16 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2687:
-
Attachment: NUTCH-2687.patch

> Regex for reading title from Content-Disposition is wrong
> -
>
>         Key: NUTCH-2687
>         URL: https://issues.apache.org/jira/browse/NUTCH-2687
>     Project: Nutch
>  Issue Type: Bug
>    Reporter: Markus Jelsma
>    Assignee: Markus Jelsma
>    Priority: Major
>     Fix For: 1.16
>
> Attachments: NUTCH-2687.patch
>
>
> Given URL: 
> https://www.amuse-project.org/file/download/default/E6D0537647AF1204656076943F4729B0/Koopstra2016_5fOntologically%20classifying%20ERP%20feature,%20the%20NEXT%20method_5fFinal.pdf
> And regex: \\bfilename=['\"](.+)['\"]
> We get the following title:
> Koopstra2016_Ontologically classifying ERP feature, the NEXT 
> method_Final.pdf"; filename*=utf-8'
> Changing the regex to \\bfilename=['\"]([^\"]+) fixes it.





[jira] [Updated] (NUTCH-2687) Regex for reading title from Content-Disposition is wrong

2019-01-16 Thread Markus Jelsma (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2687:
-
Description: 
Given URL: 
https://www.amuse-project.org/file/download/default/E6D0537647AF1204656076943F4729B0/Koopstra2016_5fOntologically%20classifying%20ERP%20feature,%20the%20NEXT%20method_5fFinal.pdf

And regex: \\bfilename=['\"](.+)['\"]

We get the following title:
Koopstra2016_Ontologically classifying ERP feature, the NEXT method_Final.pdf"; 
filename*=utf-8'

Changing the regex to \\bfilename=['\"]([^\"]+) fixes it.
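
The difference between the two patterns can be reproduced in isolation. The sketch below is not the Nutch code: the class and method names are hypothetical, and the Content-Disposition header value is an assumption reconstructed from the title reported above (the real response header is not shown in the issue). It illustrates that the greedy (.+) backtracks only as far as the last quote character in the header, while [^\"]+ stops at the first double quote.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Nutch.
final class ContentDispositionTitle {
    // Pattern quoted in the issue: greedy (.+) runs up to the LAST quote in the header.
    static final Pattern GREEDY = Pattern.compile("\\bfilename=['\"](.+)['\"]");
    // Pattern proposed in the issue: the capture stops at the first double quote.
    static final Pattern FIXED = Pattern.compile("\\bfilename=['\"]([^\"]+)");

    /** Returns the first capture group, or null when the pattern does not match. */
    static String extract(Pattern pattern, String header) {
        Matcher m = pattern.matcher(header);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // ASSUMED header value, reconstructed from the title shown in the issue.
        String header = "attachment; filename=\"Koopstra2016_Ontologically classifying "
                + "ERP feature, the NEXT method_Final.pdf\"; filename*=utf-8''"
                + "Koopstra2016_Ontologically%20classifying%20ERP%20feature%2C%20"
                + "the%20NEXT%20method_Final.pdf";
        System.out.println("greedy: " + extract(GREEDY, header));
        System.out.println("fixed:  " + extract(FIXED, header));
    }
}
```

Note that the proposed pattern still accepts a single quote as the opening delimiter but only terminates on a double quote, so a single-quoted filename value would be captured together with whatever follows it, up to the next double quote or the end of the header.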








[jira] [Commented] (NUTCH-2686) Separate field for mime types mapped by index-more plugin

2019-01-16 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16744053#comment-16744053
 ] 

ASF GitHub Bot commented on NUTCH-2686:
---

r0ann3l commented on pull request #428: NUTCH-2686 New property: 
"moreIndexingFilter.mapMimeTypes.field"
URL: https://github.com/apache/nutch/pull/428
 
 
   Adds a new property, "moreIndexingFilter.mapMimeTypes.field", which names 
the field that the mapped mime type is written to. If this property is NULL 
(the default value), the "type" field is overwritten with the mapped mime type 
(current behavior).
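
The behavior described above can be sketched as follows. This is not the index-more plugin's implementation: it uses a plain Map in place of a Nutch document, and the class name, method name, and the "mappedType" field used in the example are all hypothetical. It only illustrates the two cases, with a null target field standing in for the unset property.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; the real plugin works on Nutch documents, not a Map.
final class MappedMimeTypeField {
    /**
     * Stores the MIME type in the document.
     * targetField == null models the default: "type" is overwritten by the mapped value.
     * Otherwise "type" keeps the original value and the mapped value goes to targetField.
     */
    static void apply(Map<String, String> doc, String originalType,
                      String mappedType, String targetField) {
        doc.put("type", originalType);
        if (mappedType == null) {
            return; // no mapping configured for this MIME type
        }
        if (targetField == null) {
            doc.put("type", mappedType);
        } else {
            doc.put(targetField, mappedType);
        }
    }

    public static void main(String[] args) {
        Map<String, String> byDefault = new HashMap<>();
        apply(byDefault, "application/pdf", "document", null);
        System.out.println(byDefault);

        Map<String, String> separate = new HashMap<>();
        apply(separate, "application/pdf", "document", "mappedType");
        System.out.println(separate);
    }
}
```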
 



> Separate field for mime types mapped by index-more plugin
> -
>
>              Key: NUTCH-2686
>              URL: https://issues.apache.org/jira/browse/NUTCH-2686
>          Project: Nutch
>       Issue Type: Improvement
>       Components: indexer
> Affects Versions: 1.15
>         Reporter: Roannel Fernández Hernández
>         Assignee: Roannel Fernández Hernández
>         Priority: Minor
>          Fix For: 1.16
>
>
> Since [NUTCH-1262|https://issues.apache.org/jira/browse/NUTCH-1262], several 
> mime types can be mapped to a different value. By default, the original 
> value is replaced with the new one. But what if we want to keep the original 
> mime type too? This issue aims to meet that requirement.


