[jira] [Created] (NUTCH-2676) Update to the latest selenium and add code to use chrome and firefox headless mode with the remote web driver

2018-11-14 Thread Stas Batururimi (JIRA)
Stas Batururimi created NUTCH-2676:
--

 Summary: Update to the latest selenium and add code to use chrome 
and firefox headless mode with the remote web driver
 Key: NUTCH-2676
 URL: https://issues.apache.org/jira/browse/NUTCH-2676
 Project: Nutch
  Issue Type: New Feature
Reporter: Stas Batururimi


 * Selenium needs to be updated to the latest version
 * the remote web driver for Chrome is missing
 * headless mode needs to be added for both remote web drivers, Firefox and 
Chrome
 * use case: Selenium Grid using Docker (1 hub container, several nodes in 
separate containers, Nutch in another container, streaming to Apache Solr in 
a further container, i.e. at least 4 different Docker containers)
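
As a rough illustration of the headless request described above, the sketch 
below builds the capability payload that a remote session request to a Grid 
hub would carry. It is only a sketch under assumptions: the class name 
HeadlessCapabilities and the method forBrowser are hypothetical, the code uses 
only the JDK and does not call Selenium's Java API, and it is not the code 
proposed for Nutch. The capability keys "goog:chromeOptions" and 
"moz:firefoxOptions" and the switches "--headless" / "-headless" are the ones 
the Chrome and Firefox drivers read.

```java
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper for illustration; not part of Nutch or Selenium. */
class HeadlessCapabilities {

  /** Builds a capability map that asks the given browser to start headless. */
  static Map<String, Object> forBrowser(String browserName) {
    Map<String, Object> caps = new LinkedHashMap<>();
    caps.put("browserName", browserName);
    switch (browserName) {
      case "chrome":
        // chromedriver reads command-line switches from goog:chromeOptions.args
        caps.put("goog:chromeOptions", Collections.singletonMap(
            "args", Collections.singletonList("--headless")));
        break;
      case "firefox":
        // geckodriver reads command-line switches from moz:firefoxOptions.args
        caps.put("moz:firefoxOptions", Collections.singletonMap(
            "args", Collections.singletonList("-headless")));
        break;
      default:
        throw new IllegalArgumentException(
            "Unsupported browser: " + browserName);
    }
    return caps;
  }
}
```

With Selenium's Java bindings, an equivalent payload would normally be 
produced by ChromeOptions or FirefoxOptions and handed to RemoteWebDriver 
together with the hub URL; the map above only shows what ends up in the 
request.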



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2630) Fetcher to log skipped records by robots.txt

2018-11-14 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686466#comment-16686466
 ] 

Hudson commented on NUTCH-2630:
---

FAILURE: Integrated in Jenkins build Nutch-trunk #3586 (See 
[https://builds.apache.org/job/Nutch-trunk/3586/])
NUTCH-2630 Fetcher to log skipped records by robots.txt - change (snagel: 
[https://github.com/apache/nutch/commit/54f156cf0deccbfb35bc34b59916f51f48e866d4])
* (edit) src/java/org/apache/nutch/fetcher/FetcherThread.java


> Fetcher to log skipped records by robots.txt
> 
>
> Key: NUTCH-2630
> URL: https://issues.apache.org/jira/browse/NUTCH-2630
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 1.16
>
>
> To analyze problems, it would be helpful if the fetcher logged URLs that are 
> disallowed by robots.txt - see [discussion on user mailing 
> list|https://lists.apache.org/thread.html/7fe5b02104ea866aba183d009a5fad59ad4e4daf8954593ef0123dd6@%3Cuser.nutch.apache.org%3E].





Build failed in Jenkins: Nutch-trunk #3588

2018-11-14 Thread Apache Jenkins Server
See 

--
[...truncated 9.44 KB...]
[javac]   location: class CustomDaoFactory
[javac] 
:47:
 error: cannot find symbol
[javac]   private  void register(Dao dao) {
[javac] ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:53:
 error: cannot find symbol
[javac]   public List> getCreatedDaos() {
[javac]   ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:22:
 error: cannot find symbol
[javac] import com.j256.ormlite.dao.BaseDaoImpl;
[javac]^
[javac]   symbol:   class BaseDaoImpl
[javac]   location: package com.j256.ormlite.dao
[javac] 
:23:
 error: cannot find symbol
[javac] import com.j256.ormlite.dao.Dao;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: package com.j256.ormlite.dao
[javac] 
:25:
 error: cannot find symbol
[javac] import com.j256.ormlite.table.DatabaseTableConfig;
[javac]  ^
[javac]   symbol:   class DatabaseTableConfig
[javac]   location: package com.j256.ormlite.table
[javac] 
:26:
 error: cannot find symbol
[javac] import com.j256.ormlite.table.TableUtils;
[javac]  ^
[javac]   symbol:   class TableUtils
[javac]   location: package com.j256.ormlite.table
[javac] 
:30:
 error: cannot find symbol
[javac]   private ConnectionSource connectionSource;
[javac]   ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomTableCreator
[javac] 
:31:
 error: cannot find symbol
[javac]   private List> configuredDaos;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:33:
 error: cannot find symbol
[javac]   public CustomTableCreator(ConnectionSource connectionSource,
[javac] ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomTableCreator
[javac] 
:34:
 error: cannot find symbol
[javac]   List> configuredDaos) {
[javac]^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:51:
 error: cannot find symbol
[javac]   private void createTableForDao(Dao dao) {
[javac]  ^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:56:
 error: cannot find symbol
[javac]   private DatabaseTableConfig getTableConfig(Dao dao) {
[javac] ^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:56:
 error: cannot find symbol
[javac]   private DatabaseTableConfig getTableConfig(Dao dao) {
[javac]   ^
[javac]   symbol:   class DatabaseTableConfig
[javac]   location: class CustomTableCreator
[javac] 
:68:
 error: cannot find symbol
[javac]   private DatabaseTableConfig getConfigFromClass(Class clazz) 
{
[javac]   ^
[javac]   symbol:   class DatabaseTableConfig
[javac]   location: class CustomTableCreator
[javac] 

Build failed in Jenkins: Nutch-trunk #3587

2018-11-14 Thread Apache Jenkins Server
See 

--
[...truncated 9.44 KB...]
[javac]   location: class CustomDaoFactory
[javac] 
:47:
 error: cannot find symbol
[javac]   private  void register(Dao dao) {
[javac] ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:53:
 error: cannot find symbol
[javac]   public List> getCreatedDaos() {
[javac]   ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:22:
 error: cannot find symbol
[javac] import com.j256.ormlite.dao.BaseDaoImpl;
[javac]^
[javac]   symbol:   class BaseDaoImpl
[javac]   location: package com.j256.ormlite.dao
[javac] 
:23:
 error: cannot find symbol
[javac] import com.j256.ormlite.dao.Dao;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: package com.j256.ormlite.dao
[javac] 
:25:
 error: cannot find symbol
[javac] import com.j256.ormlite.table.DatabaseTableConfig;
[javac]  ^
[javac]   symbol:   class DatabaseTableConfig
[javac]   location: package com.j256.ormlite.table
[javac] 
:26:
 error: cannot find symbol
[javac] import com.j256.ormlite.table.TableUtils;
[javac]  ^
[javac]   symbol:   class TableUtils
[javac]   location: package com.j256.ormlite.table
[javac] 
:30:
 error: cannot find symbol
[javac]   private ConnectionSource connectionSource;
[javac]   ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomTableCreator
[javac] 
:31:
 error: cannot find symbol
[javac]   private List> configuredDaos;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:33:
 error: cannot find symbol
[javac]   public CustomTableCreator(ConnectionSource connectionSource,
[javac] ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomTableCreator
[javac] 
:34:
 error: cannot find symbol
[javac]   List> configuredDaos) {
[javac]^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:51:
 error: cannot find symbol
[javac]   private void createTableForDao(Dao dao) {
[javac]  ^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:56:
 error: cannot find symbol
[javac]   private DatabaseTableConfig getTableConfig(Dao dao) {
[javac] ^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:56:
 error: cannot find symbol
[javac]   private DatabaseTableConfig getTableConfig(Dao dao) {
[javac]   ^
[javac]   symbol:   class DatabaseTableConfig
[javac]   location: class CustomTableCreator
[javac] 
:68:
 error: cannot find symbol
[javac]   private DatabaseTableConfig getConfigFromClass(Class clazz) 
{
[javac]   ^
[javac]   symbol:   class DatabaseTableConfig
[javac]   location: class CustomTableCreator
[javac] 

Build failed in Jenkins: Nutch-trunk #3586

2018-11-14 Thread Apache Jenkins Server
See 


Changes:

[snagel] NUTCH-2630 Fetcher to log skipped records by robots.txt - change

--
[...truncated 9.48 KB...]
[javac] 
:47:
 error: cannot find symbol
[javac]   private  void register(Dao dao) {
[javac] ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:53:
 error: cannot find symbol
[javac]   public List> getCreatedDaos() {
[javac]   ^
[javac]   symbol:   class Dao
[javac]   location: class CustomDaoFactory
[javac] 
:22:
 error: cannot find symbol
[javac] import com.j256.ormlite.dao.BaseDaoImpl;
[javac]^
[javac]   symbol:   class BaseDaoImpl
[javac]   location: package com.j256.ormlite.dao
[javac] 
:23:
 error: cannot find symbol
[javac] import com.j256.ormlite.dao.Dao;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: package com.j256.ormlite.dao
[javac] 
:25:
 error: cannot find symbol
[javac] import com.j256.ormlite.table.DatabaseTableConfig;
[javac]  ^
[javac]   symbol:   class DatabaseTableConfig
[javac]   location: package com.j256.ormlite.table
[javac] 
:26:
 error: cannot find symbol
[javac] import com.j256.ormlite.table.TableUtils;
[javac]  ^
[javac]   symbol:   class TableUtils
[javac]   location: package com.j256.ormlite.table
[javac] 
:30:
 error: cannot find symbol
[javac]   private ConnectionSource connectionSource;
[javac]   ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomTableCreator
[javac] 
:31:
 error: cannot find symbol
[javac]   private List> configuredDaos;
[javac]^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:33:
 error: cannot find symbol
[javac]   public CustomTableCreator(ConnectionSource connectionSource,
[javac] ^
[javac]   symbol:   class ConnectionSource
[javac]   location: class CustomTableCreator
[javac] 
:34:
 error: cannot find symbol
[javac]   List> configuredDaos) {
[javac]^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:51:
 error: cannot find symbol
[javac]   private void createTableForDao(Dao dao) {
[javac]  ^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:56:
 error: cannot find symbol
[javac]   private DatabaseTableConfig getTableConfig(Dao dao) {
[javac] ^
[javac]   symbol:   class Dao
[javac]   location: class CustomTableCreator
[javac] 
:56:
 error: cannot find symbol
[javac]   private DatabaseTableConfig getTableConfig(Dao dao) {
[javac]   ^
[javac]   symbol:   class DatabaseTableConfig
[javac]   location: class CustomTableCreator
[javac] 
:68:
 error: cannot find symbol
[javac]   private DatabaseTableConfig getConfigFromClass(Class clazz) 
{
[javac]   ^
[javac]   symbol:   class DatabaseTableConfig
[javac]   location: class CustomTableCreator
[javac] 

[jira] [Resolved] (NUTCH-2630) Fetcher to log skipped records by robots.txt

2018-11-14 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2630.

Resolution: Fixed

Committed to master/1.x.

> Fetcher to log skipped records by robots.txt
> 
>
> Key: NUTCH-2630
> URL: https://issues.apache.org/jira/browse/NUTCH-2630
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 1.16
>
>
> To analyze problems, it would be helpful if the fetcher logged URLs that are 
> disallowed by robots.txt - see [discussion on user mailing 
> list|https://lists.apache.org/thread.html/7fe5b02104ea866aba183d009a5fad59ad4e4daf8954593ef0123dd6@%3Cuser.nutch.apache.org%3E].





[jira] [Commented] (NUTCH-2630) Fetcher to log skipped records by robots.txt

2018-11-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686424#comment-16686424
 ] 

ASF GitHub Bot commented on NUTCH-2630:
---

sebastian-nagel closed pull request #387: NUTCH-2630 Fetcher to log skipped 
records by robots.txt
URL: https://github.com/apache/nutch/pull/387
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/src/java/org/apache/nutch/fetcher/FetcherThread.java b/src/java/org/apache/nutch/fetcher/FetcherThread.java
index bfcc3741e..6ba920e87 100644
--- a/src/java/org/apache/nutch/fetcher/FetcherThread.java
+++ b/src/java/org/apache/nutch/fetcher/FetcherThread.java
@@ -302,9 +302,7 @@ public void run() {
 if (!rules.isAllowed(fit.url.toString())) {
   // unblock
   ((FetchItemQueues) fetchQueues).finishFetchItem(fit, true);
-  if (LOG.isDebugEnabled()) {
-LOG.debug("Denied by robots.txt: {}", fit.url);
-  }
+  LOG.info("Denied by robots.txt: {}", fit.url);
   output(fit.url, fit.datum, null,
   ProtocolStatus.STATUS_ROBOTS_DENIED,
   CrawlDatum.STATUS_FETCH_GONE);
@@ -315,7 +313,7 @@ public void run() {
   if (rules.getCrawlDelay() > maxCrawlDelay && maxCrawlDelay >= 0) {
 // unblock
 ((FetchItemQueues) fetchQueues).finishFetchItem(fit, true);
-LOG.debug("Crawl-Delay for {} too long ({}), skipping", fit.url,
+LOG.info("Crawl-Delay for {} too long ({}), skipping", fit.url,
 rules.getCrawlDelay());
 output(fit.url, fit.datum, null,
 ProtocolStatus.STATUS_ROBOTS_DENIED,
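
The skip condition in the second hunk can be read in isolation. The sketch 
below is a stand-alone illustration, not the actual FetcherThread code: the 
class RobotsSkipCheck and both method names are hypothetical, and the message 
is built with String.format only to show how the "{}" placeholders of the 
logging call would be filled in.

```java
/** Hypothetical stand-alone illustration; not the actual FetcherThread code. */
class RobotsSkipCheck {

  /**
   * Mirrors the condition shown in the diff: the URL is skipped when the
   * Crawl-Delay from robots.txt exceeds the configured maximum. A negative
   * maximum makes the condition false, so nothing is skipped in that case.
   */
  static boolean crawlDelayTooLong(long crawlDelay, long maxCrawlDelay) {
    return crawlDelay > maxCrawlDelay && maxCrawlDelay >= 0;
  }

  /** Builds the text of the log message with both placeholders filled in. */
  static String skipMessage(String url, long crawlDelay) {
    return String.format(
        "Crawl-Delay for %s too long (%d), skipping", url, crawlDelay);
  }
}
```

Because the change moves both messages from debug to info, skipped URLs show 
up at the default log level without enabling debug output for the fetcher.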


 


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fetcher to log skipped records by robots.txt
> 
>
> Key: NUTCH-2630
> URL: https://issues.apache.org/jira/browse/NUTCH-2630
> Project: Nutch
>  Issue Type: Improvement
>  Components: fetcher
>Affects Versions: 1.15
>Reporter: Markus Jelsma
>Priority: Minor
> Fix For: 1.16
>
>
> To analyze problems, it would be helpful if the fetcher logged URLs that are 
> disallowed by robots.txt - see [discussion on user mailing 
> list|https://lists.apache.org/thread.html/7fe5b02104ea866aba183d009a5fad59ad4e4daf8954593ef0123dd6@%3Cuser.nutch.apache.org%3E].





[jira] [Commented] (NUTCH-2655) Update Solr schema.xml for Solr 7.x

2018-11-14 Thread Hudson (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686277#comment-16686277
 ] 

Hudson commented on NUTCH-2655:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3585 (See 
[https://builds.apache.org/job/Nutch-trunk/3585/])
NUTCH-2655 Update Solr schema.xml for Solr 7.x - add required field (snagel: 
[https://github.com/apache/nutch/commit/1a9f2e6735f4ff6f959b7a441811c09915acd85a])
* (edit) conf/schema.xml


> Update Solr schema.xml for Solr 7.x
> ---
>
> Key: NUTCH-2655
> URL: https://issues.apache.org/jira/browse/NUTCH-2655
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> The Solr schema.xml is not compatible with Solr 7.x which is used by Nutch 
> 1.15. I've tested Solr 7.3.1 and 7.5.0: when using the current schema.xml, 
> Solr fails and complains about unknown field types:
> {noformat}
> 2018-10-15 12:55:24.484 ERROR (qtp102617125-17) [ x:nutch] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error 
> CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: 
> fieldType 'pdates' not found in the schema
> {noformat}





[jira] [Commented] (NUTCH-2606) MIME detection is wrong for plain-text documents send as Content-Type "application/msword"

2018-11-14 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686245#comment-16686245
 ] 

Sebastian Nagel commented on NUTCH-2606:


Any objections? Otherwise I'll commit this fix in the next few days.

> MIME detection is wrong for plain-text documents send as Content-Type 
> "application/msword"
> --
>
> Key: NUTCH-2606
> URL: https://issues.apache.org/jira/browse/NUTCH-2606
> Project: Nutch
>  Issue Type: Bug
>Affects Versions: 1.14
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> Nutch tries to parse plain-text documents sent with Content-Type 
> "application/msword" as Word documents. The MIME detection should be fixed 
> so that these are correctly identified as plain-text documents. See 
> NUTCH-2603 and 
> https://www.atnf.csiro.au/computing/software/gipsy/doc/update.doc





[jira] [Resolved] (NUTCH-2655) Update Solr schema.xml for Solr 7.x

2018-11-14 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2655.

Resolution: Fixed

Fixed/merged. Thanks for the reviews!

> Update Solr schema.xml for Solr 7.x
> ---
>
> Key: NUTCH-2655
> URL: https://issues.apache.org/jira/browse/NUTCH-2655
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> The Solr schema.xml is not compatible with Solr 7.x which is used by Nutch 
> 1.15. I've tested Solr 7.3.1 and 7.5.0: when using the current schema.xml, 
> Solr fails and complains about unknown field types:
> {noformat}
> 2018-10-15 12:55:24.484 ERROR (qtp102617125-17) [ x:nutch] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error 
> CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: 
> fieldType 'pdates' not found in the schema
> {noformat}





[jira] [Assigned] (NUTCH-2655) Update Solr schema.xml for Solr 7.x

2018-11-14 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2655:
--

Assignee: Sebastian Nagel

> Update Solr schema.xml for Solr 7.x
> ---
>
> Key: NUTCH-2655
> URL: https://issues.apache.org/jira/browse/NUTCH-2655
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> The Solr schema.xml is not compatible with Solr 7.x which is used by Nutch 
> 1.15. I've tested Solr 7.3.1 and 7.5.0: when using the current schema.xml, 
> Solr fails and complains about unknown field types:
> {noformat}
> 2018-10-15 12:55:24.484 ERROR (qtp102617125-17) [ x:nutch] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error 
> CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: 
> fieldType 'pdates' not found in the schema
> {noformat}





[jira] [Commented] (NUTCH-2655) Update Solr schema.xml for Solr 7.x

2018-11-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16686226#comment-16686226
 ] 

ASF GitHub Bot commented on NUTCH-2655:
---

sebastian-nagel closed pull request #395: NUTCH-2655 Update Solr schema.xml for 
Solr 7.x
URL: https://github.com/apache/nutch/pull/395
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/conf/schema.xml b/conf/schema.xml
index 6e7d5bfa5..2b095e59d 100644
--- a/conf/schema.xml
+++ b/conf/schema.xml
@@ -300,6 +300,19 @@
 
 
 
+
+
+
+
+
+
+
+
+
+
+
+
+
 
> Key: NUTCH-2655
> URL: https://issues.apache.org/jira/browse/NUTCH-2655
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> The Solr schema.xml is not compatible with Solr 7.x which is used by Nutch 
> 1.15. I've tested Solr 7.3.1 and 7.5.0: when using the current schema.xml, 
> Solr fails and complains about unknown field types:
> {noformat}
> 2018-10-15 12:55:24.484 ERROR (qtp102617125-17) [ x:nutch] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error 
> CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: 
> fieldType 'pdates' not found in the schema
> {noformat}


