[jira] [Commented] (NUTCH-1842) crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly

2018-10-15 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650022#comment-16650022
 ] 

Sebastian Nagel commented on NUTCH-1842:


[~yossi], I agree that adapting the code to the documentation is the better 
decision. It would be different if the description had been wrong only for a 
short time, but it has now been wrong for 8 years. Are there any objections? 
Otherwise I would merge the PR and also add a warning to the change log.

> crawl.gen.delay has a wrong default value in nutch-default.xml or is being 
> parsed incorrectly 
> --
>
> Key: NUTCH-1842
> URL: https://issues.apache.org/jira/browse/NUTCH-1842
> Project: Nutch
>  Issue Type: Bug
>  Components: generator
>Affects Versions: 1.9
>Reporter: kaveh minooie
>Assignee: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> This is from nutch-default.xml:
> <property>
>   <name>crawl.gen.delay</name>
>   <value>60480</value>
>   <description>
>   This value, expressed in milliseconds, defines how long we should keep the 
>   lock on records in CrawlDb that were just selected for fetching. If these 
>   records are not updated in the meantime, the lock is canceled, i.e. they 
>   become eligible for selecting. Default value of this is 7 days (60480 ms).
>   </description>
> </property>
> This is from o.a.n.crawl.Generator.configure(JobConf job):
> genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;
> The value in the config file is in milliseconds but the code expects it to be 
> in days. I reported this a couple of years ago on the mailing list as well. I 
> didn't post a patch because I am not sure which one needs to be fixed. 
> Considering that all the other values in the config file are in milliseconds, 
> it can be argued that consistency matters, but 'day' is a much more reasonable 
> unit for this property.
> Also, is this value not being used in 2.x?
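
The unit mismatch described above can be made concrete with a small sketch. This is not the Nutch code: the class and method names are hypothetical, and only the multiplication mirrors the quoted line from Generator.configure. It shows what happens when a number meant as milliseconds is run through a conversion that treats it as days.

```java
import java.util.concurrent.TimeUnit;

public class GenDelayUnits {
    /** Mirrors the quoted conversion: the configured number is treated as days. */
    public static long interpretAsDays(long configured) {
        return configured * 3600L * 24L * 1000L;
    }

    /** What the property description promises: the number already is milliseconds. */
    public static long interpretAsMillis(long configured) {
        return configured;
    }

    public static void main(String[] args) {
        // Seven days expressed in milliseconds.
        long sevenDaysMs = TimeUnit.DAYS.toMillis(7); // 604800000
        System.out.println("7 days in ms: " + sevenDaysMs);

        // The code default (7) read as days gives the same lock duration.
        System.out.println("7 read as days, in ms: " + interpretAsDays(7L));

        // A millisecond value read as days inflates the lock by a factor of 86400000.
        long misread = interpretAsDays(sevenDaysMs);
        System.out.println("lock in days if ms are read as days: "
            + TimeUnit.MILLISECONDS.toDays(misread));
    }
}
```

With the days interpretation, any millisecond value placed in the config file would keep the records locked for an effectively unlimited time, which is why the code and the description need to agree on one unit.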



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (NUTCH-2652) Fetcher launches more fetch tasks than fetch lists

2018-10-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2652:
--

 Summary: Fetcher launches more fetch tasks than fetch lists
 Key: NUTCH-2652
 URL: https://issues.apache.org/jira/browse/NUTCH-2652
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 1.15
 Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 
5.15.1, Nutch built on recent master.

Seen for the first time just now, although Nutch 1.15 has been running for two 
months. But the constraints causing inputs to be split may change from run to run.
Reporter: Sebastian Nagel
 Fix For: 1.16


Fetcher may launch more fetcher tasks than there are fetch lists:
{noformat}
18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
{noformat}
This violates a design principle of Nutch as a MapReduce-based crawler: to ensure 
politeness and a guaranteed delay between requests to the same host/domain/ip, 
all items of one host/domain/ip are put by Generator into the same fetch list. 
A fetch list must not be split because that would violate the politeness 
constraints: multiple fetcher tasks processing the splits of one fetch list 
could then send requests to the same host/domain/ip in parallel. See [~ab]'s 
chapter about Nutch in [Hadoop: The Definitive Guide (3rd 
edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].
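
The principle can be illustrated with a simplified stand-in for fetch list generation. The sketch below is an assumption for illustration only, not the Generator's implementation: the class name is hypothetical, it keys on the host alone (not domain or ip), and it assumes absolute URLs. It shows that assigning by host puts every URL of a host into exactly one list, which is only useful if each list is later processed by a single fetcher task.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FetchListPartitioner {
    /** Assigns each URL to one of numLists fetch lists, keyed by host only. */
    public static List<List<String>> partitionByHost(List<String> urls, int numLists) {
        List<List<String>> lists = new ArrayList<>();
        for (int i = 0; i < numLists; i++) {
            lists.add(new ArrayList<>());
        }
        for (String url : urls) {
            String host = URI.create(url).getHost();
            if (host == null) {
                host = url; // fall back to the whole URL if no host can be parsed
            }
            // floorMod keeps the index non-negative for negative hash codes
            int idx = Math.floorMod(host.hashCode(), numLists);
            lists.get(idx).add(url);
        }
        return lists;
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://a.example/1", "https://a.example/2",
            "http://b.example/1", "http://a.example/3");
        List<List<String>> lists = partitionByHost(urls, 4);
        for (int i = 0; i < lists.size(); i++) {
            System.out.println("fetch list " + i + ": " + lists.get(i));
        }
    }
}
```

If the input format later splits one of these lists into several map tasks, the per-host grouping no longer guarantees that only one task talks to a given host, which is the problem reported here.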





[jira] [Commented] (NUTCH-2652) Fetcher launches more fetch tasks than fetch lists

2018-10-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650111#comment-16650111
 ] 

ASF GitHub Bot commented on NUTCH-2652:
---

sebastian-nagel opened a new pull request #394: NUTCH-2652 Fetcher launches 
more fetch tasks than fetch lists
URL: https://github.com/apache/nutch/pull/394
 
 
   - properly override method [getSplits(JobContext context) of 
FileInputFormat](https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#getSplits(org.apache.hadoop.mapreduce.JobContext))
   
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Fetcher launches more fetch tasks than fetch lists
> --
>
> Key: NUTCH-2652
> URL: https://issues.apache.org/jira/browse/NUTCH-2652
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 1.15
> Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 
> 5.15.1, Nutch built on recent master.
> Seen for the first time just now, although Nutch 1.15 has been running for two 
> months. But the constraints causing inputs to be split may change from run to 
> run.
>Reporter: Sebastian Nagel
>Priority: Critical
> Fix For: 1.16
>
>
> Fetcher may launch more fetcher tasks than there are fetch lists:
> {noformat}
> 18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 
> 128
> 18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
> {noformat}
> This violates a design principle of Nutch as a MapReduce-based crawler: to ensure 
> politeness and a guaranteed delay between requests to the same host/domain/ip, 
> all items of one host/domain/ip are put by Generator into the same fetch 
> list. A fetch list must not be split because that would violate the politeness 
> constraints: multiple fetcher tasks processing the splits of one fetch list 
> could then send requests to the same host/domain/ip in parallel. See [~ab]'s 
> chapter about Nutch in [Hadoop: The Definitive Guide (3rd 
> edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].





[jira] [Created] (NUTCH-2653) ProtocolFactory.getProtocol(url) creates separate plugin instances for http/https

2018-10-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2653:
--

 Summary: ProtocolFactory.getProtocol(url) creates separate plugin 
instances for http/https
 Key: NUTCH-2653
 URL: https://issues.apache.org/jira/browse/NUTCH-2653
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher, protocol
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


Fetcher creates two instances of the protocol-okhttp plugin, one to handle http 
requests, another for https. The plugin properties are logged during plugin 
instantiation when calling {{setConf(...)}}:
{noformat}
2018-10-11 13:28:34,417 INFO [FetcherThread] 
org.apache.nutch.fetcher.FetcherThread: FetcherThread 40 fetching http://...
...
2018-10-11 13:28:35,099 INFO [FetcherThread] 
org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.host = null
2018-10-11 13:28:35,100 INFO [FetcherThread] 
org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.port = 8080
...
2018-10-11 13:28:36,864 INFO [FetcherThread] 
org.apache.nutch.fetcher.FetcherThread: FetcherThread 87 fetching https://...
...
2018-10-11 13:28:36,864 INFO [FetcherThread] 
org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.host = null
2018-10-11 13:28:36,864 INFO [FetcherThread] 
org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.port = 8080
{noformat}

The question is whether this is the correct behavior for plugins supporting 
multiple protocols (http and https). It may prevent connection pooling and 
other network optimizations from working as expected. Of course, it is correct if 
different plugins are required, e.g., for ftp or the local file system.

(seen while reviewing the behavior of fetcher with fix for NUTCH-2625 applied)
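
One way to picture the alternative is a cache keyed by plugin rather than by URL scheme. The following is a minimal sketch under stated assumptions, not the ProtocolFactory code: the class, the `Plugin` type, and the scheme-to-plugin table are all hypothetical. It shows http and https resolving to one shared instance while ftp gets its own.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

public class SharedPluginCache {
    /** Hypothetical stand-in for a protocol plugin instance. */
    public static final class Plugin {
        public final String id;
        Plugin(String id) { this.id = id; }
    }

    // Hypothetical table: which plugin handles which URL scheme.
    private static final Map<String, String> SCHEME_TO_PLUGIN = new HashMap<>();
    static {
        SCHEME_TO_PLUGIN.put("http", "protocol-okhttp");
        SCHEME_TO_PLUGIN.put("https", "protocol-okhttp");
        SCHEME_TO_PLUGIN.put("ftp", "protocol-ftp");
    }

    // Instances are cached per plugin id, not per scheme.
    private final Map<String, Plugin> byPluginId = new ConcurrentHashMap<>();
    /** Counts how many plugin instances were actually created. */
    public final AtomicInteger created = new AtomicInteger();

    public Plugin get(String scheme) {
        String pluginId = SCHEME_TO_PLUGIN.get(scheme);
        if (pluginId == null) {
            throw new IllegalArgumentException("no plugin for scheme " + scheme);
        }
        return byPluginId.computeIfAbsent(pluginId, id -> {
            created.incrementAndGet();
            return new Plugin(id);
        });
    }

    public static void main(String[] args) {
        SharedPluginCache cache = new SharedPluginCache();
        System.out.println("http and https share an instance: "
            + (cache.get("http") == cache.get("https")));
        cache.get("ftp");
        System.out.println("instances created: " + cache.created.get());
    }
}
```

Whether sharing is actually safe depends on the plugin holding no scheme-specific state, which this sketch does not address.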






[jira] [Commented] (NUTCH-1377) Add option to index via CloudSolrServer instead

2018-10-15 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650148#comment-16650148
 ] 

Sebastian Nagel commented on NUTCH-1377:


[~roannel], isn't this implemented by selecting the type "cloud" in the 
[indexer-solr 
config|https://wiki.apache.org/nutch/IndexWriters#Solr_indexer_properties]?

> Add option to index via CloudSolrServer instead
> ---
>
> Key: NUTCH-1377
> URL: https://issues.apache.org/jira/browse/NUTCH-1377
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
>Priority: Major
> Attachments: NUTCH-1377-1.8.patch, NUTCH-1377-1.8.patch
>
>
> Nutch indexes to a specific Solr server. With SolrCloud on its way we can 
> still use the current indexer and point to any server. However, 
> CloudSolrServer can connect to ZooKeeper instead and automatically find the 
> correct server to index to.





[jira] [Created] (NUTCH-2654) Remove obsolete index-writer configuration in conf/

2018-10-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2654:
--

 Summary: Remove obsolete index-writer configuration in conf/
 Key: NUTCH-2654
 URL: https://issues.apache.org/jira/browse/NUTCH-2654
 Project: Nutch
  Issue Type: Improvement
  Components: indexer
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


The configuration folder conf/ still contains items that became obsolete with NUTCH-1480:
- properties to configure indexer plugins in nutch-default.xml
- solrindex-mapping.xml (appears to be obsolete)
- elasticsearch.conf (still read)

All obsolete files and properties should be removed.





[jira] [Created] (NUTCH-2655) Update Solr schema.xml for Solr 7.x

2018-10-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2655:
--

 Summary: Update Solr schema.xml for Solr 7.x
 Key: NUTCH-2655
 URL: https://issues.apache.org/jira/browse/NUTCH-2655
 Project: Nutch
  Issue Type: Bug
  Components: indexer, plugin
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


The Solr schema.xml is not compatible with Solr 7.x which is used by Nutch 
1.15. I've tested Solr 7.3.1 and 7.5.0: when using the current schema.xml, Solr 
fails and complains about unknown field types:
{noformat}
2018-10-15 12:55:24.484 ERROR (qtp102617125-17) [ x:nutch] 
o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error 
CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: fieldType 
'pdates' not found in the schema
{noformat}





[jira] [Created] (NUTCH-2656) Update description to configure Solr 7.x in tutorial

2018-10-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2656:
--

 Summary: Update description to configure Solr 7.x in tutorial
 Key: NUTCH-2656
 URL: https://issues.apache.org/jira/browse/NUTCH-2656
 Project: Nutch
  Issue Type: Bug
  Components: documentation
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


(reported by Timeka Cobb, see [discussion on the user mailing 
list|https://lists.apache.org/thread.html/f509e42d845b980a6e6a8130d70dffec8c8f52406908f27f0cf49b20@%3Cuser.nutch.apache.org%3E])
The tutorial's description of how to [set up Solr 6 and 
7|https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search] needs to 
be updated.





[Nutch Wiki] Update of "NutchTutorial" by SebastianNagel

2018-10-15 Thread Apache Wiki
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The "NutchTutorial" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=93&rev2=94

Comment:
NUTCH-2656 Solr setup updated for Solr 7.x

  || 1.13  || 5.5.0  ||
  || 1.12  || 5.4.1  ||
  
- To install Solr:
+ To install Solr 7.x:
   * download binary file from 
[[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
   * unzip to `$HOME/apache-solr`, we will now refer to this as 
`${APACHE_SOLR_HOME}`
-  * create resources for a new nutch solr core `cp -r 
${APACHE_SOLR_HOME}/server/solr/configsets/basic_configs 
${APACHE_SOLR_HOME}/server/solr/configsets/nutch`
+  * create resources for a new nutch solr core {{{
+ mkdir -p ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
+ cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/_default/* 
${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
+ }}}
+  * copy the nutch schema.xml into the `conf` directory {{{
-  * copy the nutch schema.xml into the `conf` directory `cp 
${NUTCH_RUNTIME_HOME}/conf/schema.xml 
${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf`
+ cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml 
${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
-  * make sure that there is no `managed-schema` "in the way": `rm 
${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema`
-  * start the solr server `${APACHE_SOLR_HOME}/bin/solr start`
+ }}}
+ You may try to use the most recent 
[[https://github.com/apache/nutch/blob/master/conf/schema.xml|schema.xml]] in 
case of issues launching Solr with this schema.
+  * make sure that there is no 
[[https://lucene.apache.org/solr/guide/7_5/schema-factory-definition-in-solrconfig.html#SchemaFactoryDefinitioninSolrConfig-SolrUsesManagedSchemabyDefault|managed-schema]]
 "in the way": {{{
+ rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
+ }}}
+  * start the solr server {{{
+ ${APACHE_SOLR_HOME}/bin/solr start
+ }}}
+  * create the nutch core {{{
-  * create the nutch core `${APACHE_SOLR_HOME}/bin/solr create -c nutch -d 
server/solr/configsets/nutch/conf/`
+ ${APACHE_SOLR_HOME}/bin/solr create -c nutch -d 
${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
+ }}}
  
  After that you need to point Nutch to the Solr instance:
   * (Nutch 1.15 and later) edit the file `conf/index-writers.xml`, see 
IndexWriters


[jira] [Commented] (NUTCH-2656) Update description to configure Solr 7.x in tutorial

2018-10-15 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650209#comment-16650209
 ] 

Sebastian Nagel commented on NUTCH-2656:


The tutorial has been updated. Please review, thanks!

> Update description to configure Solr 7.x in tutorial
> 
>
> Key: NUTCH-2656
> URL: https://issues.apache.org/jira/browse/NUTCH-2656
> Project: Nutch
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> (reported by Timeka Cobb, see [discussion on the user mailing 
> list|https://lists.apache.org/thread.html/f509e42d845b980a6e6a8130d70dffec8c8f52406908f27f0cf49b20@%3Cuser.nutch.apache.org%3E])
> The tutorial's description of how to [set up Solr 6 and 
> 7|https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search] needs to 
> be updated.





[jira] [Updated] (NUTCH-2356) Upgrade to Solr 6.x

2018-10-15 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2356:
---
Fix Version/s: 2.5

> Upgrade to Solr 6.x 
> 
>
> Key: NUTCH-2356
> URL: https://issues.apache.org/jira/browse/NUTCH-2356
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer, plugin
>Affects Versions: 2.3.1, 1.12
>Reporter: Cihad Guzel
>Priority: Major
> Fix For: 2.5
>
>
> The Nutch 2.x branch supports Solr 4.6 [1] and the Nutch master branch supports 
> Solr 5.5 [2], according to the ivy.xml of the "indexer-solr" plugin.
> [1] https://github.com/apache/nutch/blob/2.x/src/plugin/indexer-solr/ivy.xml
> [2] 
> https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/ivy.xml
> Nutch should support Solr 6.x





[jira] [Commented] (NUTCH-2613) Documentation for exchange component

2018-10-15 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650219#comment-16650219
 ] 

Sebastian Nagel commented on NUTCH-2613:


Hi [~roannel], lgtm. Thanks!

> Documentation for exchange component
> 
>
> Key: NUTCH-2613
> URL: https://issues.apache.org/jira/browse/NUTCH-2613
> Project: Nutch
>  Issue Type: Task
>  Components: documentation, indexer
>Affects Versions: 1.15
>Reporter: Roannel Fernández Hernández
>Assignee: Roannel Fernández Hernández
>Priority: Major
> Fix For: 1.16
>
>
> After [GitHub Pull Request #340|https://github.com/apache/nutch/pull/340] a 
> NutchTutorial wiki page for exchange component is necessary.





[jira] [Commented] (NUTCH-2356) Upgrade to Solr 6.x

2018-10-15 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650218#comment-16650218
 ] 

Sebastian Nagel commented on NUTCH-2356:


Nutch 1.15 already uses Solr 7.3.1






[jira] [Assigned] (NUTCH-2654) Remove obsolete index-writer configuration in conf/

2018-10-15 Thread Sebastian Nagel (JIRA)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2654:
--

Assignee: Roannel Fernández Hernández

> Remove obsolete index-writer configuration in conf/
> ---
>
> Key: NUTCH-2654
> URL: https://issues.apache.org/jira/browse/NUTCH-2654
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Assignee: Roannel Fernández Hernández
>Priority: Major
> Fix For: 1.16
>
>
> The configuration folder conf/ still contains items that became obsolete with NUTCH-1480:
> - properties to configure indexer plugins in nutch-default.xml
> - solrindex-mapping.xml (appears to be obsolete)
> - elasticsearch.conf (still read)
> All obsolete files and properties should be removed.





[jira] [Commented] (NUTCH-2655) Update Solr schema.xml for Solr 7.x

2018-10-15 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650223#comment-16650223
 ] 

ASF GitHub Bot commented on NUTCH-2655:
---

sebastian-nagel opened a new pull request #395: NUTCH-2655 Update Solr 
schema.xml for Solr 7.x
URL: https://github.com/apache/nutch/pull/395
 
 
   - add required field types to schema.xml
   - tested with Nutch 1.15 and Solr 7.3.1 and 7.5.0




> Update Solr schema.xml for Solr 7.x
> ---
>
> Key: NUTCH-2655
> URL: https://issues.apache.org/jira/browse/NUTCH-2655
> Project: Nutch
>  Issue Type: Bug
>  Components: indexer, plugin
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Major
> Fix For: 1.16
>
>
> The Solr schema.xml is not compatible with Solr 7.x which is used by Nutch 
> 1.15. I've tested Solr 7.3.1 and 7.5.0: when using the current schema.xml, 
> Solr fails and complains about unknown field types:
> {noformat}
> 2018-10-15 12:55:24.484 ERROR (qtp102617125-17) [ x:nutch] 
> o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error 
> CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: 
> fieldType 'pdates' not found in the schema
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (NUTCH-2625) ProtocolFactory.getProtocol(url) may create multiple plugin instances

2018-10-15 Thread Sebastian Nagel (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650276#comment-16650276
 ] 

Sebastian Nagel commented on NUTCH-2625:


Any comments or objections? I have been using it in production for two months 
without any issues, together with protocol-okhttp, which uses a connection pool 
internally.

> ProtocolFactory.getProtocol(url) may create multiple plugin instances
> -
>
> Key: NUTCH-2625
> URL: https://issues.apache.org/jira/browse/NUTCH-2625
> Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>Affects Versions: 1.15
>Reporter: Sebastian Nagel
>Priority: Minor
> Fix For: 1.16
>
>
> The method ProtocolFactory.getProtocol(URL url) may unnecessarily create 
> multiple instances of protocol plugins given the same configuration. The 
> following snippets from a Fetcher using 100 FetcherThreads show that the 
> setConf(conf) method of the protocol-okhttp plugin is called 100 times (once 
> for each thread):
> {noformat}
> 2018-07-12 12:04:32,811 INFO [main] org.apache.nutch.fetcher.FetcherThread: 
> FetcherThread 1 Using queue mode : byHost
> ... (skipped 98 repeated messages)
> 2018-07-12 12:04:33,136 INFO [main] org.apache.nutch.fetcher.FetcherThread: 
> FetcherThread 1 Using queue mode : byHost
> ...
> 2018-07-12 12:04:37,493 INFO [FetcherThread] 
> org.apache.nutch.protocol.RobotRulesParser: robots.txt whitelist not 
> configured.
> 2018-07-12 12:04:37,493 INFO [FetcherThread] 
> org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.host = null
> ...
> 2018-07-12 12:04:37,494 INFO [FetcherThread] 
> org.apache.nutch.protocol.okhttp.OkHttp: http.enable.cookie.header = false
> ... (skipped 98 blocks of repeated messages)
> 2018-07-12 12:04:39,080 INFO [FetcherThread] 
> org.apache.nutch.protocol.RobotRulesParser: robots.txt whitelist not 
> configured.
> 2018-07-12 12:04:39,080 INFO [FetcherThread] 
> org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.host = null
> ...
> 2018-07-12 12:04:39,080 INFO [FetcherThread] 
> org.apache.nutch.protocol.okhttp.OkHttp: http.enable.cookie.header = false
> {noformat}
> The method ProtocolFactory.getProtocol(URL url) is synchronized; however, each 
> FetcherThread holds its own instance of the ProtocolFactory.
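
The last point, that synchronization does not help when every thread has its own factory, can be reproduced in isolation. The sketch below is hypothetical throughout (the `Factory` and `Plugin` classes are stand-ins, not Nutch classes) and does not show the actual patch. It counts how many plugin instances are created when 100 threads each use a private factory versus one shared factory.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class FactoryScopeDemo {
    /** Counts every plugin construction across all factories. */
    public static final AtomicInteger INSTANCES = new AtomicInteger();

    /** Hypothetical plugin; constructing it stands in for calling setConf(conf). */
    public static final class Plugin {
        public Plugin() { INSTANCES.incrementAndGet(); }
    }

    /** Hypothetical factory that caches one plugin per factory instance. */
    public static final class Factory {
        private Plugin cached;

        // synchronized only guards this one factory object
        public synchronized Plugin get() {
            if (cached == null) {
                cached = new Plugin();
            }
            return cached;
        }
    }

    /** Returns the number of plugins created by numThreads threads. */
    public static int run(int numThreads, boolean shareFactory) {
        INSTANCES.set(0);
        Factory shared = new Factory();
        Thread[] threads = new Thread[numThreads];
        for (int i = 0; i < numThreads; i++) {
            Factory factory = shareFactory ? shared : new Factory();
            threads[i] = new Thread(factory::get);
            threads[i].start();
        }
        try {
            for (Thread t : threads) {
                t.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return INSTANCES.get();
    }

    public static void main(String[] args) {
        System.out.println("one factory per thread: " + run(100, false));
        System.out.println("one shared factory:     " + run(100, true));
    }
}
```

With a factory per thread the lock is never contended, so each thread creates its own plugin; only a cache shared across threads reduces the count.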





[jira] [Commented] (NUTCH-2625) ProtocolFactory.getProtocol(url) may create multiple plugin instances

2018-10-15 Thread Markus Jelsma (JIRA)


[ 
https://issues.apache.org/jira/browse/NUTCH-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650282#comment-16650282
 ] 

Markus Jelsma commented on NUTCH-2625:
--

Seems reasonable, +1






[jira] [Created] (NUTCH-2657) Protocol-http to store HTTP response header with "\r\n"

2018-10-15 Thread Sebastian Nagel (JIRA)
Sebastian Nagel created NUTCH-2657:
--

 Summary: Protocol-http to store HTTP response header with "\r\n"
 Key: NUTCH-2657
 URL: https://issues.apache.org/jira/browse/NUTCH-2657
 Project: Nutch
  Issue Type: Improvement
  Components: protocol
Affects Versions: 1.15
Reporter: Sebastian Nagel
 Fix For: 1.16


The plugins protocol-http and protocol-okhttp allow storing the HTTP request 
and/or response headers in the response metadata. However, they are not 
consistent about which line breaks ("\r\n" or "\n") are used between header 
lines and whether there is a trailing second line break at the end of the 
headers: while request headers are stored by both plugins with "\r\n" and two 
trailing "\r\n", the response headers are stored by protocol-http with "\n" and 
a single trailing line break. This is difficult to handle if the headers are 
required to be stored uniformly (I've created such a [nasty bug writing WARC 
files|https://github.com/commoncrawl/nutch/issues/5]).
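
Until both plugins store headers the same way, a consumer could normalize them before use. The helper below is a hypothetical sketch, not part of either plugin; it assumes the target form is the one described for request headers above (every line ends with "\r\n", followed by one empty line) and that the header block contains no meaningful empty lines.

```java
public class HeaderLineBreaks {
    /** Rewrites header text to CRLF line breaks, ending with one empty line. */
    public static String normalize(String headers) {
        // Accept both "\r\n" and bare "\n" as line separators.
        String[] lines = headers.split("\r?\n");
        StringBuilder sb = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip terminator lines; they are re-added below
            }
            sb.append(line).append("\r\n");
        }
        sb.append("\r\n"); // the empty line that ends the header block
        return sb.toString();
    }

    public static void main(String[] args) {
        String lfOnly = "HTTP/1.1 200 OK\nContent-Type: text/html\n";
        String crlf = "HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n";
        System.out.println("same after normalizing: "
            + normalize(lfOnly).equals(normalize(crlf)));
    }
}
```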



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


Re: [jira] [Created] (NUTCH-2657) Protocol-http to store HTTP response header with "\r\n"

2018-10-15 Thread Shaharia Azam
unsubscribe

Regards,
Shaharia Azam
Preview Technologies
Phone: +88 09611 738 439
       +88 02 913 8532
URL: https://www.previewtechs.com

On Mon, 15 Oct 2018 20:28:00 +0600 Sebastian Nagel (JIRA) wrote:

> Sebastian Nagel created NUTCH-2657:
> --
>
>  Summary: Protocol-http to store HTTP response header with "\r\n"
>  Key: NUTCH-2657
>  URL: https://issues.apache.org/jira/browse/NUTCH-2657
>  Project: Nutch
>  Issue Type: Improvement
>  Components: protocol
>  Affects Versions: 1.15
>  Reporter: Sebastian Nagel
>  Fix For: 1.16
>
> The plugins protocol-http and protocol-okhttp allow to store the HTTP request 
> and/or response headers in the response metadata. However, there is no 
> consensus which line breaks ("\r\n" or "\n") are used between header lines and 
> whether there is a trailing second line break at the end of the headers: while 
> request headers are stored by both plugins with "\r\n" and two trailing 
> "\r\n", the response headers are stored by protocol-http with "\n" and a single 
> trailing line break. This is difficult to handle if the headers are required to 
> be stored uniformly (I've created such a [nasty bug writing WARC 
> files|https://github.com/commoncrawl/nutch/issues/5]).
>
> --
> This message was sent by Atlassian JIRA
> (v7.6.3#76005)