[jira] [Commented] (NUTCH-1842) crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly
[ https://issues.apache.org/jira/browse/NUTCH-1842?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650022#comment-16650022 ]

Sebastian Nagel commented on NUTCH-1842:

[~yossi], I agree that adapting the code to the documentation is the better decision. It would be different if the description had been wrong only for a short time, but it has now been wrong for 8 years.

Are there any objections? Otherwise I would merge the PR and also add a warning to the change log.

> crawl.gen.delay has a wrong default value in nutch-default.xml or is being parsed incorrectly
> ---------------------------------------------------------------------------------------------
>
> Key: NUTCH-1842
> URL: https://issues.apache.org/jira/browse/NUTCH-1842
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 1.9
> Reporter: kaveh minooie
> Assignee: Sebastian Nagel
> Priority: Minor
> Fix For: 1.16
>
> This is from nutch-default.xml:
> <property>
>   <name>crawl.gen.delay</name>
>   <value>60480</value>
>   <description>
>    This value, expressed in milliseconds, defines how long we should keep the lock on records
>    in CrawlDb that were just selected for fetching. If these records are not updated
>    in the meantime, the lock is canceled, i.e. they become eligible for selecting.
>    Default value of this is 7 days (60480 ms).
>   </description>
> </property>
> This is from o.a.n.crawl.Generator.configure(JobConf job):
> genDelay = job.getLong(GENERATOR_DELAY, 7L) * 3600L * 24L * 1000L;
> The value in the config file is in milliseconds, but the code expects it to be in days. I reported this a couple of years ago on the mailing list as well. I didn't post a patch because I am not sure which one needs to be fixed. Considering that all the other values in the config file are in milliseconds, it can be argued that consistency matters, but 'day' is a much more reasonable unit for this property.
> Also, is this value not used in 2.x?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
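To make the unit mismatch in NUTCH-1842 concrete, here is a minimal arithmetic sketch. It only mirrors the quoted expression from Generator; the class and method names are hypothetical and nothing here is the actual Nutch code or the merged fix. It also shows that 7 days is 604800000 ms, so the documented "60480 ms" is off even as a millisecond value.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical illustration class, not part of Nutch.
class GenDelayUnits {

    /** Mirrors the quoted expression: the configured value is multiplied as if it were days. */
    static long genDelayMillisFromDays(long configuredValue) {
        return configuredValue * 3600L * 24L * 1000L;
    }

    public static void main(String[] args) {
        // What "7 days" really is in milliseconds.
        long sevenDaysMs = TimeUnit.DAYS.toMillis(7);
        System.out.println("7 days in ms: " + sevenDaysMs);

        // The documented default, read literally as milliseconds, is about one minute.
        long documented = 60480L;
        System.out.println("60480 ms in seconds: " + documented / 1000.0);

        // The same value fed through the quoted code path is treated as a number of days.
        long interpreted = genDelayMillisFromDays(documented);
        System.out.println("60480 read as days, in ms: " + interpreted);
        System.out.println("which is this many days: " + TimeUnit.MILLISECONDS.toDays(interpreted));
    }
}
```

Whichever side is changed, anyone who copied the documented value into their own configuration would see a very different lock duration afterwards, which is presumably why a change-log warning is proposed.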
[jira] [Created] (NUTCH-2652) Fetcher launches more fetch tasks than fetch lists
Sebastian Nagel created NUTCH-2652:
--------------------------------------

Summary: Fetcher launches more fetch tasks than fetch lists
Key: NUTCH-2652
URL: https://issues.apache.org/jira/browse/NUTCH-2652
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.15
Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 5.15.1, Nutch built on recent master. Seen for the first time just now, although this setup has been running for two months with Nutch 1.15. However, the constraints that cause inputs to be split may change from run to run.
Reporter: Sebastian Nagel
Fix For: 1.16

Fetcher may launch more fetcher tasks than there are fetch lists:
{noformat}
18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
{noformat}
This breaks a design principle of Nutch as a MapReduce-based crawler: to ensure politeness and a guaranteed delay between requests to the same host/domain/IP, Generator puts all items of one host/domain/IP into the same fetch list. A fetch list must not be split, because that would violate the politeness constraints: multiple fetcher tasks processing the splits of one fetch list may then send requests to the same host/domain/IP in parallel. See [~ab]'s chapter about Nutch in [Hadoop the definitive guide (3rd edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].
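The politeness argument can be illustrated with a small stand-alone model. This is only a sketch in plain Java with hypothetical names and sample URLs; it does not use Hadoop's FileInputFormat and is not the change proposed for this issue. It compares "one split per fetch list" with naive size-based splitting and reports which hosts end up in more than one split, i.e. could be fetched by several tasks in parallel.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model class, not part of Nutch or Hadoop.
class FetchListSplits {

    /** One split per fetch list: the number of tasks equals the number of lists. */
    static List<List<String>> wholeListSplits(List<List<String>> fetchLists) {
        return new ArrayList<>(fetchLists);
    }

    /** Naive size-based splitting: lists longer than maxItems (must be > 0) are cut into pieces. */
    static List<List<String>> sizeBasedSplits(List<List<String>> fetchLists, int maxItems) {
        List<List<String>> splits = new ArrayList<>();
        for (List<String> list : fetchLists) {
            for (int start = 0; start < list.size(); start += maxItems) {
                int end = Math.min(start + maxItems, list.size());
                splits.add(new ArrayList<>(list.subList(start, end)));
            }
        }
        return splits;
    }

    /** Hosts that occur in more than one split and could therefore be hit by parallel tasks. */
    static Set<String> hostsInMultipleSplits(List<List<String>> splits) {
        Map<String, Integer> splitCountByHost = new HashMap<>();
        for (List<String> split : splits) {
            Set<String> hosts = new HashSet<>();
            for (String url : split) {
                hosts.add(URI.create(url).getHost());
            }
            for (String host : hosts) {
                splitCountByHost.merge(host, 1, Integer::sum);
            }
        }
        Set<String> result = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : splitCountByHost.entrySet()) {
            if (entry.getValue() > 1) {
                result.add(entry.getKey());
            }
        }
        return result;
    }

    /** Made-up sample data: each inner list holds all URLs of one host. */
    static List<List<String>> sampleFetchLists() {
        return Arrays.asList(
            Arrays.asList("http://a.example/1", "http://a.example/2", "http://a.example/3"),
            Arrays.asList("http://b.example/1"));
    }

    public static void main(String[] args) {
        List<List<String>> lists = sampleFetchLists();
        System.out.println("whole lists: " + hostsInMultipleSplits(wholeListSplits(lists)));
        System.out.println("size-based:  " + hostsInMultipleSplits(sizeBasedSplits(lists, 2)));
    }
}
```

In the log excerpt above, 128 input paths producing 187 splits corresponds to the size-based case: at least some fetch lists were cut, so their hosts are no longer confined to a single task.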
[jira] [Commented] (NUTCH-2652) Fetcher launches more fetch tasks than fetch lists
[ https://issues.apache.org/jira/browse/NUTCH-2652?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650111#comment-16650111 ]

ASF GitHub Bot commented on NUTCH-2652:
---------------------------------------

sebastian-nagel opened a new pull request #394: NUTCH-2652 Fetcher launches more fetch tasks than fetch lists
URL: https://github.com/apache/nutch/pull/394

- properly override method [getSplits(JobContext context) of FileInputFormat](https://hadoop.apache.org/docs/r2.8.5/api/org/apache/hadoop/mapreduce/lib/input/FileInputFormat.html#getSplits(org.apache.hadoop.mapreduce.JobContext))

This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org

> Fetcher launches more fetch tasks than fetch lists
> --------------------------------------------------
>
> Key: NUTCH-2652
> URL: https://issues.apache.org/jira/browse/NUTCH-2652
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Affects Versions: 1.15
> Environment: Hadoop, distributed mode (cluster of 22 nodes), CDH 5.15.1, Nutch built on recent master. Seen for the first time just now, although this setup has been running for two months with Nutch 1.15. However, the constraints that cause inputs to be split may change from run to run.
> Reporter: Sebastian Nagel
> Priority: Critical
> Fix For: 1.16
>
> Fetcher may launch more fetcher tasks than there are fetch lists:
> {noformat}
> 18/10/15 07:27:26 INFO input.FileInputFormat: Total input paths to process : 128
> 18/10/15 07:27:26 INFO mapreduce.JobSubmitter: number of splits:187
> {noformat}
> This breaks a design principle of Nutch as a MapReduce-based crawler: to ensure politeness and a guaranteed delay between requests to the same host/domain/IP, Generator puts all items of one host/domain/IP into the same fetch list. A fetch list must not be split, because that would violate the politeness constraints: multiple fetcher tasks processing the splits of one fetch list may then send requests to the same host/domain/IP in parallel. See [~ab]'s chapter about Nutch in [Hadoop the definitive guide (3rd edition)|https://www.safaribooksonline.com/library/view/hadoop-the-definitive/9781449328917/ch16.html#NutchFetcher].
[jira] [Created] (NUTCH-2653) ProtocolFactory.getProtocol(url) creates separate plugin instances for http/https
Sebastian Nagel created NUTCH-2653:
--------------------------------------

Summary: ProtocolFactory.getProtocol(url) creates separate plugin instances for http/https
Key: NUTCH-2653
URL: https://issues.apache.org/jira/browse/NUTCH-2653
Project: Nutch
Issue Type: Improvement
Components: fetcher, protocol
Affects Versions: 1.15
Reporter: Sebastian Nagel
Fix For: 1.16

Fetcher creates two instances of the protocol-okhttp plugin, one to handle http requests and another for https. The plugin properties are logged during plugin instantiation when calling {{setConf(...)}}:
{noformat}
2018-10-11 13:28:34,417 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: FetcherThread 40 fetching http://...
...
2018-10-11 13:28:35,099 INFO [FetcherThread] org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.host = null
2018-10-11 13:28:35,100 INFO [FetcherThread] org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.port = 8080
...
2018-10-11 13:28:36,864 INFO [FetcherThread] org.apache.nutch.fetcher.FetcherThread: FetcherThread 87 fetching https://...
...
2018-10-11 13:28:36,864 INFO [FetcherThread] org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.host = null
2018-10-11 13:28:36,864 INFO [FetcherThread] org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.port = 8080
{noformat}
The question is whether this is the correct behavior for plugins that support multiple protocols (http and https). It may prevent connection pooling and other network optimizations from working as expected. Of course, separate instances are correct if different plugins are required, e.g., for ftp or the local file system.

(Seen while reviewing the behavior of the fetcher with the fix for NUTCH-2625 applied.)
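One way to picture the alternative raised in NUTCH-2653 is a cache keyed by plugin id instead of by URL scheme. The following is a rough sketch under that assumption; the class, the scheme-to-plugin mapping and the plugin ids used in main are illustrative only and do not reflect how ProtocolFactory is actually implemented.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the Nutch ProtocolFactory.
class ProtocolCacheSketch {

    /** Stand-in for a protocol plugin instance. */
    static final class ProtocolPlugin {
        final String id;
        ProtocolPlugin(String id) { this.id = id; }
    }

    private final Map<String, String> schemeToPluginId;
    private final Map<String, ProtocolPlugin> instancesByPluginId = new HashMap<>();
    private int instantiations = 0;

    ProtocolCacheSketch(Map<String, String> schemeToPluginId) {
        this.schemeToPluginId = new HashMap<>(schemeToPluginId);
    }

    /** Looks up the plugin for a scheme, reusing an instance if the same plugin serves several schemes. */
    synchronized ProtocolPlugin getProtocol(String scheme) {
        String pluginId = schemeToPluginId.get(scheme);
        if (pluginId == null) {
            throw new IllegalArgumentException("No plugin configured for scheme: " + scheme);
        }
        return instancesByPluginId.computeIfAbsent(pluginId, id -> {
            instantiations++;
            return new ProtocolPlugin(id);
        });
    }

    synchronized int instantiations() {
        return instantiations;
    }

    /** Illustrative mapping: one plugin for http and https, a different one for ftp. */
    static ProtocolCacheSketch sample() {
        Map<String, String> mapping = new HashMap<>();
        mapping.put("http", "protocol-okhttp");
        mapping.put("https", "protocol-okhttp");
        mapping.put("ftp", "protocol-ftp");
        return new ProtocolCacheSketch(mapping);
    }

    public static void main(String[] args) {
        ProtocolCacheSketch cache = sample();
        System.out.println("same instance for http/https: "
            + (cache.getProtocol("http") == cache.getProtocol("https")));
        System.out.println("ftp uses another instance: "
            + (cache.getProtocol("ftp") != cache.getProtocol("http")));
        System.out.println("instantiations: " + cache.instantiations());
    }
}
```

With a single instance per plugin, an internal connection pool would be shared between http and https requests; whether that is desirable for every plugin is exactly the open question of this issue.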
[jira] [Commented] (NUTCH-1377) Add option to index via CloudSolrServer instead
[ https://issues.apache.org/jira/browse/NUTCH-1377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650148#comment-16650148 ] Sebastian Nagel commented on NUTCH-1377: [~roannel], isn't this implemented by selecting the type "cloud" in the [indexer-solr config|https://wiki.apache.org/nutch/IndexWriters#Solr_indexer_properties]? > Add option to index via CloudSolrServer instead > --- > > Key: NUTCH-1377 > URL: https://issues.apache.org/jira/browse/NUTCH-1377 > Project: Nutch > Issue Type: Improvement > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Major > Attachments: NUTCH-1377-1.8.patch, NUTCH-1377-1.8.patch > > > Nutch indexes to a specific Solr server. With SolrCloud on its way we can > still use the current indexer and point to any server. However, the > SolrCloudServer can connect to ZooKeeper instead and automatically find the > correct server to index to. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2654) Remove obsolete index-writer configuration in conf/
Sebastian Nagel created NUTCH-2654:
--------------------------------------

Summary: Remove obsolete index-writer configuration in conf/
Key: NUTCH-2654
URL: https://issues.apache.org/jira/browse/NUTCH-2654
Project: Nutch
Issue Type: Improvement
Components: indexer
Affects Versions: 1.15
Reporter: Sebastian Nagel
Fix For: 1.16

The configuration folder conf/ still contains items that became obsolete with NUTCH-1480:
- properties to configure indexer plugins in nutch-default.xml
- solrindex-mapping.xml (appears to be obsolete)
- elasticsearch.conf (still read)

All obsolete files and properties should be removed.
[jira] [Created] (NUTCH-2655) Update Solr schema.xml for Solr 7.x
Sebastian Nagel created NUTCH-2655: -- Summary: Update Solr schema.xml for Solr 7.x Key: NUTCH-2655 URL: https://issues.apache.org/jira/browse/NUTCH-2655 Project: Nutch Issue Type: Bug Components: indexer, plugin Affects Versions: 1.15 Reporter: Sebastian Nagel Fix For: 1.16 The Solr schema.xml is not compatible with Solr 7.x which is used by Nutch 1.15. I've tested Solr 7.3.1 and 7.5.0: when using the current schema.xml, Solr fails and complains about unknown field types: {noformat} 2018-10-15 12:55:24.484 ERROR (qtp102617125-17) [ x:nutch] o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: fieldType 'pdates' not found in the schema {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (NUTCH-2656) Update description to configure Solr 7.x in tutorial
Sebastian Nagel created NUTCH-2656:
--------------------------------------

Summary: Update description to configure Solr 7.x in tutorial
Key: NUTCH-2656
URL: https://issues.apache.org/jira/browse/NUTCH-2656
Project: Nutch
Issue Type: Bug
Components: documentation
Affects Versions: 1.15
Reporter: Sebastian Nagel
Fix For: 1.16

(reported by Timeka Cobb, see [discussion on the user mailing list|https://lists.apache.org/thread.html/f509e42d845b980a6e6a8130d70dffec8c8f52406908f27f0cf49b20@%3Cuser.nutch.apache.org%3E])

The description in the tutorial of how to [set up Solr 6 and 7|https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search] needs to be updated.
[Nutch Wiki] Update of "NutchTutorial" by SebastianNagel
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change notification.

The "NutchTutorial" page has been changed by SebastianNagel:
https://wiki.apache.org/nutch/NutchTutorial?action=diff&rev1=93&rev2=94

Comment:
NUTCH-2656 Solr setup updated for Solr 7.x

  || 1.13 || 5.5.0 ||
  || 1.12 || 5.4.1 ||

- To install Solr:
+ To install Solr 7.x:

   * download binary file from [[http://www.apache.org/dyn/closer.cgi/lucene/solr/|here]]
   * unzip to `$HOME/apache-solr`, we will now refer to this as `${APACHE_SOLR_HOME}`
-  * create resources for a new nutch solr core `cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/basic_configs ${APACHE_SOLR_HOME}/server/solr/configsets/nutch`
+  * create resources for a new nutch solr core {{{
+ mkdir -p ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
+ cp -r ${APACHE_SOLR_HOME}/server/solr/configsets/_default/* ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/
+ }}}
+  * copy the nutch schema.xml into the `conf` directory {{{
-  * copy the nutch schema.xml into the `conf` directory `cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf`
+ cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
-  * make sure that there is no `managed-schema` "in the way": `rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema`
-  * start the solr server `${APACHE_SOLR_HOME}/bin/solr start`
+ }}}
+ You may try to use the most recent [[https://github.com/apache/nutch/blob/master/conf/schema.xml|schema.xml]] in case of issues launching Solr with this schema.
+  * make sure that there is no [[https://lucene.apache.org/solr/guide/7_5/schema-factory-definition-in-solrconfig.html#SchemaFactoryDefinitioninSolrConfig-SolrUsesManagedSchemabyDefault|managed-schema]] "in the way": {{{
+ rm ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/managed-schema
+ }}}
+  * start the solr server {{{
+ ${APACHE_SOLR_HOME}/bin/solr start
+ }}}
+  * create the nutch core {{{
-  * create the nutch core `${APACHE_SOLR_HOME}/bin/solr create -c nutch -d server/solr/configsets/nutch/conf/`
+ ${APACHE_SOLR_HOME}/bin/solr create -c nutch -d ${APACHE_SOLR_HOME}/server/solr/configsets/nutch/conf/
+ }}}

  After that you need to point Nutch to the Solr instance:

   * (Nutch 1.15 and later) edit the file `conf/index-writers.xml`, see IndexWriters
[jira] [Commented] (NUTCH-2656) Update description to configure Solr 7.x in tutorial
[ https://issues.apache.org/jira/browse/NUTCH-2656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650209#comment-16650209 ]

Sebastian Nagel commented on NUTCH-2656:
----------------------------------------

The tutorial has been updated. Please review, thanks!

> Update description to configure Solr 7.x in tutorial
> ----------------------------------------------------
>
> Key: NUTCH-2656
> URL: https://issues.apache.org/jira/browse/NUTCH-2656
> Project: Nutch
> Issue Type: Bug
> Components: documentation
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Major
> Fix For: 1.16
>
> (reported by Timeka Cobb, see [discussion on the user mailing list|https://lists.apache.org/thread.html/f509e42d845b980a6e6a8130d70dffec8c8f52406908f27f0cf49b20@%3Cuser.nutch.apache.org%3E])
> The description in the tutorial of how to [set up Solr 6 and 7|https://wiki.apache.org/nutch/NutchTutorial#Setup_Solr_for_search] needs to be updated.
[jira] [Updated] (NUTCH-2356) Upgrade to Solr 6.x
[ https://issues.apache.org/jira/browse/NUTCH-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel updated NUTCH-2356:
-----------------------------------
Fix Version/s: 2.5

> Upgrade to Solr 6.x
> -------------------
>
> Key: NUTCH-2356
> URL: https://issues.apache.org/jira/browse/NUTCH-2356
> Project: Nutch
> Issue Type: Improvement
> Components: indexer, plugin
> Affects Versions: 2.3.1, 1.12
> Reporter: Cihad Guzel
> Priority: Major
> Fix For: 2.5
>
> The Nutch 2.x branch supports Solr 4.6 [1] and the Nutch master branch supports Solr 5.5 [2], according to the ivy.xml of the "solr-indexer" plugin.
> [1] https://github.com/apache/nutch/blob/2.x/src/plugin/indexer-solr/ivy.xml
> [2] https://github.com/apache/nutch/blob/master/src/plugin/indexer-solr/ivy.xml
> Nutch should support Solr 6.x.
[jira] [Commented] (NUTCH-2613) Documentation for exchange component
[ https://issues.apache.org/jira/browse/NUTCH-2613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650219#comment-16650219 ] Sebastian Nagel commented on NUTCH-2613: Hi [~roannel], lgtm. Thanks! > Documentation for exchange component > > > Key: NUTCH-2613 > URL: https://issues.apache.org/jira/browse/NUTCH-2613 > Project: Nutch > Issue Type: Task > Components: documentation, indexer >Affects Versions: 1.15 >Reporter: Roannel Fernández Hernández >Assignee: Roannel Fernández Hernández >Priority: Major > Fix For: 1.16 > > > After [GitHub Pull Request #340|https://github.com/apache/nutch/pull/340] a > NutchTutorial wiki page for exchange component is necessary. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2356) Upgrade to Solr 6.x
[ https://issues.apache.org/jira/browse/NUTCH-2356?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650218#comment-16650218 ]

Sebastian Nagel commented on NUTCH-2356:
----------------------------------------

Nutch 1.15 already uses Solr 7.3.1
[jira] [Assigned] (NUTCH-2654) Remove obsolete index-writer configuration in conf/
[ https://issues.apache.org/jira/browse/NUTCH-2654?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sebastian Nagel reassigned NUTCH-2654:
--------------------------------------
Assignee: Roannel Fernández Hernández

> Remove obsolete index-writer configuration in conf/
> ---------------------------------------------------
>
> Key: NUTCH-2654
> URL: https://issues.apache.org/jira/browse/NUTCH-2654
> Project: Nutch
> Issue Type: Improvement
> Components: indexer
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Assignee: Roannel Fernández Hernández
> Priority: Major
> Fix For: 1.16
>
> The configuration folder conf/ still contains items that became obsolete with NUTCH-1480:
> - properties to configure indexer plugins in nutch-default.xml
> - solrindex-mapping.xml (appears to be obsolete)
> - elasticsearch.conf (still read)
> All obsolete files and properties should be removed.
[jira] [Commented] (NUTCH-2655) Update Solr schema.xml for Solr 7.x
[ https://issues.apache.org/jira/browse/NUTCH-2655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650223#comment-16650223 ] ASF GitHub Bot commented on NUTCH-2655: --- sebastian-nagel opened a new pull request #395: NUTCH-2655 Update Solr schema.xml for Solr 7.x URL: https://github.com/apache/nutch/pull/395 - add required field types to schema.xml - tested with Nutch 1.15 and Solr 7.3.1 and 7.5.0 This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org > Update Solr schema.xml for Solr 7.x > --- > > Key: NUTCH-2655 > URL: https://issues.apache.org/jira/browse/NUTCH-2655 > Project: Nutch > Issue Type: Bug > Components: indexer, plugin >Affects Versions: 1.15 >Reporter: Sebastian Nagel >Priority: Major > Fix For: 1.16 > > > The Solr schema.xml is not compatible with Solr 7.x which is used by Nutch > 1.15. I've tested Solr 7.3.1 and 7.5.0: when using the current schema.xml, > Solr fails and complains about unknown field types: > {noformat} > 2018-10-15 12:55:24.484 ERROR (qtp102617125-17) [ x:nutch] > o.a.s.h.RequestHandlerBase org.apache.solr.common.SolrException: Error > CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: > fieldType 'pdates' not found in the schema > {noformat} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (NUTCH-2625) ProtocolFactory.getProtocol(url) may create multiple plugin instances
[ https://issues.apache.org/jira/browse/NUTCH-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650276#comment-16650276 ]

Sebastian Nagel commented on NUTCH-2625:
----------------------------------------

Any comments or objections? I have been using it in production for two months without any issues, together with protocol-okhttp, which uses a connection pool internally.

> ProtocolFactory.getProtocol(url) may create multiple plugin instances
> ---------------------------------------------------------------------
>
> Key: NUTCH-2625
> URL: https://issues.apache.org/jira/browse/NUTCH-2625
> Project: Nutch
> Issue Type: Improvement
> Components: protocol
> Affects Versions: 1.15
> Reporter: Sebastian Nagel
> Priority: Minor
> Fix For: 1.16
>
> The method ProtocolFactory.getProtocol(URL url) may unnecessarily create multiple instances of protocol plugins given the same configuration. The following snippets from a Fetcher using 100 FetcherThreads show that the setConf(conf) method of the protocol-okhttp plugin is called 100 times (once for each thread):
> {noformat}
> 2018-07-12 12:04:32,811 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
> ... (skipped 98 repeated messages)
> 2018-07-12 12:04:33,136 INFO [main] org.apache.nutch.fetcher.FetcherThread: FetcherThread 1 Using queue mode : byHost
> ...
> 2018-07-12 12:04:37,493 INFO [FetcherThread] org.apache.nutch.protocol.RobotRulesParser: robots.txt whitelist not configured.
> 2018-07-12 12:04:37,493 INFO [FetcherThread] org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.host = null
> ...
> 2018-07-12 12:04:37,494 INFO [FetcherThread] org.apache.nutch.protocol.okhttp.OkHttp: http.enable.cookie.header = false
> ... (skipped 98 blocks of repeated messages)
> 2018-07-12 12:04:39,080 INFO [FetcherThread] org.apache.nutch.protocol.RobotRulesParser: robots.txt whitelist not configured.
> 2018-07-12 12:04:39,080 INFO [FetcherThread] org.apache.nutch.protocol.okhttp.OkHttp: http.proxy.host = null
> ...
> 2018-07-12 12:04:39,080 INFO [FetcherThread] org.apache.nutch.protocol.okhttp.OkHttp: http.enable.cookie.header = false
> {noformat}
> The method ProtocolFactory.getProtocol(URL url) is synchronized; however, each FetcherThread holds its own instance of the ProtocolFactory.
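The last sentence of the description explains why a synchronized method does not help: each thread locks its own factory. As a hedged illustration of the general idea of sharing one cache across threads (an assumption for this sketch, not a description of the actual patch under review), a cache based on ConcurrentHashMap.computeIfAbsent creates each plugin at most once however many threads ask for it. All names below are hypothetical.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch, not Nutch code.
class SharedProtocolCache {

    /** Stand-in for a protocol plugin instance. */
    static final class Plugin {
        final String id;
        Plugin(String id) { this.id = id; }
    }

    private final ConcurrentHashMap<String, Plugin> instances = new ConcurrentHashMap<>();
    private final AtomicInteger instantiations = new AtomicInteger();

    /** computeIfAbsent applies the mapping function at most once per key, even under contention. */
    Plugin get(String pluginId) {
        return instances.computeIfAbsent(pluginId, id -> {
            instantiations.incrementAndGet();
            return new Plugin(id);
        });
    }

    int instantiations() {
        return instantiations.get();
    }

    /** Starts threadCount threads that all request the same plugin from one shared cache. */
    static int instantiationsWithThreads(int threadCount) {
        SharedProtocolCache cache = new SharedProtocolCache();
        Thread[] threads = new Thread[threadCount];
        for (int i = 0; i < threadCount; i++) {
            threads[i] = new Thread(() -> cache.get("protocol-okhttp"));
            threads[i].start();
        }
        for (Thread thread : threads) {
            try {
                thread.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException(e);
            }
        }
        return cache.instantiations();
    }

    public static void main(String[] args) {
        System.out.println("instantiations with 100 threads: " + instantiationsWithThreads(100));
    }
}
```

The design point is only that the cache must be shared between threads; giving each thread its own cache object reproduces the one-instance-per-thread behavior seen in the log, however the lookup method itself is synchronized.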
[jira] [Commented] (NUTCH-2625) ProtocolFactory.getProtocol(url) may create multiple plugin instances
[ https://issues.apache.org/jira/browse/NUTCH-2625?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16650282#comment-16650282 ]

Markus Jelsma commented on NUTCH-2625:
--------------------------------------

Seems reasonable, +1
[jira] [Created] (NUTCH-2657) Protocol-http to store HTTP response header with "\r\n"
Sebastian Nagel created NUTCH-2657:
--------------------------------------

Summary: Protocol-http to store HTTP response header with "\r\n"
Key: NUTCH-2657
URL: https://issues.apache.org/jira/browse/NUTCH-2657
Project: Nutch
Issue Type: Improvement
Components: protocol
Affects Versions: 1.15
Reporter: Sebastian Nagel
Fix For: 1.16

The plugins protocol-http and protocol-okhttp allow storing the HTTP request and/or response headers in the response metadata. However, they are inconsistent about which line breaks ("\r\n" or "\n") are used between header lines and whether there is a trailing second line break at the end of the headers: request headers are stored by both plugins with "\r\n" and two trailing "\r\n", whereas response headers are stored by protocol-http with "\n" and a single trailing line break. This is difficult to handle if the headers are required to be stored uniformly (I introduced one such [nasty bug when writing WARC files|https://github.com/commoncrawl/nutch/issues/5]).
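For consumers that have to cope with both variants in the meantime, a small normalizer can bring either form to the "\r\n" convention with two trailing "\r\n" described above for request headers. This is a minimal sketch with a hypothetical class name; it is not the change proposed for protocol-http, and it assumes a header block contains no blank lines before its terminator.

```java
// Hypothetical helper, not part of Nutch.
class HeaderLineBreaks {

    /**
     * Rewrites a header block so that every header line ends with "\r\n"
     * and the block is terminated by one additional "\r\n".
     * Accepts input using "\n" or "\r\n", with or without trailing line breaks.
     */
    static String normalize(String rawHeaders) {
        String[] lines = rawHeaders.split("\r?\n", -1);
        StringBuilder normalized = new StringBuilder();
        for (String line : lines) {
            if (line.isEmpty()) {
                continue; // skip the empty pieces produced by trailing line breaks
            }
            normalized.append(line).append("\r\n");
        }
        normalized.append("\r\n"); // blank line that ends the header block
        return normalized.toString();
    }

    public static void main(String[] args) {
        String lfOnly = "HTTP/1.1 200 OK\nContent-Type: text/html\n";
        String result = normalize(lfOnly);
        // Print with visible escapes so the line breaks can be inspected.
        System.out.println(result.replace("\r", "\\r").replace("\n", "\\n"));
    }
}
```

The function is idempotent, so it can be applied to headers from either plugin without first checking which convention they use.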
Re: [jira] [Created] (NUTCH-2657) Protocol-http to store HTTP response header with "\r\n"
unsubscribe

Regards,
Shaharia Azam
Preview Technologies
Phone: +88 09611 738 439 + 88 02 913 8532
URL: https://www.previewtechs.com