[jira] [Commented] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711616#comment-13711616 ] Markus Jelsma commented on NUTCH-1228: -- Yes, but this is dependent on upgrading to the new Hadoop MapReduce API vs the current mapred one. I made some modifications but stumbled on some major problems. We should not fix this issue until we're sure the new MapReduce API has all the features. I remember issues with the MapFile APIs etc. I think I left some comments about that here and on the (unanswered) Hadoop list. > Change mapred.task.timeout to mapreduce.task.timeout in fetcher > --- > > Key: NUTCH-1228 > URL: https://issues.apache.org/jira/browse/NUTCH-1228 > Project: Nutch > Issue Type: Task > Components: fetcher >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.9 > > Attachments: NUTCH-1228-2.1.patch > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
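The change itself is a one-line property rename. Assuming the standard Hadoop configuration syntax, the new key would be set roughly like this in nutch-site.xml; this is a sketch only, and the 10-minute value is illustrative, not taken from the attached patch:

```xml
<!-- Sketch only: the property name follows the new Hadoop MapReduce API;
     the value (10 minutes, in milliseconds) is illustrative. -->
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>
  <description>Milliseconds before a task is considered hung and killed.
  The fetcher may need this raised so long fetch cycles are not killed
  prematurely.</description>
</property>
```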
[jira] [Commented] (NUTCH-1612) Getting URl Malformed exception with Nutch 2.2 and Hadoop 1.0.3
[ https://issues.apache.org/jira/browse/NUTCH-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711612#comment-13711612 ] Markus Jelsma commented on NUTCH-1612: -- We tried 1.0.2 and had a miserable time there. 1.2.0 fixed all major issues with Hadoop. > Getting URl Malformed exception with Nutch 2.2 and Hadoop 1.0.3 > --- > > Key: NUTCH-1612 > URL: https://issues.apache.org/jira/browse/NUTCH-1612 > Project: Nutch > Issue Type: Bug > Environment: Ubuntu 64 bit, nutch 2.2, hadoop 1.0.3,hbase-0.90.3 >Reporter: Amit Yadav > > When I start crawling using bin/crawl I am getting "URLMalfomed Exception". > I am using Hbase as data store. I can see that the WebTable is created in the > Hbase. > I am able to run the same in local mode. > Any help on this would be appreciable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711529#comment-13711529 ] Ferdy Galema commented on NUTCH-1457: - Ok cool. Like Lewis said it would be best to create patches that we can apply to the trunk codebase, so that there can be no misconceptions when committing the changes. Thanks. > Nutch2 Refactor the update process so that fetched items are only processed > once > > > Key: NUTCH-1457 > URL: https://issues.apache.org/jira/browse/NUTCH-1457 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: 2.4 > > Attachments: CrawlStatus.java, DbUpdateReducer.java, > GeneratorMapper.java, GeneratorReducer.java > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510 ] Riyaz Shaik edited comment on NUTCH-1457 at 7/17/13 7:34 PM: - Hi Ferdy, The scenario mentioned below will not occur: *although there might be a problem with code that assumes STATUS_FETCHED, for example the ParserJob: It only processes STATUS_FETCHED entries. There may be more dependencies.* Since we do not set the *??GENERATE_MARK??* for URLs whose *??fetchtime > currentTime??* in GeneratorReducer, those URLs will not be processed in the Fetcher/Parser jobs. One drawback of this solution (UNSCHEDULED status/mark in GeneratorMapper) could be that we update a few columns of all the URLs (SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce ??GeneratorReducer?? performance. We have made the changes you suggested (instead of an UNSCHEDULED status/marker, use a SCHEDULED marker) and added the SCHEDULED marker in *??GeneratorReducer??*. It works fine and also overcomes the drawback of our earlier solution. Will attach the code changes. Thanks Ferdy.. :) was (Author: riyaz): Hi Ferdy, The scenario mentioned below will not occur: *although there might be a problem with code that assumes STATUS_FETCHED, for example the ParserJob: It only processes STATUS_FETCHED entries. There may be more dependencies.* Since we do not set the *??GENERATE_MARK??* for URLs whose *??fetchtime > currentTime??* in GeneratorReducer, those URLs will not be processed in the Fetcher/Parser jobs. One drawback of this solution could be that we update a few columns of all the URLs (SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce ??GeneratorReducer?? performance.
We have made the changes you suggested (instead of an UNSCHEDULED status/marker, use a SCHEDULED marker) and added the SCHEDULED marker in *??GeneratorReducer??*. It works fine and also overcomes the drawback of our earlier solution. Will attach the code changes. Thanks Ferdy.. :) > Nutch2 Refactor the update process so that fetched items are only processed > once > > > Key: NUTCH-1457 > URL: https://issues.apache.org/jira/browse/NUTCH-1457 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: 2.4 > > Attachments: CrawlStatus.java, DbUpdateReducer.java, > GeneratorMapper.java, GeneratorReducer.java > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1457) Nutch2 Refactor the update process so that fetched items are only processed once
[ https://issues.apache.org/jira/browse/NUTCH-1457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711510#comment-13711510 ] Riyaz Shaik commented on NUTCH-1457: Hi Ferdy, The scenario mentioned below will not occur: *although there might be a problem with code that assumes STATUS_FETCHED, for example the ParserJob: It only processes STATUS_FETCHED entries. There may be more dependencies.* Since we do not set the *??GENERATE_MARK??* for URLs whose *??fetchtime > currentTime??* in GeneratorReducer, those URLs will not be processed in the Fetcher/Parser jobs. One drawback of this solution could be that we update a few columns of all the URLs (SCHEDULED + UNSCHEDULED) in HBase from ??GeneratorReducer??, which might reduce ??GeneratorReducer?? performance. We have made the changes you suggested (instead of an UNSCHEDULED status/marker, use a SCHEDULED marker) and added the SCHEDULED marker in *??GeneratorReducer??*. It works fine and also overcomes the drawback of our earlier solution. Will attach the code changes. Thanks Ferdy.. :) > Nutch2 Refactor the update process so that fetched items are only processed > once > > > Key: NUTCH-1457 > URL: https://issues.apache.org/jira/browse/NUTCH-1457 > Project: Nutch > Issue Type: Improvement >Reporter: Ferdy Galema > Fix For: 2.4 > > Attachments: CrawlStatus.java, DbUpdateReducer.java, > GeneratorMapper.java, GeneratorReducer.java > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
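The rule Riyaz describes can be sketched in a few lines. This is a hedged illustration, not the actual Nutch 2.x source: the class name, method, and marker value below are hypothetical, and only the "no generate mark when fetchtime > currentTime" rule comes from the comment above.

```java
// Hedged sketch, NOT the actual Nutch 2.x GeneratorReducer code.
// It only illustrates the scheduling rule from the comment above:
// URLs whose fetch time is still in the future get no generate mark,
// so the Fetcher/Parser jobs skip them in this cycle.
public class GeneratorMarkSketch {

    // Hypothetical marker value for illustration; the real constant
    // lives in Nutch's marker/generator classes.
    static final String GENERATE_MARK = "_gnmrk_";

    /** Returns the mark to apply, or null when the URL is not yet due. */
    static String markFor(long fetchTime, long currentTime) {
        if (fetchTime > currentTime) {
            return null; // not due yet: leave unmarked, skip this cycle
        }
        return GENERATE_MARK; // due: mark it for the Fetcher/Parser jobs
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(markFor(now - 1000, now)); // due, gets the mark
        System.out.println(markFor(now + 1000, now)); // future, gets null
    }
}
```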
[jira] [Commented] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711496#comment-13711496 ] Lewis John McGibbney commented on NUTCH-1228: - Does this affect 2.x? There is no issue description. > Change mapred.task.timeout to mapreduce.task.timeout in fetcher > --- > > Key: NUTCH-1228 > URL: https://issues.apache.org/jira/browse/NUTCH-1228 > Project: Nutch > Issue Type: Task > Components: fetcher >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.9 > > Attachments: NUTCH-1228-2.1.patch > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Closed] (NUTCH-1612) Getting URl Malformed exception with Nutch 2.2 and Hadoop 1.0.3
[ https://issues.apache.org/jira/browse/NUTCH-1612?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney closed NUTCH-1612. --- Resolution: Cannot Reproduce Amit, please go to the user list for queries such as this. If, over there, we find that this is a bug, we will then come to the Jira tracker and file an issue. Please describe your set up and print as much useful content from your logs as possible. Thank you > Getting URl Malformed exception with Nutch 2.2 and Hadoop 1.0.3 > --- > > Key: NUTCH-1612 > URL: https://issues.apache.org/jira/browse/NUTCH-1612 > Project: Nutch > Issue Type: Bug > Environment: Ubuntu 64 bit, nutch 2.2, hadoop 1.0.3,hbase-0.90.3 >Reporter: Amit Yadav > > When I start crawling using bin/crawl I am getting "URLMalfomed Exception". > I am using Hbase as data store. I can see that the WebTable is created in the > Hbase. > I am able to run the same in local mode. > Any help on this would be appreciable. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Lewis John McGibbney updated NUTCH-1613: Fix Version/s: 2.3 > Timeouts in protocol-httpclient when crawling same host with >2 threads and > added cookie strings for both http protocols > > > Key: NUTCH-1613 > URL: https://issues.apache.org/jira/browse/NUTCH-1613 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 2.2.1 >Reporter: Brian >Priority: Minor > Labels: patch > Fix For: 2.3 > > Attachments: NUTCH-1613.patch > > > 1.) When using protocol-httpclient to crawl a single website (the same host) > I would always get a bunch of timeout errors during fetching and the pages > with errors would not be fetched. E.g.: > 2013-07-09 17:57:13,717 WARN fetcher.FetcherJob - fetch of http://www > failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: > Timeout waiting for connection > 2013-07-09 17:57:13,718 INFO fetcher.FetcherJob - fetching http://www > (queue crawl delay=0ms) > 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following > error: > org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting > for connection > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497) > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) > at > org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:95) > at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174) > at > 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133) > at > org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518) > This is because by default the connection pool manager only allows 2 > connections per host so if more than 2 threads are used the others will tend > to time out waiting to get a connection. The code previously set max > connections correctly but not connection per host. > 2.) I also added at the same time simple modifications to both protocol-http > and protocol-httpclient to allow specifying a cookie string in the conf file > to include in request headers. > I use this to crawl site content requiring authentication - it is better for > me to specify the cookie string for the authentication than go through the > whole authentication process and specifying login info. > The nutch-site.xml property is the following: > > http.cookie_string > XX_AL=authorization_value_goes_here > String to use as the cookie value for HTTP > requests > > Although I use it for authentication it can be used to specify any single > cookie string for the crawl (httpclient does support different cookies for > different hosts but I did not get into that). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
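The property quoted in the description above lost its XML tags in the mail rendering. Assuming the usual Hadoop-style configuration syntax, the nutch-site.xml entry would look roughly like this (the cookie value is the placeholder from the description, and the exact layout is assumed):

```xml
<!-- Reconstruction of the property quoted above; the XML tags were
     stripped by the mail rendering, so the exact layout is assumed. -->
<property>
  <name>http.cookie_string</name>
  <value>XX_AL=authorization_value_goes_here</value>
  <description>String to use as the cookie value for HTTP
  requests</description>
</property>
```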
[jira] [Updated] (NUTCH-1300) Indexer to filter and normalize URL's
[ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma updated NUTCH-1300: - Summary: Indexer to filter and normalize URL's (was: Indexer to normalize URL's) > Indexer to filter and normalize URL's > - > > Key: NUTCH-1300 > URL: https://issues.apache.org/jira/browse/NUTCH-1300 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1300-1.5-1.patch > > > Indexers should be able to normalize URL's. This is useful when a new > normalizer is applied to the entire CrawlDB. Without it, some or all records > in a segment cannot be indexed at all. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
[ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711435#comment-13711435 ] Markus Jelsma commented on NUTCH-1614: -- Most if not all filters/normalizers support setting a config file, so you can use a different filter/normalizer config file per stage. This way you can use one set of regex rules during fetch/update and another during indexing. You'll have to check each plugin's code to find the exact configuration parameter you need to point it to a different config file. I think this was never ported to 2.x, so I think it's better to first port the pluggable indexing backends from 1.x to 2.x and then let them also support filtering and normalizing. Also, NUTCH-1300's title is wrong, it should be normalizing AND filtering. If you check the patch you'll see it's actually about both. > Plugin to exclude URLs matching regex list from indexing - to enable crawl > but do not index > --- > > Key: NUTCH-1614 > URL: https://issues.apache.org/jira/browse/NUTCH-1614 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.2.1 >Reporter: Brian >Priority: Minor > Labels: plugin > Attachments: NUTCH-1614.patch > > > Some pages we need to crawl (such as some main pages and different views of a > main page) to get all the other pages, but we don't want to index those pages > themselves. Therefore we cannot use the url filter approach. > This plugin uses a file containing regex strings (see included sample file). > If one of the regex strings matches with an entire URL, that URL will be > excluded from indexing. > The file to use is specified by the following property in nutch-site.xml: > > indexer.url.filter.exclude.regex.file > regex-indexer-exclude-urls.txt > > Holds the file name containing the regex strings. Any URL > matching one of these strings will be excluded from indexing. > "#" indicates a comment line and will be ignored. 
> > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Reopened] (NUTCH-1300) Indexer to normalize URL's
[ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma reopened NUTCH-1300: -- > Indexer to normalize URL's > -- > > Key: NUTCH-1300 > URL: https://issues.apache.org/jira/browse/NUTCH-1300 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1300-1.5-1.patch > > > Indexers should be able to normalize URL's. This is useful when a new > normalizer is applied to the entire CrawlDB. Without it, some or all records > in a segment cannot be indexed at all. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Resolved] (NUTCH-1300) Indexer to filter and normalize URL's
[ https://issues.apache.org/jira/browse/NUTCH-1300?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Markus Jelsma resolved NUTCH-1300. -- Resolution: Fixed renamed issue for clarity. > Indexer to filter and normalize URL's > - > > Key: NUTCH-1300 > URL: https://issues.apache.org/jira/browse/NUTCH-1300 > Project: Nutch > Issue Type: New Feature > Components: indexer >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Minor > Fix For: 1.6 > > Attachments: NUTCH-1300-1.5-1.patch > > > Indexers should be able to normalize URL's. This is useful when a new > normalizer is applied to the entire CrawlDB. Without it, some or all records > in a segment cannot be indexed at all. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Comment Edited] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
[ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711365#comment-13711365 ] Brian edited comment on NUTCH-1614 at 7/17/13 6:29 PM: --- Can you please tell me how to do this? I couldn't find anything about it. From what I can tell, URL filters apply to crawling, not just indexing; I couldn't see how to apply them to indexing only. I don't see how normalizing a URL would help in this case if it still filters the URL from the crawl and not just from indexing. I see an option with the solrindex command, but it appears to be available only in Nutch 1.x. Even if it were in 2.x, it is not clear from the documentation how to use the option to achieve the desired effect: http://wiki.apache.org/nutch/bin/nutch%20solrindex was (Author: brian44): Can you please tell me how to do this? I couldn't find anything about it. From what I can tell, URL filters apply to crawling, not just indexing; I couldn't see how to apply them to indexing only. I don't see how normalizing a URL would help in this case if it still filters the URL from the crawl and not just from indexing. > Plugin to exclude URLs matching regex list from indexing - to enable crawl > but do not index > --- > > Key: NUTCH-1614 > URL: https://issues.apache.org/jira/browse/NUTCH-1614 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.2.1 >Reporter: Brian >Priority: Minor > Labels: plugin > Attachments: NUTCH-1614.patch > > > Some pages we need to crawl (such as some main pages and different views of a > main page) to get all the other pages, but we don't want to index those pages > themselves. Therefore we cannot use the url filter approach. > This plugin uses a file containing regex strings (see included sample file). > If one of the regex strings matches with an entire URL, that URL will be > excluded from indexing. 
> The file to use is specified by the following property in nutch-site.xml: > > indexer.url.filter.exclude.regex.file > regex-indexer-exclude-urls.txt > > Holds the file name containing the regex strings. Any URL > matching one of these strings will be excluded from indexing. > "#" indicates a comment line and will be ignored. > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
[ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711365#comment-13711365 ] Brian commented on NUTCH-1614: -- Can you please tell me how to do this? I couldn't find anything about it. From what I can tell, URL filters apply to crawling, not just indexing; I couldn't see how to apply them to indexing only. I don't see how normalizing a URL would help in this case if it still filters the URL from the crawl and not just from indexing. > Plugin to exclude URLs matching regex list from indexing - to enable crawl > but do not index > --- > > Key: NUTCH-1614 > URL: https://issues.apache.org/jira/browse/NUTCH-1614 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.2.1 >Reporter: Brian >Priority: Minor > Labels: plugin > Attachments: NUTCH-1614.patch > > > Some pages we need to crawl (such as some main pages and different views of a > main page) to get all the other pages, but we don't want to index those pages > themselves. Therefore we cannot use the url filter approach. > This plugin uses a file containing regex strings (see included sample file). > If one of the regex strings matches with an entire URL, that URL will be > excluded from indexing. > The file to use is specified by the following property in nutch-site.xml: > > indexer.url.filter.exclude.regex.file > regex-indexer-exclude-urls.txt > > Holds the file name containing the regex strings. Any URL > matching one of these strings will be excluded from indexing. > "#" indicates a comment line and will be ignored. > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Commented] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
[ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711308#comment-13711308 ] Markus Jelsma commented on NUTCH-1614: -- You can already do this since Nutch 1.5. It doesn't need any special plugins and just reuses both filtering and normalizing systems in Nutch. > Plugin to exclude URLs matching regex list from indexing - to enable crawl > but do not index > --- > > Key: NUTCH-1614 > URL: https://issues.apache.org/jira/browse/NUTCH-1614 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.2.1 >Reporter: Brian >Priority: Minor > Labels: plugin > Attachments: NUTCH-1614.patch > > > Some pages we need to crawl (such as some main pages and different views of a > main page) to get all the other pages, but we don't want to index those pages > themselves. Therefore we cannot use the url filter approach. > This plugin uses a file containing regex strings (see included sample file). > If one of the regex strings matches with an entire URL, that URL will be > excluded form indexing. > The file to use is specified by the following property in nutch-site.xml: > > indexer.url.filter.exclude.regex.file > regex-indexer-exclude-urls.txt > > Holds the file name containing the regex strings. Any URL > matching one of these strings will be excluded from indexing. > "#" indicates a comment line and will be ignored. > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Updated] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
[ https://issues.apache.org/jira/browse/NUTCH-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Brian updated NUTCH-1614: - Attachment: NUTCH-1614.patch > Plugin to exclude URLs matching regex list from indexing - to enable crawl > but do not index > --- > > Key: NUTCH-1614 > URL: https://issues.apache.org/jira/browse/NUTCH-1614 > Project: Nutch > Issue Type: Improvement > Components: indexer >Affects Versions: 2.2.1 >Reporter: Brian >Priority: Minor > Labels: plugin > Attachments: NUTCH-1614.patch > > > Some pages we need to crawl (such as some main pages and different views of a > main page) to get all the other pages, but we don't want to index those pages > themselves. Therefore we cannot use the url filter approach. > This plugin uses a file containing regex strings (see included sample file). > If one of the regex strings matches with an entire URL, that URL will be > excluded form indexing. > The file to use is specified by the following property in nutch-site.xml: > > indexer.url.filter.exclude.regex.file > regex-indexer-exclude-urls.txt > > Holds the file name containing the regex strings. Any URL > matching one of these strings will be excluded from indexing. > "#" indicates a comment line and will be ignored. > > -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
[jira] [Created] (NUTCH-1614) Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index
Brian created NUTCH-1614: Summary: Plugin to exclude URLs matching regex list from indexing - to enable crawl but do not index Key: NUTCH-1614 URL: https://issues.apache.org/jira/browse/NUTCH-1614 Project: Nutch Issue Type: Improvement Components: indexer Affects Versions: 2.2.1 Reporter: Brian Priority: Minor Some pages we need to crawl (such as some main pages and different views of a main page) to get all the other pages, but we don't want to index those pages themselves. Therefore we cannot use the url filter approach. This plugin uses a file containing regex strings (see included sample file). If one of the regex strings matches an entire URL, that URL will be excluded from indexing. The file to use is specified by the following property in nutch-site.xml: indexer.url.filter.exclude.regex.file regex-indexer-exclude-urls.txt Holds the file name containing the regex strings. Any URL matching one of these strings will be excluded from indexing. "#" indicates a comment line and will be ignored. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
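The property in the issue description lost its XML tags in the mail rendering. Assuming the usual Hadoop-style configuration syntax, the nutch-site.xml entry would look roughly like this (the exact layout is assumed; name, value, and description text are taken from the description above):

```xml
<!-- Reconstruction of the property from the issue description; the XML
     tags were stripped by the mail rendering, so the layout is assumed. -->
<property>
  <name>indexer.url.filter.exclude.regex.file</name>
  <value>regex-indexer-exclude-urls.txt</value>
  <description>Holds the file name containing the regex strings. Any URL
  matching one of these strings will be excluded from indexing.
  "#" indicates a comment line and will be ignored.</description>
</property>
```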
[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711223#comment-13711223 ] Brian commented on NUTCH-1613: -- Yes, if set it will be included in requests for all URLs. > Timeouts in protocol-httpclient when crawling same host with >2 threads and > added cookie strings for both http protocols > > > Key: NUTCH-1613 > URL: https://issues.apache.org/jira/browse/NUTCH-1613 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 2.2.1 >Reporter: Brian >Priority: Minor > Labels: patch > Attachments: NUTCH-1613.patch > > > 1.) When using protocol-httpclient to crawl a single website (the same host) > I would always get a bunch of timeout errors during fetching and the pages > with errors would not be fetched. E.g.: > 2013-07-09 17:57:13,717 WARN fetcher.FetcherJob - fetch of http://www > failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: > Timeout waiting for connection > 2013-07-09 17:57:13,718 INFO fetcher.FetcherJob - fetching http://www > (queue crawl delay=0ms) > 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following > error: > org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting > for connection > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497) > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) > at > org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:95) > at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174) > at > 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133) > at > org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518) > This is because by default the connection pool manager only allows 2 > connections per host so if more than 2 threads are used the others will tend > to time out waiting to get a connection. The code previously set max > connections correctly but not connection per host. > 2.) I also added at the same time simple modifications to both protocol-http > and protocol-httpclient to allow specifying a cookie string in the conf file > to include in request headers. > I use this to crawl site content requiring authentication - it is better for > me to specify the cookie string for the authentication than go through the > whole authentication process and specifying login info. > The nutch-site.xml property is the following: > > http.cookie_string > XX_AL=authorization_value_goes_here > String to use as the cookie value for HTTP > requests > > Although I use it for authentication it can be used to specify any single > cookie string for the crawl (httpclient does support different cookies for > different hosts but I did not get into that). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
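The timeout mechanism described above can be modelled with a few lines of stdlib Java. This is NOT the commons-httpclient API, just a sketch of the pool behaviour: with the default per-host cap of 2, a third concurrent fetch to the same host waits for a slot and times out, which is what surfaces as ConnectionPoolTimeoutException.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

// Stdlib model of the pool behaviour described above -- NOT the
// commons-httpclient API. Two slots per host means a third concurrent
// fetch must wait, and it times out if neither slot frees up in time.
public class PerHostPoolSketch {
    static final int MAX_PER_HOST = 2; // httpclient's default per-host limit

    /** Simulates a third fetch while two threads already hold both slots. */
    static boolean thirdFetchGetsSlot(long timeoutMs) throws InterruptedException {
        Semaphore hostSlots = new Semaphore(MAX_PER_HOST);
        hostSlots.acquire(); // fetch thread 1 holds a connection
        hostSlots.acquire(); // fetch thread 2 holds the other
        return hostSlots.tryAcquire(timeoutMs, TimeUnit.MILLISECONDS);
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(thirdFetchGetsSlot(100)
                ? "connected"
                : "timeout waiting for connection");
    }
}
```

In commons-httpclient 3.x the fix is to raise the per-host limit alongside the total, which (if memory serves) is done through HttpConnectionManagerParams, e.g. setDefaultMaxConnectionsPerHost; check the attached patch for the exact call used.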
[jira] [Commented] (NUTCH-1613) Timeouts in protocol-httpclient when crawling same host with >2 threads and added cookie strings for both http protocols
[ https://issues.apache.org/jira/browse/NUTCH-1613?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13711150#comment-13711150 ] lufeng commented on NUTCH-1613: --- Will this specified cookie string affect all crawled URLs? > Timeouts in protocol-httpclient when crawling same host with >2 threads and > added cookie strings for both http protocols > > > Key: NUTCH-1613 > URL: https://issues.apache.org/jira/browse/NUTCH-1613 > Project: Nutch > Issue Type: Bug > Components: protocol >Affects Versions: 2.2.1 >Reporter: Brian >Priority: Minor > Labels: patch > Attachments: NUTCH-1613.patch > > > 1.) When using protocol-httpclient to crawl a single website (the same host) > I would always get a bunch of timeout errors during fetching and the pages > with errors would not be fetched. E.g.: > 2013-07-09 17:57:13,717 WARN fetcher.FetcherJob - fetch of http://www > failed with: org.apache.commons.httpclient.ConnectionPoolTimeoutException: > Timeout waiting for connection > 2013-07-09 17:57:13,718 INFO fetcher.FetcherJob - fetching http://www > (queue crawl delay=0ms) > 2013-07-09 17:57:13,715 ERROR httpclient.Http - Failed with the following > error: > org.apache.commons.httpclient.ConnectionPoolTimeoutException: Timeout waiting > for connection > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.doGetConnection(MultiThreadedHttpConnectionManager.java:497) > at > org.apache.commons.httpclient.MultiThreadedHttpConnectionManager.getConnectionWithTimeout(MultiThreadedHttpConnectionManager.java:416) > at > org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:153) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397) > at > org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323) > at > org.apache.nutch.protocol.httpclient.HttpResponse.(HttpResponse.java:95) > at org.apache.nutch.protocol.httpclient.Http.getResponse(Http.java:174) > at > 
org.apache.nutch.protocol.http.api.HttpBase.getProtocolOutput(HttpBase.java:133) > at > org.apache.nutch.fetcher.FetcherReducer$FetcherThread.run(FetcherReducer.java:518) > This is because by default the connection pool manager only allows 2 > connections per host, so if more than 2 threads are used the others will tend > to time out waiting to get a connection. The code previously set max > connections correctly but not connections per host. > 2.) At the same time I also added simple modifications to both protocol-http > and protocol-httpclient to allow specifying a cookie string in the conf file > to include in request headers. > I use this to crawl site content requiring authentication - it is better for > me to specify the cookie string for the authentication than to go through the > whole authentication process and specify login info. > The nutch-site.xml property is the following: > <property> > <name>http.cookie_string</name> > <value>XX_AL=authorization_value_goes_here</value> > <description>String to use as the cookie value for HTTP > requests</description> > </property> > Although I use it for authentication it can be used to specify any single > cookie string for the crawl (httpclient does support different cookies for > different hosts but I did not get into that). -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
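Both changes described above map onto the commons-httpclient 3.x API; a minimal sketch, assuming that library version (the class name, variable names, and cookie wiring are illustrative assumptions, not the exact patch):

```java
import org.apache.commons.httpclient.HttpClient;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.methods.GetMethod;
import org.apache.commons.httpclient.params.HttpConnectionManagerParams;

public class PooledHttpSketch {

    // Build a client whose pool allows maxThreads connections to a single
    // host; without setDefaultMaxConnectionsPerHost the pool defaults to 2,
    // so fetcher threads 3..n time out waiting for a connection.
    public static HttpClient buildClient(int maxThreads) {
        MultiThreadedHttpConnectionManager cm = new MultiThreadedHttpConnectionManager();
        HttpConnectionManagerParams params = cm.getParams();
        params.setMaxTotalConnections(maxThreads);
        params.setDefaultMaxConnectionsPerHost(maxThreads);
        return new HttpClient(cm);
    }

    // Hypothetical use of the http.cookie_string value read from
    // nutch-site.xml: attach it as a Cookie header on every request.
    public static GetMethod withCookie(String url, String cookieString) {
        GetMethod get = new GetMethod(url);
        if (cookieString != null) {
            get.setRequestHeader("Cookie", cookieString);
        }
        return get;
    }
}
```

This is not runnable standalone; it needs the commons-httpclient 3.x jar on the classpath, as protocol-httpclient does.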
[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gül Ahmet Türkoğlu updated NUTCH-1228: -- Attachment: NUTCH-1228-2.1.patch I changed mapred.task.timeout to mapreduce.task.timeout in the fetcher. > Change mapred.task.timeout to mapreduce.task.timeout in fetcher > --- > > Key: NUTCH-1228 > URL: https://issues.apache.org/jira/browse/NUTCH-1228 > Project: Nutch > Issue Type: Task > Components: fetcher >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.9 > > Attachments: NUTCH-1228-2.1.patch > >
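For reference, the renamed property is an ordinary Hadoop configuration override; a hedged sketch of what a nutch-site.xml entry might look like (the 600000 ms value and description are illustrative assumptions, not taken from the patch):

```xml
<property>
  <name>mapreduce.task.timeout</name>
  <value>600000</value>
  <description>Milliseconds before a non-reporting task is declared hung
  and killed. Replaces the deprecated mapred.task.timeout key used by the
  old mapred API.</description>
</property>
```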
[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gül Ahmet Türkoğlu updated NUTCH-1228: -- Attachment: NUTCH-1228-2.1.patch > Change mapred.task.timeout to mapreduce.task.timeout in fetcher > --- > > Key: NUTCH-1228 > URL: https://issues.apache.org/jira/browse/NUTCH-1228 > Project: Nutch > Issue Type: Task > Components: fetcher >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.9 > >
[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gül Ahmet Türkoğlu updated NUTCH-1228: -- Attachment: (was: NUTCH-1228-2.1.patch) > Change mapred.task.timeout to mapreduce.task.timeout in fetcher > --- > > Key: NUTCH-1228 > URL: https://issues.apache.org/jira/browse/NUTCH-1228 > Project: Nutch > Issue Type: Task > Components: fetcher >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.9 > >
[jira] [Updated] (NUTCH-1228) Change mapred.task.timeout to mapreduce.task.timeout in fetcher
[ https://issues.apache.org/jira/browse/NUTCH-1228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gül Ahmet Türkoğlu updated NUTCH-1228: -- Attachment: NUTCH-1228-2.1.patch > Change mapred.task.timeout to mapreduce.task.timeout in fetcher > --- > > Key: NUTCH-1228 > URL: https://issues.apache.org/jira/browse/NUTCH-1228 > Project: Nutch > Issue Type: Task > Components: fetcher >Reporter: Markus Jelsma >Assignee: Markus Jelsma >Priority: Trivial > Fix For: 1.9 > >