[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1031:
-------------------------------
    Attachment: CC.robots.multiple.agents.patch

I looked at the source code of CC to understand how it works, and identified the change needed in CC so that it supports multiple user agents. While testing it, I found that there is a semantic difference between the way CC works and the legacy Nutch parser.

*What CC does:*
It splits _http.robots.agents_ over commas (the change that I made locally). It scans the robots file line by line, each time checking whether the current User-Agent line from the file matches any of the agents in _http.robots.agents_. If a match is found, it takes all the corresponding rules for that agent and stops further parsing.
{noformat}
robots file:
User-Agent: Agent1 #foo
Disallow: /a

User-Agent: Agent2 Agent3
Disallow: /d

http.robots.agents: Agent2,Agent1
Path: /a
{noformat}
For the example above, as soon as the first line of the robots file is scanned, a match for Agent1 is found. CC scans all the corresponding rules for that agent and stores only this information:
{noformat}
User-Agent: Agent1
Disallow: /a
{noformat}
Everything else is ignored.

*What the Nutch robots parser does:*
It splits _http.robots.agents_ over commas. It scans ALL the lines of the robots file and evaluates the matches in terms of the precedence of the user agents. For the example above, the records for both Agent2 and Agent1 match in the robots file, but as Agent2 comes first in _http.robots.agents_, it is given priority and the rules stored will be:
{noformat}
User-Agent: Agent2
Disallow: /d
{noformat}
If we want to leave the precedence-based behavior behind and adopt the model in CC, then I have a small patch for crawler-commons (CC.robots.multiple.agents.patch).

Delegate parsing of robots.txt to crawler-commons
-------------------------------------------------
Key: NUTCH-1031
URL: https://issues.apache.org/jira/browse/NUTCH-1031
Project: Nutch
Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
Labels: robots.txt
Fix For: 1.7
Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.
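To make the first-match semantics above concrete, here is a minimal, self-contained sketch. It is an illustration only, not the CC or Nutch implementation: it splits the agent list on commas and keeps the rules of the first User-Agent record in the file that matches any configured agent.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FirstMatchRobotsDemo {

  /** Illustrative only: returns the Disallow rules of the first User-Agent
   *  record matching any name in agentList (comma-separated), mimicking the
   *  CC behavior described above. Real parsers handle many more cases. */
  static List<String> firstMatchDisallows(String robotsTxt, String agentList) {
    List<String> agents = Arrays.asList(agentList.toLowerCase().split(","));
    List<String> rules = new ArrayList<String>();
    boolean inMatchedRecord = false;
    for (String line : robotsTxt.split("\n")) {
      line = line.replaceAll("#.*", "").trim();      // strip comments
      if (line.toLowerCase().startsWith("user-agent:")) {
        if (inMatchedRecord && !rules.isEmpty()) break; // stop after first match
        String names = line.substring("user-agent:".length()).trim().toLowerCase();
        inMatchedRecord = false;
        for (String name : names.split("\\s+")) {
          if (agents.contains(name)) inMatchedRecord = true;
        }
      } else if (inMatchedRecord && line.toLowerCase().startsWith("disallow:")) {
        rules.add(line.substring("disallow:".length()).trim());
      }
    }
    return rules;
  }

  public static void main(String[] args) {
    String robots = "User-Agent: Agent1 #foo\nDisallow: /a\n\n"
                  + "User-Agent: Agent2 Agent3\nDisallow: /d\n";
    // Prints [/a]: Agent1's record wins because it appears first in the
    // file, even though Agent2 precedes it in http.robots.agents.
    System.out.println(firstMatchDisallows(robots, "Agent2,Agent1"));
  }
}
{code}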
[jira] [Assigned] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil reassigned NUTCH-1031:
----------------------------------
    Assignee: Tejas Patil  (was: Julien Nioche)

Delegate parsing of robots.txt to crawler-commons
-------------------------------------------------
Key: NUTCH-1031
URL: https://issues.apache.org/jira/browse/NUTCH-1031
Project: Nutch
Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
Labels: robots.txt
Fix For: 1.7
Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.
[jira] [Assigned] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil reassigned NUTCH-1513:
----------------------------------
    Assignee: Tejas Patil

Support Robots.txt for Ftp urls
-------------------------------
Key: NUTCH-1513
URL: https://issues.apache.org/jira/browse/NUTCH-1513
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.7, 2.2
Reporter: Tejas Patil
Assignee: Tejas Patil
Priority: Minor
Labels: robots.txt
Fix For: 1.7, 2.2

As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, the Ftp plugin does not parse the robots file and accepts all urls. In _src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_:
{noformat}
public RobotRules getRobotRules(Text url, CrawlDatum datum) {
  return EmptyRobotRules.RULES;
}
{noformat}
It's not clear whether this was part of the design or whether it's a bug.

[0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
[1] : ftp://example.com/robots.txt
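Fetching the file itself needs no new protocol machinery; a hedged sketch (the url and the plumbing are placeholders, not the Nutch plugin API) showing that java.net.URL can already stream ftp: urls, so the plugin could retrieve robots.txt and hand it to a parser instead of returning EmptyRobotRules.RULES:
{code}
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

public class FtpRobotsFetchDemo {
  public static void main(String[] args) throws Exception {
    // java.net.URL has a built-in ftp: handler, so the file can be
    // fetched without extra dependencies.
    URL robots = new URL("ftp://example.com/robots.txt");
    InputStream in = robots.openStream();
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buf.write(chunk, 0, n);
    }
    in.close();
    byte[] content = buf.toByteArray();
    // 'content' would then be handed to the robots.txt parser.
    System.out.println(new String(content, "UTF-8"));
  }
}
{code}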
[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1284:
-------------------------------
    Attachment: NUTCH-1284-trunk.v1.patch

Hi Lewis, if I recall correctly, we want the crawl delay for the url (and hence its queue's delay) to be logged when the url's fetching begins. Right?

Add site fetcher.max.crawl.delay as log output by default.
-----------------------------------------------------------
Key: NUTCH-1284
URL: https://issues.apache.org/jira/browse/NUTCH-1284
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Trivial
Fix For: 1.7, 2.2
Attachments: NUTCH-1284.patch, NUTCH-1284-trunk.v1.patch

Currently, when manually scanning our log output, we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like:
{code}
2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms)
{code}
This way we can easily and quickly determine whether the fetcher is having to use this functionality or not.
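A rough, self-contained sketch of the proposed log line (SLF4J is used for illustration; the logger name and message format are assumed from the example in the description, not taken from the attached patch):
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CrawlDelayLogDemo {
  private static final Logger LOG = LoggerFactory.getLogger("fetcher.Fetcher");

  public static void main(String[] args) {
    String url = "http://nutch.apache.org/";
    long crawlDelayMs = 5000; // would come from robots.txt / the queue's config
    // Produces a line like the example in the issue description:
    LOG.info("fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)");
  }
}
{code}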
[jira] [Assigned] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil reassigned NUTCH-1042:
----------------------------------
    Assignee: Tejas Patil

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Tejas Patil
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
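A minimal sketch of the guard the description implies, as an illustration rather than the committed fix: a negative maxCrawlDelay is treated as "never skip", matching the documented meaning of fetcher.max.crawl.delay = -1.
{code}
public class MaxCrawlDelayDemo {
  /** Sketch (not the committed fix): only enforce the cap when
   *  maxCrawlDelayMs is non-negative, so -1 means "never skip". */
  static boolean shouldSkip(long crawlDelayMs, long maxCrawlDelayMs) {
    return crawlDelayMs > 0 && maxCrawlDelayMs >= 0
        && crawlDelayMs > maxCrawlDelayMs;
  }

  public static void main(String[] args) {
    long maxCrawlDelay = -1 * 1000L; // fetcher.max.crawl.delay = -1, as on line 554
    System.out.println(shouldSkip(60000, maxCrawlDelay)); // false: page is fetched
    System.out.println(shouldSkip(60000, 30000));         // true: 60s > 30s cap
  }
}
{code}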
[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558225#comment-13558225 ]

Tejas Patil commented on NUTCH-1042:
------------------------------------

The patch for [NUTCH-1284|https://issues.apache.org/jira/browse/NUTCH-1284] fixes this issue. I did not know that until Lewis pointed it out. Thanks Lewis :)

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Tejas Patil
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
[jira] [Commented] (NUTCH-1329) parser not extract outlinks to external web sites
[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558228#comment-13558228 ]

Tejas Patil commented on NUTCH-1329:
------------------------------------

I am not able to reproduce this bug with the default config. Are there any specific configs that you were using?

parser not extract outlinks to external web sites
--------------------------------------------------
Key: NUTCH-1329
URL: https://issues.apache.org/jira/browse/NUTCH-1329
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht
Labels: parse
Fix For: 1.7, 2.2

I found a bug in /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: outlinks like www.example2.com from www.example1.com are inserted as www.example1.com/www.example2.com. I corrected this bug by testing whether the outlink (www.example2.com) is a valid url on its own; otherwise it is resolved against its base url. So I replaced these lines:
{noformat}
URL url = URLUtil.resolveURL(base, target);
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
{noformat}
with:
{noformat}
String host_temp = null;
try {
  host_temp = URLUtil.getDomainName(new URL(target));
} catch (Exception e) {
  host_temp = null;
}
URL url = null;
if (host_temp == null) {
  // it is an internal outlink
  url = URLUtil.resolveURL(base, target);
} else {
  // it is an external link
  url = new URL(target);
}
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
{noformat}
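The reported behavior is reproducible with plain java.net.URL resolution, assuming URLUtil.resolveURL follows java.net.URL semantics (a schemeless target is treated as a relative path):
{code}
import java.net.URL;

public class OutlinkResolveDemo {
  public static void main(String[] args) throws Exception {
    URL base = new URL("http://www.example1.com/");
    // A schemeless target like "www.example2.com" resolves as a relative
    // path against the base, which reproduces the reported behavior:
    System.out.println(new URL(base, "www.example2.com"));
    // prints: http://www.example1.com/www.example2.com
  }
}
{code}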
[jira] [Assigned] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1042:
-------------------------------------------
    Assignee: Lewis John McGibbney  (was: Tejas Patil)

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Lewis John McGibbney
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558313#comment-13558313 ]

Lewis John McGibbney commented on NUTCH-1042:
---------------------------------------------

Hi Tejas, can you please link the issues? I am on a mobile browser and it is nearly impossible to do. Thank you

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Lewis John McGibbney
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558321#comment-13558321 ]

Tejas Patil commented on NUTCH-1042:
------------------------------------

Linked with NUTCH-1284.

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Lewis John McGibbney
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558340#comment-13558340 ]

Ken Krugler commented on NUTCH-1031:
------------------------------------

Hi Tejas - I've looked at your patch, and (assuming there's not a requirement to support precedence in the user agent name list) it seems like a valid change. Based on the RFC (http://www.robotstxt.org/norobots-rfc.txt), robot names shouldn't have commas, so splitting on that seems safe. Do you have a unit test to verify proper behavior? If so, I'd be happy to roll that into CC. -- Ken

Delegate parsing of robots.txt to crawler-commons
-------------------------------------------------
Key: NUTCH-1031
URL: https://issues.apache.org/jira/browse/NUTCH-1031
Project: Nutch
Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
Labels: robots.txt
Fix For: 1.7
Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558349#comment-13558349 ]

Tejas Patil commented on NUTCH-1031:
------------------------------------

Hi Ken, thanks for reviewing the patch. I will include a test case in the patch. Before that, a bigger question is whether Nutch should adopt the parsing model in CC and forget about the precedence. BTW: did you find any error in my understanding of how CC parses robots?

Delegate parsing of robots.txt to crawler-commons
-------------------------------------------------
Key: NUTCH-1031
URL: https://issues.apache.org/jira/browse/NUTCH-1031
Project: Nutch
Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
Labels: robots.txt
Fix For: 1.7
Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.
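A sketch of what such a test might look like. Hedged assumptions: the parseContent signature below matches the CC snapshot under discussion, and the patched parser accepts a comma-separated agent list; exact matching rules (e.g. case handling) may differ.
{code}
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class MultipleAgentsRobotsTest {

  @Test
  public void firstMatchingAgentRecordWins() throws Exception {
    String robotsTxt = "User-Agent: Agent1\n"
                     + "Disallow: /a\n"
                     + "\n"
                     + "User-Agent: Agent2 Agent3\n"
                     + "Disallow: /d\n";
    // Assumes the patched parseContent accepts "Agent2,Agent1".
    BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
        "http://example.com/robots.txt", robotsTxt.getBytes("UTF-8"),
        "text/plain", "Agent2,Agent1");
    // With the first-match model, Agent1's record applies:
    assertFalse(rules.isAllowed("http://example.com/a"));
    assertTrue(rules.isAllowed("http://example.com/d"));
  }
}
{code}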
[jira] [Commented] (NUTCH-1219) Upgrade all jobs to new MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558476#comment-13558476 ]

lufeng commented on NUTCH-1219:
-------------------------------

Hi Markus, I see that Injector, Generator and Fetcher still use the old MapReduce API too. Should they also be upgraded to the new MR API?

Upgrade all jobs to new MapReduce API
-------------------------------------
Key: NUTCH-1219
URL: https://issues.apache.org/jira/browse/NUTCH-1219
Project: Nutch
Issue Type: Task
Reporter: Markus Jelsma
Priority: Critical
Fix For: 1.7

We should upgrade to the new Hadoop API for Nutch trunk, as has already been done for the Nutchgora branch. If I'm not mistaken, we can already upgrade to the latest 0.20.5 version that still carries the legacy API, so we can port the jobs to the new API without immediately upgrading to 0.21 or higher and without needing a separate branch to work on. To the committers who created/ported jobs in NutchGora, please write down your advice and experience. http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api
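For anyone picking up one of these jobs, the shape of the port is the same everywhere: extend org.apache.hadoop.mapreduce.Mapper instead of implementing org.apache.hadoop.mapred.Mapper, and emit through the Context. A generic sketch of the new-API side (an illustration, not one of the actual Nutch jobs):
{code}
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiMapperSketch
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // New API: emit through the Context instead of an OutputCollector.
    context.write(value, ONE);
  }
}
{code}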
[jira] [Updated] (NUTCH-1223) Migrate WebGraph to MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng updated NUTCH-1223:
--------------------------
    Attachment: WebGraph_new_MR_API.patch

Patch migrating WebGraph to the new MR API.

Migrate WebGraph to MapReduce API
---------------------------------
Key: NUTCH-1223
URL: https://issues.apache.org/jira/browse/NUTCH-1223
Project: Nutch
Issue Type: Sub-task
Reporter: Markus Jelsma
Assignee: lufeng
Fix For: 1.7
Attachments: WebGraph_new_MR_API.patch