[jira] [Updated] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1031:
-------------------------------
    Attachment: CC.robots.multiple.agents.patch

I looked at the source code of CC to understand how it works, and identified the change needed in CC so that it supports multiple user agents. While testing it, I found that there is a semantic difference between the way CC works and the legacy Nutch parser.

*What CC does:*
It splits _http.robots.agents_ over commas (the change that I made locally). It scans the robots file line by line, each time checking whether the current User-Agent line from the file matches any of the agents in _http.robots.agents_. If a match is found, it takes all the corresponding rules for that agent and stops further parsing.
{noformat}
robots file:
User-Agent: Agent1 #foo
Disallow: /a

User-Agent: Agent2 Agent3
Disallow: /d

http.robots.agents: Agent2,Agent1
Path: /a
{noformat}
For the example above, as soon as the first line of the robots file is scanned, a match for Agent1 is found. CC scans all the corresponding rules for that agent and stores only this information:
{noformat}
User-Agent: Agent1
Disallow: /a
{noformat}
Everything else is ignored.

*What the Nutch robots parser does:*
It splits _http.robots.agents_ over commas. It scans ALL the lines of the robots file and evaluates the matches in terms of the precedence of the user agents. For the example above, the records for both Agent2 and Agent1 match in the robots file, but as Agent2 comes first in _http.robots.agents_, it is given priority and the rules stored will be:
{noformat}
User-Agent: Agent2
Disallow: /d
{noformat}
If we want to leave the precedence-based behavior behind and adopt the model in CC, then I have a small patch for crawler-commons (CC.robots.multiple.agents.patch).

Delegate parsing of robots.txt to crawler-commons
-------------------------------------------------
Key: NUTCH-1031
URL: https://issues.apache.org/jira/browse/NUTCH-1031
Project: Nutch
Issue Type: Task
Reporter: Julien Nioche
Assignee: Julien Nioche
Priority: Minor
Labels: robots.txt
Fix For: 1.7
Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.
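To make the first-match semantics above concrete, here is a minimal, self-contained sketch. It is an illustration only, not the CC or Nutch implementation: it splits the agent list on commas and keeps the rules of the first User-Agent record in the file that matches any configured agent.
{code}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FirstMatchRobotsDemo {

  /** Illustrative only: returns the Disallow rules of the first User-Agent
   *  record matching any name in agentList (comma-separated), mimicking the
   *  CC behavior described above. Real parsers handle many more cases. */
  static List<String> firstMatchDisallows(String robotsTxt, String agentList) {
    List<String> agents = Arrays.asList(agentList.toLowerCase().split(","));
    List<String> rules = new ArrayList<String>();
    boolean inMatchedRecord = false;
    for (String line : robotsTxt.split("\n")) {
      line = line.replaceAll("#.*", "").trim();      // strip comments
      if (line.toLowerCase().startsWith("user-agent:")) {
        if (inMatchedRecord && !rules.isEmpty()) break; // stop after first match
        String names = line.substring("user-agent:".length()).trim().toLowerCase();
        inMatchedRecord = false;
        for (String name : names.split("\\s+")) {
          if (agents.contains(name)) inMatchedRecord = true;
        }
      } else if (inMatchedRecord && line.toLowerCase().startsWith("disallow:")) {
        rules.add(line.substring("disallow:".length()).trim());
      }
    }
    return rules;
  }

  public static void main(String[] args) {
    String robots = "User-Agent: Agent1 #foo\nDisallow: /a\n\n"
                  + "User-Agent: Agent2 Agent3\nDisallow: /d\n";
    // Prints [/a]: Agent1's record wins because it appears first in the
    // file, even though Agent2 precedes it in http.robots.agents.
    System.out.println(firstMatchDisallows(robots, "Agent2,Agent1"));
  }
}
{code}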
[jira] [Assigned] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil reassigned NUTCH-1031:
----------------------------------
    Assignee: Tejas Patil  (was: Julien Nioche)

Delegate parsing of robots.txt to crawler-commons
-------------------------------------------------
Key: NUTCH-1031
URL: https://issues.apache.org/jira/browse/NUTCH-1031
Project: Nutch
Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
Labels: robots.txt
Fix For: 1.7
Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.
[jira] [Assigned] (NUTCH-1513) Support Robots.txt for Ftp urls
[ https://issues.apache.org/jira/browse/NUTCH-1513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil reassigned NUTCH-1513:
----------------------------------
    Assignee: Tejas Patil

Support Robots.txt for Ftp urls
-------------------------------
Key: NUTCH-1513
URL: https://issues.apache.org/jira/browse/NUTCH-1513
Project: Nutch
Issue Type: Improvement
Affects Versions: 1.7, 2.2
Reporter: Tejas Patil
Assignee: Tejas Patil
Priority: Minor
Labels: robots.txt
Fix For: 1.7, 2.2

As per [0], an FTP website can have a robots.txt like [1]. In the Nutch code, the Ftp plugin does not parse the robots file and accepts all urls. In _src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/Ftp.java_:
{noformat}
public RobotRules getRobotRules(Text url, CrawlDatum datum) {
  return EmptyRobotRules.RULES;
}
{noformat}
It's not clear whether this was part of the design or whether it's a bug.

[0] : https://developers.google.com/webmasters/control-crawl-index/docs/robots_txt
[1] : ftp://example.com/robots.txt
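Fetching the file itself needs no new protocol machinery; a hedged sketch (the url and the plumbing are placeholders, not the Nutch plugin API) showing that java.net.URL can already stream ftp: urls, so the plugin could retrieve robots.txt and hand it to a parser instead of returning EmptyRobotRules.RULES:
{code}
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.net.URL;

public class FtpRobotsFetchDemo {
  public static void main(String[] args) throws Exception {
    // java.net.URL has a built-in ftp: handler, so the file can be
    // fetched without extra dependencies.
    URL robots = new URL("ftp://example.com/robots.txt");
    InputStream in = robots.openStream();
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    byte[] chunk = new byte[4096];
    int n;
    while ((n = in.read(chunk)) != -1) {
      buf.write(chunk, 0, n);
    }
    in.close();
    byte[] content = buf.toByteArray();
    // 'content' would then be handed to the robots.txt parser.
    System.out.println(new String(content, "UTF-8"));
  }
}
{code}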
[jira] [Updated] (NUTCH-1284) Add site fetcher.max.crawl.delay as log output by default.
[ https://issues.apache.org/jira/browse/NUTCH-1284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil updated NUTCH-1284:
-------------------------------
    Attachment: NUTCH-1284-trunk.v1.patch

Hi Lewis, if I recall correctly, we want the crawl delay for the url (and hence its queue's delay) to be logged when the url's fetching begins. Right?

Add site fetcher.max.crawl.delay as log output by default.
-----------------------------------------------------------
Key: NUTCH-1284
URL: https://issues.apache.org/jira/browse/NUTCH-1284
Project: Nutch
Issue Type: New Feature
Components: fetcher
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Tejas Patil
Priority: Trivial
Fix For: 1.7, 2.2
Attachments: NUTCH-1284.patch, NUTCH-1284-trunk.v1.patch

Currently, when manually scanning our log output, we cannot infer which pages are governed by a crawl delay between successive fetch attempts of any given page within the site. The value should be made available as something like:
{code}
2012-02-19 12:33:33,031 INFO fetcher.Fetcher - fetching http://nutch.apache.org/ (crawl.delay=XXXms)
{code}
This way we can easily and quickly determine whether the fetcher is having to use this functionality or not.
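A rough, self-contained sketch of the proposed log line (SLF4J is used for illustration; the logger name and message format are assumed from the example in the description, not taken from the attached patch):
{code}
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class CrawlDelayLogDemo {
  private static final Logger LOG = LoggerFactory.getLogger("fetcher.Fetcher");

  public static void main(String[] args) {
    String url = "http://nutch.apache.org/";
    long crawlDelayMs = 5000; // would come from robots.txt / the queue's config
    // Produces a line like the example in the issue description:
    LOG.info("fetching " + url + " (crawl.delay=" + crawlDelayMs + "ms)");
  }
}
{code}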
[jira] [Assigned] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tejas Patil reassigned NUTCH-1042:
----------------------------------
    Assignee: Tejas Patil

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Tejas Patil
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
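A minimal sketch of the guard the description implies, as an illustration rather than the committed fix: a negative maxCrawlDelay is treated as "never skip", matching the documented meaning of fetcher.max.crawl.delay = -1.
{code}
public class MaxCrawlDelayDemo {
  /** Sketch (not the committed fix): only enforce the cap when
   *  maxCrawlDelayMs is non-negative, so -1 means "never skip". */
  static boolean shouldSkip(long crawlDelayMs, long maxCrawlDelayMs) {
    return crawlDelayMs > 0 && maxCrawlDelayMs >= 0
        && crawlDelayMs > maxCrawlDelayMs;
  }

  public static void main(String[] args) {
    long maxCrawlDelay = -1 * 1000L; // fetcher.max.crawl.delay = -1, as on line 554
    System.out.println(shouldSkip(60000, maxCrawlDelay)); // false: page is fetched
    System.out.println(shouldSkip(60000, 30000));         // true: 60s > 30s cap
  }
}
{code}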
[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558225#comment-13558225 ]

Tejas Patil commented on NUTCH-1042:
------------------------------------

The patch for [NUTCH-1284|https://issues.apache.org/jira/browse/NUTCH-1284] fixes this issue. I did not know that until Lewis pointed it out. Thanks Lewis :)

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Tejas Patil
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
[jira] [Commented] (NUTCH-1329) parser not extract outlinks to external web sites
[ https://issues.apache.org/jira/browse/NUTCH-1329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558228#comment-13558228 ]

Tejas Patil commented on NUTCH-1329:
------------------------------------

I am not able to reproduce this bug with the default config. Are there any specific configs that you were using?

parser not extract outlinks to external web sites
--------------------------------------------------
Key: NUTCH-1329
URL: https://issues.apache.org/jira/browse/NUTCH-1329
Project: Nutch
Issue Type: Bug
Components: parser
Affects Versions: 1.4
Reporter: behnam nikbakht
Labels: parse
Fix For: 1.7, 2.2

I found a bug in /src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java: outlinks like www.example2.com from www.example1.com are inserted as www.example1.com/www.example2.com. I corrected this bug by testing whether the outlink (www.example2.com) is a valid url on its own; otherwise it is resolved against its base url. So I replaced these lines:
{noformat}
URL url = URLUtil.resolveURL(base, target);
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
{noformat}
with:
{noformat}
String host_temp = null;
try {
  host_temp = URLUtil.getDomainName(new URL(target));
} catch (Exception e) {
  host_temp = null;
}
URL url = null;
if (host_temp == null) {
  // it is an internal outlink
  url = URLUtil.resolveURL(base, target);
} else {
  // it is an external link
  url = new URL(target);
}
outlinks.add(new Outlink(url.toString(), linkText.toString().trim()));
{noformat}
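The reported behavior is reproducible with plain java.net.URL resolution, assuming URLUtil.resolveURL follows java.net.URL semantics (a schemeless target is treated as a relative path):
{code}
import java.net.URL;

public class OutlinkResolveDemo {
  public static void main(String[] args) throws Exception {
    URL base = new URL("http://www.example1.com/");
    // A schemeless target like "www.example2.com" resolves as a relative
    // path against the base, which reproduces the reported behavior:
    System.out.println(new URL(base, "www.example2.com"));
    // prints: http://www.example1.com/www.example2.com
  }
}
{code}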
[jira] [Assigned] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Lewis John McGibbney reassigned NUTCH-1042:
-------------------------------------------
    Assignee: Lewis John McGibbney  (was: Tejas Patil)

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Lewis John McGibbney
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558313#comment-13558313 ]

Lewis John McGibbney commented on NUTCH-1042:
---------------------------------------------

Hi Tejas, can you please link the issues? I am on a mobile browser and it is nearly impossible to do. Thank you

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Lewis John McGibbney
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
[jira] [Commented] (NUTCH-1042) Fetcher.max.crawl.delay property not taken into account correctly when set to -1
[ https://issues.apache.org/jira/browse/NUTCH-1042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558321#comment-13558321 ]

Tejas Patil commented on NUTCH-1042:
------------------------------------

Linked with NUTCH-1284.

Fetcher.max.crawl.delay property not taken into account correctly when set to -1
---------------------------------------------------------------------------------
Key: NUTCH-1042
URL: https://issues.apache.org/jira/browse/NUTCH-1042
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 1.3
Reporter: Nutch User - 1
Assignee: Lewis John McGibbney
Fix For: 1.7, 2.2

[Originally: http://lucene.472066.n3.nabble.com/A-possible-bug-or-misleading-documentation-td3162397.html]

From nutch-default.xml:
{noformat}
<property>
  <name>fetcher.max.crawl.delay</name>
  <value>30</value>
  <description>If the Crawl-Delay in robots.txt is set to greater than this
  value (in seconds) then the fetcher will skip this page, generating an error
  report. If set to -1 the fetcher will never skip such pages and will wait the
  amount of time retrieved from robots.txt Crawl-Delay, however long that
  might be.</description>
</property>
{noformat}
Fetcher.java (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/fetcher/Fetcher.java?view=markup), line 554:
{noformat}
this.maxCrawlDelay = conf.getInt("fetcher.max.crawl.delay", 30) * 1000;
{noformat}
Lines 615-616:
{noformat}
if (rules.getCrawlDelay() > 0) {
  if (rules.getCrawlDelay() > maxCrawlDelay) {
{noformat}
Now, the documentation states that, if fetcher.max.crawl.delay is set to -1, the crawler will always wait the amount of time the Crawl-Delay parameter specifies. However, as you can see, if it really is negative the condition on line 616 is always true, which leads to skipping the page whose Crawl-Delay is set.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558340#comment-13558340 ]

Ken Krugler commented on NUTCH-1031:
------------------------------------

Hi Tejas - I've looked at your patch, and (assuming there's not a requirement to support precedence in the user agent name list) it seems like a valid change. Based on the RFC (http://www.robotstxt.org/norobots-rfc.txt), robot names shouldn't have commas, so splitting on that seems safe. Do you have a unit test to verify proper behavior? If so, I'd be happy to roll that into CC. -- Ken

Delegate parsing of robots.txt to crawler-commons
-------------------------------------------------
Key: NUTCH-1031
URL: https://issues.apache.org/jira/browse/NUTCH-1031
Project: Nutch
Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
Labels: robots.txt
Fix For: 1.7
Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.
[jira] [Commented] (NUTCH-1031) Delegate parsing of robots.txt to crawler-commons
[ https://issues.apache.org/jira/browse/NUTCH-1031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558349#comment-13558349 ]

Tejas Patil commented on NUTCH-1031:
------------------------------------

Hi Ken, thanks for reviewing the patch. I will include a test case in the patch. Before that, a bigger question is whether Nutch should adopt the parsing model in CC and forget about the precedence. BTW: did you find any error in my understanding of how CC parses robots?

Delegate parsing of robots.txt to crawler-commons
-------------------------------------------------
Key: NUTCH-1031
URL: https://issues.apache.org/jira/browse/NUTCH-1031
Project: Nutch
Issue Type: Task
Reporter: Julien Nioche
Assignee: Tejas Patil
Priority: Minor
Labels: robots.txt
Fix For: 1.7
Attachments: CC.robots.multiple.agents.patch, NUTCH-1031.v1.patch

We're about to release the first version of Crawler-Commons [http://code.google.com/p/crawler-commons/] which contains a parser for robots.txt files. This parser should also be better than the one we currently have in Nutch. I will delegate this functionality to CC as soon as it is available publicly.
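A sketch of what such a test might look like. Hedged assumptions: the parseContent signature below matches the CC snapshot under discussion, and the patched parser accepts a comma-separated agent list; exact matching rules (e.g. case handling) may differ.
{code}
import static org.junit.Assert.assertFalse;
import static org.junit.Assert.assertTrue;

import org.junit.Test;

import crawlercommons.robots.BaseRobotRules;
import crawlercommons.robots.SimpleRobotRulesParser;

public class MultipleAgentsRobotsTest {

  @Test
  public void firstMatchingAgentRecordWins() throws Exception {
    String robotsTxt = "User-Agent: Agent1\n"
                     + "Disallow: /a\n"
                     + "\n"
                     + "User-Agent: Agent2 Agent3\n"
                     + "Disallow: /d\n";
    // Assumes the patched parseContent accepts "Agent2,Agent1".
    BaseRobotRules rules = new SimpleRobotRulesParser().parseContent(
        "http://example.com/robots.txt", robotsTxt.getBytes("UTF-8"),
        "text/plain", "Agent2,Agent1");
    // With the first-match model, Agent1's record applies:
    assertFalse(rules.isAllowed("http://example.com/a"));
    assertTrue(rules.isAllowed("http://example.com/d"));
  }
}
{code}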
[jira] [Commented] (NUTCH-1219) Upgrade all jobs to new MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13558476#comment-13558476 ]

lufeng commented on NUTCH-1219:
-------------------------------

Hi Markus, I see that Injector, Generator and Fetcher still use the old MapReduce API too. Should they also be upgraded to the new MR API?

Upgrade all jobs to new MapReduce API
-------------------------------------
Key: NUTCH-1219
URL: https://issues.apache.org/jira/browse/NUTCH-1219
Project: Nutch
Issue Type: Task
Reporter: Markus Jelsma
Priority: Critical
Fix For: 1.7

We should upgrade to the new Hadoop API for Nutch trunk, as has already been done for the Nutchgora branch. If I'm not mistaken, we can already upgrade to the latest 0.20.5 version that still carries the legacy API, so we can port the jobs to the new API without immediately upgrading to 0.21 or higher and without needing a separate branch to work on. To the committers who created/ported jobs in NutchGora, please write down your advice and experience. http://www.slideshare.net/sh1mmer/upgrading-to-the-new-map-reduce-api
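For anyone picking up one of these jobs, the shape of the port is the same everywhere: extend org.apache.hadoop.mapreduce.Mapper instead of implementing org.apache.hadoop.mapred.Mapper, and emit through the Context. A generic sketch of the new-API side (an illustration, not one of the actual Nutch jobs):
{code}
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class NewApiMapperSketch
    extends Mapper<LongWritable, Text, Text, LongWritable> {

  private static final LongWritable ONE = new LongWritable(1);

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // New API: emit through the Context instead of an OutputCollector.
    context.write(value, ONE);
  }
}
{code}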
[jira] [Updated] (NUTCH-1223) Migrate WebGraph to MapReduce API
[ https://issues.apache.org/jira/browse/NUTCH-1223?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

lufeng updated NUTCH-1223:
--------------------------
    Attachment: WebGraph_new_MR_API.patch

Patch migrating WebGraph to the new MR API.

Migrate WebGraph to MapReduce API
---------------------------------
Key: NUTCH-1223
URL: https://issues.apache.org/jira/browse/NUTCH-1223
Project: Nutch
Issue Type: Sub-task
Reporter: Markus Jelsma
Assignee: lufeng
Fix For: 1.7
Attachments: WebGraph_new_MR_API.patch