As you can see from the log below, the percentage complete stays very low until it suddenly jumps to fully complete. This started happening with some segments about a week ago; others go through their full list of ~10,000 URLs. It appears to occur whether I use a generate.max.per.host directive or leave it out. Plugins are as defined by default.
There are no errors logged at either the jobtracker or the tasktracker. It happens whether I use a datanode/namenode configuration or the local filesystem. A full log for this task is attached.

051110 214542 task_m_8pwl0q Parsing [http://www.nebrodibandb.it/chiesemonum.html] with [EMAIL PROTECTED]
051110 214543 task_m_8pwl0q Parsing [http://www.nyc-architecture.com/SOH/SOH017.htm] with [EMAIL PROTECTED]
051110 214543 task_m_8pwl0q Parsing [http://www.town.ocean-city.md.us/Recreation/Forms/CampRegistrationForm.html] with [EMAIL PROTECTED]
051110 214543 task_m_8pwl0q 0.0022044207% 470 pages, 71 errors, 9.4 pages/s, 781 kb/s,
051110 214544 task_m_8pwl0q 0.0022044207% 470 pages, 71 errors, 9.2 pages/s, 766 kb/s,
051110 214545 task_m_8pwl0q 0.0022044207% 470 pages, 71 errors, 9.0 pages/s, 751 kb/s,
051110 214546 task_m_8pwl0q 0.0022044207% 470 pages, 71 errors, 8.9 pages/s, 737 kb/s,
051110 214547 task_m_8pwl0q org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
051110 214547 task_m_8pwl0q     at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
051110 214547 task_m_8pwl0q     at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
051110 214547 task_m_8pwl0q     at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:182)
051110 214547 task_m_8pwl0q     at org.apache.nutch.crawl.Fetcher$FetcherThread.run(Fetcher.java:114)
051110 214547 task_m_8pwl0q fetch of http://www.thisisjersey.com/section/familynotices.html failed with: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
051110 214547 task_m_8pwl0q org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
051110 214547 task_m_8pwl0q     at org.apache.nutch.protocol.httpclient.Http.blockAddr(Http.java:133)
051110 214547 task_m_8pwl0q     at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:201)
051110 214547 task_m_8pwl0q     at org.apache.nutch.protocol.httpclient.Http.getProtocolOutput(Http.java:182)
051110 214547 task_m_8pwl0q     at org.apache.nutch.crawl.Fetcher$FetcherThread.run(Fetcher.java:114)
051110 214547 task_m_8pwl0q fetch of http://www.thisisjersey.com/section/sale.html failed with: org.apache.nutch.protocol.RetryLater: Exceeded http.max.delays: retry later.
051110 214547 task_m_8pwl0q Parsing [http://www.thisisjersey.com/itprofessionals/] with [EMAIL PROTECTED]
051110 214548 task_m_8pwl0q 0.0022044207% 471 pages, 73 errors, 8.7 pages/s, 727 kb/s,
051110 214549 task_m_8pwl0q 0.0022044207% 471 pages, 73 errors, 8.6 pages/s, 713 kb/s,
051110 214550 task_m_8pwl0q Parsing [http://www.geocities.com/redzombies/] with [EMAIL PROTECTED]
051110 214550 task_m_8pwl0q 0.0022044207% 471 pages, 73 errors, 8.6 pages/s, 713 kb/s,
051110 214551 task_m_8pwl0q 0.0022044207% 472 pages, 73 errors, 8.3 pages/s, 689 kb/s,
051110 214551 task_m_8pwl0q Parsing [http://www.communitytransport.com/events/2005/pdfs/brochure05.pdf] with [EMAIL PROTECTED]
051110 214552 task_m_8pwl0q 0.0022044207% 473 pages, 73 errors, 8.2 pages/s, 680 kb/s,
051110 214552 task_m_8pwl0q 0.0022044207% 473 pages, 73 errors, 8.2 pages/s, 680 kb/s,
051110 214552 Task task_m_8pwl0q is done.

--
Rod Taylor <[EMAIL PROTECTED]>
task.log.gz
Description: GNU Zip compressed data
