Re: [Nutch-dev] 0.8.1
I just wrapped up the 0.8.1 release; a sneak preview is temporarily available at
http://people.apache.org/~siren/nutch-0.8.1/

I'll update the website and announce the release after it has hit the mirrors, provided nothing serious is found in it in the following 48 hrs.

--
 Sami Siren

Andrzej Bialecki wrote:
> Sami Siren wrote:
>> FYI I'll roll out nutch 0.8.1 later this week to release fixes for a couple of severe problems in 0.8.
>
> There are a couple of issues that have to make it into this release, related to serious bugs in scoring - I plan to commit them by the end of the week, so please hold on until I'm done.

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your opinions on IT business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
[Nutch-dev] [jira] Closed: (NUTCH-266) hadoop bug when doing updatedb
[ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Sami Siren closed NUTCH-266.

hadoop bug when doing updatedb
------------------------------

Key: NUTCH-266
URL: http://issues.apache.org/jira/browse/NUTCH-266
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev
Fix For: 0.9.0, 0.8.1
Attachments: patch.diff, patch_hadoop-0.5.0.diff

I constantly get the following error message:

060508 230637 Running job: job_pbhn3t
060508 230637 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
060508 230637 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
060508 230637 job_pbhn3t
java.io.IOException: Target /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
    at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
    at org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
    at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
Exception in thread "main" java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
    at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
[Nutch-dev] [jira] Closed: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored
[ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]

Sami Siren closed NUTCH-105.

Network error during robots.txt fetch causes file to be ignored
---------------------------------------------------------------

Key: NUTCH-105
URL: http://issues.apache.org/jira/browse/NUTCH-105
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Reporter: Rod Taylor
Priority: Critical
Fix For: 0.8.1, 0.9.0
Attachments: RobotRulesParser.patch

Earlier we had a small network glitch which prevented us from retrieving the robots.txt file for a site we were crawling at the time:

nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193021 task_m_h02y5t Couldn't get robots.txt for http://www.japanesetranslator.co.uk/portfolio/: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 1 ms
nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193031 task_m_h02y5t Couldn't get robots.txt for http://www.japanesetranslator.co.uk/translation/: org.apache.commons.httpclient.ConnectTimeoutException: The host did not accept the connection within timeout of 1 ms

Because we were unable to retrieve the file due to network issues, Nutch assumed that it didn't exist and that we could crawl the entire website. Nutch then successfully grabbed a few pages which were listed in the robots.txt as being disallowed.

I think Nutch should continue attempting to retrieve the robots.txt file until, at the very least, we are able to establish a connection to the host; otherwise the host should be ignored until the next round of fetches.

The webmaster of japanesetranslator.co.uk filed a complaint informing us of the issue.
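The behaviour requested in NUTCH-105 can be illustrated with a small decision table. This is only a sketch of the idea, not the contents of RobotRulesParser.patch: the class RobotsFetchPolicy, its enums and the decide method are hypothetical names introduced here for illustration and are not part of the Nutch API. It assumes the usual convention that a robots.txt that is confirmed missing means "no restrictions", while a robots.txt that could not be fetched because of a network error tells us nothing and should not be treated as permission to crawl.

```java
// Hypothetical illustration of the policy requested in NUTCH-105.
// None of these names exist in Nutch; they are assumptions for this sketch.
final class RobotsFetchPolicy {

    /** What happened when we tried to fetch robots.txt for a host. */
    enum Outcome { FETCHED, NOT_FOUND, NETWORK_ERROR }

    /** What the fetcher should do with URLs on that host. */
    enum Decision { APPLY_RULES, ALLOW_ALL, DEFER_HOST }

    static Decision decide(Outcome outcome) {
        switch (outcome) {
            case FETCHED:
                // The file was retrieved: honour its allow/disallow rules.
                return Decision.APPLY_RULES;
            case NOT_FOUND:
                // The host answered and has no robots.txt: nothing is disallowed.
                return Decision.ALLOW_ALL;
            case NETWORK_ERROR:
                // No connection was established, so the rules are unknown.
                // Skip the host until the next round of fetches.
                return Decision.DEFER_HOST;
            default:
                throw new IllegalArgumentException("unknown outcome: " + outcome);
        }
    }

    public static void main(String[] args) {
        for (Outcome o : Outcome.values()) {
            System.out.println(o + " -> " + decide(o));
        }
    }
}
```

The point of keeping NOT_FOUND and NETWORK_ERROR as separate outcomes is that the bug described above came from collapsing the two into "file does not exist".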
[Nutch-dev] [jira] Closed: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml
[ http://issues.apache.org/jira/browse/NUTCH-338?page=all ]

Sami Siren closed NUTCH-338.

Remove the text parser as an option for parsing PDF files in parse-plugins.xml
------------------------------------------------------------------------------

Key: NUTCH-338
URL: http://issues.apache.org/jira/browse/NUTCH-338
Project: Nutch
Issue Type: Improvement
Components: fetcher
Affects Versions: 0.8
Environment: Mac Book Pro Dual Core Intel 2.1 GHz, although improvement is independent of environment
Reporter: Chris A. Mattmann
Assigned To: Chris A. Mattmann
Priority: Trivial
Fix For: 0.9.0, 0.8.1
Attachments: NUTCH-338.Mattmann.patch.txt

After some discussion on the mailing list, it was decided that parse-text should not really be an option to parse PDF content. So, this issue includes a trivial patch to remove the parse-text plugin from being mapped to PDF content in parse-plugins.xml.
[Nutch-dev] [jira] Closed: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
[ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Sami Siren closed NUTCH-344.

Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
-------------------------------------------------------------------------

Key: NUTCH-344
URL: http://issues.apache.org/jira/browse/NUTCH-344
Project: Nutch
Issue Type: Bug
Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 0.8
Environment: All
Reporter: Greg Kim
Fix For: 0.8.1, 0.9.0
Attachments: cleanExpiredServerBlocks.patch, HttpBase.patch

Since the recent change, the following code in HttpBase.java tends to block fetcher threads while one thread busy-waits:

    private static void cleanExpiredServerBlocks() {
      synchronized (BLOCKED_ADDR_TO_TIME) {
        while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   // <== LINE 3
          String host = (String) BLOCKED_ADDR_QUEUE.getLast();
          long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
          if (time <= System.currentTimeMillis()) {
            BLOCKED_ADDR_TO_TIME.remove(host);
            BLOCKED_ADDR_QUEUE.removeLast();
          }
        }
      }
    }

LINE 3: As long as there are *any* entries in the BLOCKED_ADDR_QUEUE, the thread that first enters this block busy-waits until it becomes empty, while all other threads block on the synchronized block. This leads to extremely poor fetcher performance.

Since the checkin to respect crawlDelay in robots.txt, we are no longer guaranteed that the BLOCKED_ADDR_TO_TIME queue is a FIFO list. The simple fix is to iterate the queue once rather than busy-waiting.
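The single-pass fix proposed in NUTCH-344 could look roughly like the sketch below. It is not the attached cleanExpiredServerBlocks.patch: the class name ServerBlockCleaner, the blockAddr helper and the nowMillis parameter are assumptions added so that the example is self-contained and testable. Only the two collection names and the method name come from the report. The loop visits each queued host exactly once and removes the expired ones, so it terminates even when the queue is not ordered by expiry time.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

// Hypothetical stand-in for the relevant part of HttpBase; not Nutch code.
final class ServerBlockCleaner {

    // host -> time (in ms) at which the block on that host expires
    static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();
    // hosts that are currently blocked, in no guaranteed expiry order
    static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();

    // Assumed helper: registers a block that expires at expiryMillis.
    static void blockAddr(String host, long expiryMillis) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            // Queue the host only if it was not already blocked.
            if (BLOCKED_ADDR_TO_TIME.put(host, expiryMillis) == null) {
                BLOCKED_ADDR_QUEUE.addFirst(host);
            }
        }
    }

    // Walks the queue once and drops every expired block.
    // Returns the number of hosts that were unblocked.
    static int cleanExpiredServerBlocks(long nowMillis) {
        int removed = 0;
        synchronized (BLOCKED_ADDR_TO_TIME) {
            Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
            while (it.hasNext()) {
                String host = it.next();
                Long time = BLOCKED_ADDR_TO_TIME.get(host);
                if (time == null || time <= nowMillis) {
                    BLOCKED_ADDR_TO_TIME.remove(host);
                    it.remove(); // safe removal while iterating
                    removed++;
                }
                // Unexpired entries are skipped, not waited on.
            }
        }
        return removed;
    }

    public static void main(String[] args) {
        blockAddr("a.example", 100L);
        blockAddr("b.example", 5000L);
        blockAddr("c.example", 200L);
        int removed = cleanExpiredServerBlocks(1000L);
        System.out.println("removed " + removed + ", still blocked " + BLOCKED_ADDR_QUEUE);
    }
}
```

Passing the current time as a parameter instead of calling System.currentTimeMillis() inside the method is a choice made here purely to keep the example deterministic. The lock is held for one bounded pass over the queue, so other fetcher threads wait at most for that pass rather than until the queue drains.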
[Nutch-dev] [jira] Closed: (NUTCH-318) log4j not proper configured, readdb doesnt give any information
[ http://issues.apache.org/jira/browse/NUTCH-318?page=all ]

Sami Siren closed NUTCH-318.

log4j not proper configured, readdb doesnt give any information
---------------------------------------------------------------

Key: NUTCH-318
URL: http://issues.apache.org/jira/browse/NUTCH-318
Project: Nutch
Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
Assigned To: Sami Siren
Priority: Critical
Fix For: 0.9.0, 0.8.1

In the latest 0.8 sources the readdb command doesn't dump any information anymore. This is related to the misconfigured log4j.properties file.

Changing:

    log4j.rootLogger=INFO,DRFA

to:

    log4j.rootLogger=INFO,DRFA,stdout

dumps the information to the console, but not in a nice way.

What makes me wonder is that this information should also be in the log file, but it isn't, so there may be further problems here.

Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?
[Nutch-dev] [jira] Closed: (NUTCH-370) Generator looses urls when run with LocalJobRunner
[ http://issues.apache.org/jira/browse/NUTCH-370?page=all ]

Sami Siren closed NUTCH-370.

Resolution: Duplicate

Actually this is a duplicate of #361.

Generator looses urls when run with LocalJobRunner
--------------------------------------------------

Key: NUTCH-370
URL: http://issues.apache.org/jira/browse/NUTCH-370
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 0.8, 0.9.0, 0.8.1
Environment: linux
Reporter: Sami Siren
Assigned To: Sami Siren

When the generator is run with LocalJobRunner, some of the generated URLs get lost. This is because two map outputs are created and only one of them is processed in the reduce phase. When -numFetchers 1 is provided as a command-line parameter, the problem goes away.