Re: [Nutch-dev] 0.8.1

2006-09-24 Thread Sami Siren
I just wrapped up the 0.8.1 release; a sneak preview is temporarily available
at http://people.apache.org/~siren/nutch-0.8.1/

I'll update the website and announce it after it has hit the mirrors,
provided nothing serious is found in it in the following 48 hrs.

--
  Sami Siren

Andrzej Bialecki wrote:
> Sami Siren wrote:
>> FYI
>>
>> I'll roll out nutch 0.8.1 later this week to release fix for couple of
>> severe problems in 0.8.
>
> There are a couple issues that have to make it into this release,
> related to serious bugs in scoring - I plan to commit them by the end of
> the week, so please hold on until I'm done.
 


-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys -- and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
___
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers


[Nutch-dev] [jira] Closed: (NUTCH-266) hadoop bug when doing updatedb

2006-09-24 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-266?page=all ]

Sami Siren closed NUTCH-266.



 hadoop bug when doing updatedb
 --

 Key: NUTCH-266
 URL: http://issues.apache.org/jira/browse/NUTCH-266
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
 Environment: windows xp, JDK 1.4.2_04
Reporter: Eugen Kochuev
 Fix For: 0.9.0, 0.8.1

 Attachments: patch.diff, patch_hadoop-0.5.0.diff


 I constantly get the following error message
 060508 230637 Running job: job_pbhn3t
 060508 230637 
 c:/nutch/crawl-20060508230625/crawldb/current/part-0/data:0+245
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_fetch/part-0/data:0+296
 060508 230637 
 c:/nutch/crawl-20060508230625/segments/20060508230628/crawl_parse/part-0:0+5258
 060508 230637 job_pbhn3t
 java.io.IOException: Target 
 /tmp/hadoop/mapred/local/reduce_qnd5sx/map_qjp7tf.out already exists
 at org.apache.hadoop.fs.FileUtil.checkDest(FileUtil.java:162)
 at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:62)
 at 
 org.apache.hadoop.fs.LocalFileSystem.renameRaw(LocalFileSystem.java:191)
 at org.apache.hadoop.fs.FileSystem.rename(FileSystem.java:306)
 at 
 org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:101)
 Exception in thread "main" java.io.IOException: Job failed!
 at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:341)
 at org.apache.nutch.crawl.CrawlDb.update(CrawlDb.java:54)
 at org.apache.nutch.crawl.Crawl.main(Crawl.java:114)

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira





[Nutch-dev] [jira] Closed: (NUTCH-105) Network error during robots.txt fetch causes file to be ignored

2006-09-24 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-105?page=all ]

Sami Siren closed NUTCH-105.



 Network error during robots.txt fetch causes file to be ignored
 ---

 Key: NUTCH-105
 URL: http://issues.apache.org/jira/browse/NUTCH-105
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8, 0.8.1, 0.9.0
Reporter: Rod Taylor
Priority: Critical
 Fix For: 0.8.1, 0.9.0

 Attachments: RobotRulesParser.patch


 Earlier we had a small network glitch which prevented us from retrieving
 the robots.txt file for a site we were crawling at the time:
 nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193021
 task_m_h02y5t  Couldn't get robots.txt for
 http://www.japanesetranslator.co.uk/portfolio/:
 org.apache.commons.httpclient.ConnectTimeoutException: The host
 did not accept the connection within timeout of 1 ms
 nutch-root-tasktracker-sbider1.sitebuildit.com.log:051005 193031
 task_m_h02y5t  Couldn't get robots.txt for
 http://www.japanesetranslator.co.uk/translation/:
 org.apache.commons.httpclient.ConnectTimeoutException: The host
 did not accept the connection within timeout of 1 ms
 Nutch then assumed that, because we were unable to retrieve the file due
 to network issues, it didn't exist and we could crawl the entire
 website. Nutch then successfully grabbed a few pages which were listed
 in the robots.txt as being disallowed.
 I think Nutch should continue attempting to retrieve the robots.txt file
 until, at the very least, we are able to establish a connection to the host;
 otherwise the host should be ignored until the next round of fetches.
 The webmaster of japanesetranslator.co.uk filed a complaint informing us
 of the issue.
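
The behaviour requested above comes down to treating "robots.txt does not exist" and "robots.txt could not be fetched" as different outcomes. A minimal Java sketch of that decision follows. It is not the attached RobotRulesParser.patch, and every name in it (RobotsFetchPolicy, Outcome, Decision, decide) is hypothetical and chosen only for illustration.

```java
// Illustrative only: hypothetical names, not Nutch's API or the attached patch.
class RobotsFetchPolicy {

    // What happened when the crawler asked the host for robots.txt.
    enum Outcome { FETCHED, NOT_FOUND, NETWORK_ERROR }

    // What the crawler should do with the host's pages in this round.
    enum Decision { APPLY_RULES, ALLOW_ALL, SKIP_HOST_THIS_ROUND }

    static Decision decide(Outcome outcome) {
        switch (outcome) {
            case FETCHED:
                // The file was retrieved, so its rules are obeyed.
                return Decision.APPLY_RULES;
            case NOT_FOUND:
                // The host answered and has no robots.txt: nothing is disallowed.
                return Decision.ALLOW_ALL;
            case NETWORK_ERROR:
                // No connection was established, so nothing is known about the
                // rules; leave the host alone until the next round of fetches.
                return Decision.SKIP_HOST_THIS_ROUND;
            default:
                throw new IllegalArgumentException("unknown outcome: " + outcome);
        }
    }

    public static void main(String[] args) {
        for (Outcome o : Outcome.values()) {
            System.out.println(o + " -> " + decide(o));
        }
    }
}
```

The point of the sketch is only that a connect timeout maps to SKIP_HOST_THIS_ROUND rather than ALLOW_ALL, which is the case the report describes.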



[Nutch-dev] [jira] Closed: (NUTCH-338) Remove the text parser as an option for parsing PDF files in parse-plugins.xml

2006-09-24 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-338?page=all ]

Sami Siren closed NUTCH-338.



 Remove the text parser as an option for parsing PDF files in parse-plugins.xml
 --

 Key: NUTCH-338
 URL: http://issues.apache.org/jira/browse/NUTCH-338
 Project: Nutch
  Issue Type: Improvement
  Components: fetcher
Affects Versions: 0.8
 Environment: Mac Book Pro Dual Core Intel 2.1 Ghz, although 
 improvement is independent of environment
Reporter: Chris A. Mattmann
 Assigned To: Chris A. Mattmann
Priority: Trivial
 Fix For: 0.9.0, 0.8.1

 Attachments: NUTCH-338.Mattmann.patch.txt


 After some discussion on the mailing list, it was decided that parse-text 
 should not really be an option to parse PDF content. So, this issue includes 
 a trivial patch to remove the parse-text plugin from being mapped to PDF 
 content in parse-plugins.xml.



[Nutch-dev] [jira] Closed: (NUTCH-344) Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks

2006-09-24 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-344?page=all ]

Sami Siren closed NUTCH-344.



 Fetcher threads blocked on synchronized block in cleanExpiredServerBlocks
 -

 Key: NUTCH-344
 URL: http://issues.apache.org/jira/browse/NUTCH-344
 Project: Nutch
  Issue Type: Bug
  Components: fetcher
Affects Versions: 0.8.1, 0.9.0, 0.8
 Environment: All
Reporter: Greg Kim
 Fix For: 0.8.1, 0.9.0

 Attachments: cleanExpiredServerBlocks.patch, HttpBase.patch


 The recent change to the following code in HttpBase.java tends to block 
 fetcher threads while one thread busy-waits:
   private static void cleanExpiredServerBlocks() {
     synchronized (BLOCKED_ADDR_TO_TIME) {
       while (!BLOCKED_ADDR_QUEUE.isEmpty()) {   // <== LINE 3
         String host = (String) BLOCKED_ADDR_QUEUE.getLast();
         long time = ((Long) BLOCKED_ADDR_TO_TIME.get(host)).longValue();
         if (time <= System.currentTimeMillis()) {
           BLOCKED_ADDR_TO_TIME.remove(host);
           BLOCKED_ADDR_QUEUE.removeLast();
         }
       }
     }
   }
 LINE 3: As long as there are *any* entries in the BLOCKED_ADDR_QUEUE, the 
 thread that first enters this block busy-waits until it becomes empty while 
 all other threads block on the synchronized block.  This leads to extremely 
 poor fetcher performance.  
 Since the checkin to respect crawlDelay in robots.txt, we are no longer 
 guaranteed that BLOCKED_ADDR_TO_TIME queue is a FIFO list. The simple fix is 
 to iterate the queue once rather than busy-waiting...
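
The "iterate the queue once" idea can be sketched as below. This is a self-contained illustration and not the attached cleanExpiredServerBlocks.patch or HttpBase.patch: the class name ServerBlockCleaner, the block() and blockedCount() helpers, and the explicit nowMillis parameter are assumptions made so the example runs on its own. Only the two collection names are taken from the report.

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.Map;

// Illustrative stand-in for the blocking state described in the report.
class ServerBlockCleaner {
    private static final LinkedList<String> BLOCKED_ADDR_QUEUE = new LinkedList<>();
    private static final Map<String, Long> BLOCKED_ADDR_TO_TIME = new HashMap<>();

    // Hypothetical helper: mark a host as blocked until the given time.
    static void block(String host, long untilMillis) {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            BLOCKED_ADDR_QUEUE.addFirst(host);
            BLOCKED_ADDR_TO_TIME.put(host, untilMillis);
        }
    }

    // Single pass: remove every expired entry, keep the rest, then return.
    // The lock is held for one traversal only, so other threads are not
    // stuck behind a loop that waits for the queue to drain.
    static int cleanExpiredServerBlocks(long nowMillis) {
        int removed = 0;
        synchronized (BLOCKED_ADDR_TO_TIME) {
            Iterator<String> it = BLOCKED_ADDR_QUEUE.iterator();
            while (it.hasNext()) {
                String host = it.next();
                Long expiry = BLOCKED_ADDR_TO_TIME.get(host);
                if (expiry == null || expiry <= nowMillis) {
                    BLOCKED_ADDR_TO_TIME.remove(host);
                    it.remove();
                    removed++;
                }
            }
        }
        return removed;
    }

    // Hypothetical helper: number of hosts still blocked.
    static int blockedCount() {
        synchronized (BLOCKED_ADDR_TO_TIME) {
            return BLOCKED_ADDR_QUEUE.size();
        }
    }

    public static void main(String[] args) {
        block("a.example", 100L);
        block("b.example", 500L); // longer delay: expiry order is not FIFO
        block("c.example", 200L);
        System.out.println(cleanExpiredServerBlocks(250L)); // 2 (a and c)
        System.out.println(blockedCount());                 // 1 (b remains)
    }
}
```

Because the pass checks every entry rather than only the tail, it does not depend on the queue being ordered by expiry time, which is the property the report says was lost once per-host crawl delays were respected.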



[Nutch-dev] [jira] Closed: (NUTCH-318) log4j not proper configured, readdb doesnt give any information

2006-09-24 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-318?page=all ]

Sami Siren closed NUTCH-318.



 log4j not proper configured, readdb doesnt give any information
 ---

 Key: NUTCH-318
 URL: http://issues.apache.org/jira/browse/NUTCH-318
 Project: Nutch
  Issue Type: Bug
Affects Versions: 0.8
Reporter: Stefan Groschupf
 Assigned To: Sami Siren
Priority: Critical
 Fix For: 0.9.0, 0.8.1


 In the latest 0.8 sources the readdb command doesn't dump any information 
 anymore. 
 This is related to the misconfigured log4j.properties file. 
 Changing:
 log4j.rootLogger=INFO,DRFA
 to:
 log4j.rootLogger=INFO,DRFA,stdout
 dumps the information to the console, but not in a nice way. 
 What makes me wonder is that this information should also be in the log 
 file, but it isn't, so there may be problems here as well.
 Also, what is the difference between hadoop-XXX-jobtracker-XXX.out and 
 hadoop-XXX-jobtracker-XXX.log? Shouldn't there be just one of them?



[Nutch-dev] [jira] Closed: (NUTCH-370) Generator looses urls when run with LocalJobRunner

2006-09-24 Thread Sami Siren (JIRA)
 [ http://issues.apache.org/jira/browse/NUTCH-370?page=all ]

Sami Siren closed NUTCH-370.


Resolution: Duplicate

actually this is a duplicate of #361

 Generator looses urls when run with LocalJobRunner
 --

 Key: NUTCH-370
 URL: http://issues.apache.org/jira/browse/NUTCH-370
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 0.8, 0.9.0, 0.8.1
 Environment: linux
Reporter: Sami Siren
 Assigned To: Sami Siren

 When the generator is run with LocalJobRunner, some of the generated URLs are 
 lost. This is because two map outputs are created and only one of them is 
 processed in the reduce phase.
 When -numFetchers 1 is given on the command line, the problem goes away.
