Okay.
  Have you tried the 0.8 version? It seems more stable than the 0.7.x
you are using.
  It is a bit different too, with Hadoop and Nutch being separate.
  I had a few issues using 0.7.x, but with the nightly build (0.8) I was
up to speed comparatively sooner.
   
  I hope this helps. I am not trying to dodge the problem, just that the
next release is more stable and, moreover, 0.8.x has no backward
compatibility (that is what I read in one of the mails in the archive).
You are better off using 0.8.
   
  Thanks
  Sudhi
   
  

[EMAIL PROTECTED] wrote:
  Hi,

Sorry for the fumbled reply. I've tried deleting the
directory and starting the crawl from scratch a number
of times, with very similar results.
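
For reference, a from-scratch run here means roughly the standard
0.7.x crawl invocation; the urls file, directory name, and depth
below are placeholders rather than my exact values:

rm -rf LIVE
bin/nutch crawl urls.txt -dir LIVE -depth 5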

The system seems to be generating the exception after
the fetch block of the output, at an apparently
arbitrary depth. It leaves the directory with a db
folder containing:

Mar 2 09:30 dbreadlock
Mar 2 09:31 dbwritelock
Mar 2 09:30 webdb
Mar 2 09:31 webdb.new

The webdb.new folder contains:

Mar 2 09:30 pagesByURL
Mar 2 09:30 stats
Mar 2 09:31 tmp
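
Since the exception complains that webdb.new\pagesByURL already
exists, I assume these are leftovers of the failed update, so before
re-running I have also tried clearing them by hand (just a guess at a
remedy, not a documented procedure; run from the LIVE directory):

rm -rf db/webdb.new
rm -f db/dbreadlock db/dbwritelock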

I have the following set in my nutch-site.xml file:

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.RegexUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.
  </description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the
  RegexUrlNormalizer class.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation at all.</description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain
  text via HTTP, and basic indexing and search plugins.</description>
</property>
I don't think any of this should cause the problem. 
I'm going to try reinstalling and setting everything
up again, but if anyone has any idea what the problem
might be then please let me know.

cheers,


Julian.


--- sudhendra seshachala wrote:

> Delete the folder/database and then re-issue the
> crawl command.
> The database/folder gets created when crawl is
> used.
> I am a recent user too... but I did get the same
> message and corrected it by deleting the folder. If
> anyone has better ideas, please share.
> 
> Thanks
> 
> [EMAIL PROTECTED] wrote:
> Hi,
> 
> I've been experimenting with nutch and lucene,
> everything was working fine, but now I'm getting an
> exception thrown from the crawl command.
> 
> The command manages a few fetch cycles but then I
> get
> the following message:
> 
> 060301 161128 status: segment 20060301161046, 38
> pages, 0 errors, 856591 bytes, 41199 ms
> 060301 161128 status: 0.92235243 pages/s, 162.43396
> kb/s, 22541.87 bytes/page
> 060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
> 060301 161129 Updating for
> C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
> 060301 161129 Processing document 0
> 060301 161130 Finishing update
> 060301 161130 Processing pagesByURL: Sorted 952
> instructions in 0.02 seconds.
> 060301 161130 Processing pagesByURL: Sorted 47600.0
> instructions/second
> java.io.IOException: already exists:
> C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
>   at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>   at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>   at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>   at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
>   at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
>   at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
> Exception in thread "main"
> 
> Does anyone have any idea what the problem is
> likely to be? I am running Nutch 0.7.1.
> 
> thanks,
> 
> 
> Julian.
> 
> 
> 
> Sudhi Seshachala
> http://sudhilogs.blogspot.com/
> 
> 
> 
> 




  Sudhi Seshachala
  http://sudhilogs.blogspot.com/
   


                