Hi,

sorry for the fumbled reply. I've tried deleting the
directory and starting the crawl from scratch a number
of times, with very similar results.

The system seems to throw the exception just after the
fetch block of the output, at an apparently arbitrary
depth. It leaves the crawl directory with a db folder
containing:

Mar  2 09:30 dbreadlock
Mar  2 09:31 dbwritelock
Mar  2 09:30 webdb
Mar  2 09:31 webdb.new

The webdb.new folder contains:

Mar  2 09:30 pagesByURL
Mar  2 09:30 stats
Mar  2 09:31 tmp
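
As far as I can tell, the "already exists" IOException comes from a writer that refuses to create its output directory when one is already present, presumably so a crashed run can't silently clobber data, which would explain why a leftover webdb.new trips it up. A hypothetical sketch of that guard (the class and method names here are made up for illustration; this is not the actual Nutch source):

```java
import java.io.File;
import java.io.IOException;

public class MapFileGuardSketch {
    // Hypothetical guard: refuse to create an output directory that a
    // previous (possibly crashed) run has left behind.
    static void createOutputDir(File dir) throws IOException {
        if (dir.exists()) {
            throw new IOException("already exists: " + dir);
        }
        if (!dir.mkdirs()) {
            throw new IOException("could not create: " + dir);
        }
    }

    public static void main(String[] args) throws IOException {
        File dir = new File("webdb.new-demo");
        createOutputDir(dir);         // first run: succeeds
        try {
            createOutputDir(dir);     // second run: hits the guard
        } catch (IOException e) {
            System.out.println(e.getMessage());
        } finally {
            dir.delete();             // clean up the demo directory
        }
    }
}
```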

I have the following set in my nutch-site.xml file:

<property>
  <name>urlnormalizer.class</name>
  <value>org.apache.nutch.net.RegexUrlNormalizer</value>
  <description>Name of the class used to normalize URLs.</description>
</property>

<property>
  <name>urlnormalizer.regex.file</name>
  <value>regex-normalize.xml</value>
  <description>Name of the config file used by the
  RegexUrlNormalizer class.</description>
</property>

<property>
  <name>http.content.limit</name>
  <value>-1</value>
  <description>The length limit for downloaded content, in bytes.
  If this value is nonnegative (>=0), content longer than it will be
  truncated; otherwise, no truncation is performed.
  </description>
</property>

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html|pdf)|index-basic|query-(basic|site|url)</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you must at least include the nutch-extensionpoints
  plugin. By default Nutch includes crawling just HTML and plain
  text via HTTP, and basic indexing and search plugins.
  </description>
</property>

I don't think any of this should cause the problem. 
I'm going to try reinstalling and setting everything
up again, but if anyone has any idea what the problem
might be then please let me know.
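
Before the full reinstall, one thing I may try is removing only the stale webdb.new folder (and the lock files) from the db directory and re-running the crawl, on the assumption that a crashed update left them behind and they are what the next run trips over. A rough sketch of that cleanup (the layout and the recovery idea are my assumptions, not documented Nutch behaviour):

```java
import java.io.File;

public class CleanStaleDb {
    // Recursively delete a file or directory tree.
    static void deleteTree(File f) {
        File[] children = f.listFiles();
        if (children != null) {
            for (File child : children) {
                deleteTree(child);
            }
        }
        f.delete();
    }

    public static void main(String[] args) {
        // Assumed layout: <crawlDir>/db/{webdb.new, dbreadlock, dbwritelock}
        File db = new File(args.length > 0 ? args[0] : "db");
        for (String stale : new String[] {"webdb.new", "dbreadlock", "dbwritelock"}) {
            File f = new File(db, stale);
            if (f.exists()) {
                deleteTree(f);
                System.out.println("removed " + f);
            }
        }
    }
}
```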

cheers,


Julian.


--- sudhendra seshachala <[EMAIL PROTECTED]> wrote:

> Delete the folder/database and then re-issue the
> crawl command.
>   The database/folder gets created when Crawl is
> used. 
>   I am a recent user too, but I got the same
> message and corrected it by deleting the folder. If
> anyone has better ideas, please share.
>    
>   Thanks
>    
>   [EMAIL PROTECTED] wrote:
>   Hi,
> 
> I've been experimenting with nutch and lucene,
> everything was working fine, but now I'm getting an
> exception thrown from the crawl command.
> 
> The command manages a few fetch cycles but then I
> get
> the following message:
> 
> 060301 161128 status: segment 20060301161046, 38 pages, 0 errors, 856591 bytes, 41199 ms
> 060301 161128 status: 0.92235243 pages/s, 162.43396 kb/s, 22541.87 bytes/page
> 060301 161129 Updating C:\PF\nutch-0.7.1\LIVE\db
> 060301 161129 Updating for C:\PF\nutch-0.7.1\LIVE\segments\20060301161046
> 060301 161129 Processing document 0
> 060301 161130 Finishing update
> 060301 161130 Processing pagesByURL: Sorted 952 instructions in 0.02 seconds.
> 060301 161130 Processing pagesByURL: Sorted 47600.0 instructions/second
> java.io.IOException: already exists:
> C:\PF\nutch-0.7.1\LIVE\db\webdb.new\pagesByURL
>         at org.apache.nutch.io.MapFile$Writer.<init>(MapFile.java:86)
>         at org.apache.nutch.db.WebDBWriter$CloseProcessor.closeDown(WebDBWriter.java:549)
>         at org.apache.nutch.db.WebDBWriter.close(WebDBWriter.java:1544)
>         at org.apache.nutch.tools.UpdateDatabaseTool.close(UpdateDatabaseTool.java:321)
>         at org.apache.nutch.tools.UpdateDatabaseTool.main(UpdateDatabaseTool.java:371)
>         at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:141)
> Exception in thread "main"
> 
> Does anyone have any idea what the problem is
> likely to be? I am running Nutch 0.7.1.
> 
> thanks,
> 
> 
> Julian.
> 
> 
> 
>   Sudhi Seshachala
>   http://sudhilogs.blogspot.com/
>    
> 
> 



_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
