I made a urls file; I didn't realize that was what crawl referred to. I thought it would simply grab the URL from the conf\urlfilter.txt file.

bin/nutch crawl urls -dir crawled -depth 2 >& crawl.log

Well, at least now it ran, but with zero results. The urls file contains:
#+^http://([a-z0-9]*\.)*apache.org/
+^http://www.calpoly.edu/~acadprog/2005course.html
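
Possibly relevant to the zero results (an assumption, not something the log confirms): the file handed to crawl is normally a flat list of plain URLs, one per line, while lines starting with `+^` are regex patterns in the style of the URL filter file under conf. A minimal sketch of writing the seed file that way; the `write_seed_file` helper is hypothetical and only illustrates the file layout:

```shell
# Hypothetical helper: write plain seed URLs, one per line, to a file.
# Assumption: the crawl seed file wants bare URLs, not "+^..." filter patterns.
write_seed_file() {
  out="$1"
  shift
  : > "$out"                      # create the file, or truncate an existing one
  for url in "$@"; do
    printf '%s\n' "$url" >> "$out"
  done
}

write_seed_file urls 'http://www.calpoly.edu/~acadprog/2005course.html'
cat urls
```

If that assumption holds, the `+^http://...` pattern would stay in the filter file and only the bare URL would go into urls.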

<attached crawl.log>

Whenever I try to use -topN I get this error:
bin/nutch crawl urls -dir crawled -depth 2 -topN 1000

returns:
{blah}
060223 132142 crawl started in: crawled
060223 132142 rootUrlFile = 1000
060223 132142 threads = 10
060223 132142 depth = 2
060223 132142 Created webdb at LocalFS,C:\cygwin\home\falieson\nutch\crawled\db
Exception in thread "main" java.io.FileNotFoundException: 1000 (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
{blah}
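
The `rootUrlFile = 1000` line hints at how the options are being read. A rough sketch in shell, assuming (not verified against the Nutch source) that this build's crawl tool does not recognize `-topN` and treats every unrecognized argument as the URL file, so the last one wins. The `parse_crawl_args` function is hypothetical and only mimics that behavior:

```shell
# Hypothetical model of the option handling; this is not the actual Nutch code.
parse_crawl_args() {
  root_url_file=""
  dir=""
  depth=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      -dir)   dir="$2";   shift 2 ;;
      -depth) depth="$2"; shift 2 ;;
      *)      root_url_file="$1"; shift ;;   # anything else overwrites the URL file
    esac
  done
  echo "rootUrlFile = $root_url_file"
}

parse_crawl_args urls -dir crawled -depth 2 -topN 1000
```

Under that assumption the call prints `rootUrlFile = 1000`, matching the log above, and dropping `-topN 1000` gives `rootUrlFile = urls` again.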



On 2/23/06, [EMAIL PROTECTED] <[EMAIL PROTECTED]> wrote:
Hi Florian,


Where is your urls file located? If you created urls in the conf folder,
then you have to call:

bin/nutch crawl conf/urls -dir crawlresults/ -depth 2 -topN 1000
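
A quick way to confirm the location before running the crawl could look like the sketch below. The `check_seed_file` helper is hypothetical; it only tests whether the path exists relative to the directory the command is run from, which is what a FileNotFoundException on a relative path points to:

```shell
# Hypothetical pre-flight check: confirm the seed file exists relative to
# the current directory before handing its path to the crawl command.
check_seed_file() {
  if [ -f "$1" ]; then
    echo "found: $1"
  else
    echo "missing: $1 (looked in $(pwd))" >&2
    return 1
  fi
}

# Try the conf folder first, then the current directory.
check_seed_file conf/urls || check_seed_file urls || true
```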

Good luck

Detlev



I am running Cygwin (I know), with JDK 1.5.0 and Tomcat 4.1.
From Cygwin I run:

bin/nutch crawl urls -dir crawlresults/ -depth 2 - topN 1000

results:

run java in C:/program files/java/jdk1.5.0/
060223 123010 parsing file:/c:/cygwin/home/falieson/nutch/conf/nutch-default.xml
060223 123010 parsing file:/c:/cygwin/home/falieson/nutch/conf/crawl-tool.xml
060223 123010 parsing file:/c:/cygwin/home/falieson/nutch/conf/nutch-site.xml
060223 123010 No FS indicated, using default:local
060223 123010 rootUrlFile = 10000
060223 123010 thread = 10
060223 123010 depth = 2
060223 123011 Created webdb at LocalFS,C:\cygwin\home\falieson\crawlresults\db
Exception in thread "main" java.io.FileNotFoundException: 10000 <the system cannot find the file specified>
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileReader.<init>(FileReader.java:55)
    at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:372)
    at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
    at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)


~~~
bin/nutch crawl urls -dir crawled -depth 3

results:

run java in C:/program files/java/jdk1.5.0/
060223 123832 parsing file:/C:/cygwin/home/falieson/nutch/conf/nutch-default.xml
060223 123832 parsing file:/C:/cygwin/home/falieson/nutch/conf/crawl-tool.xml
060223 123832 parsing file:/C:/cygwin/home/falieson/nutch/conf/nutch-site.xml
060223 123832 No FS indicated, using default:local
060223 123832 crawl started in: crawled
060223 123832 rootUrlFile = urls
060223 123832 threads = 10
060223 123832 depth = 3
060223 123832 Created webdb at LocalFS,C:\cygwin\home\falieson\crawled\db
Exception in thread "main" java.io.FileNotFoundException: urls (The system cannot find the file specified)
    at java.io.FileInputStream.open(Native Method)
    at java.io.FileInputStream.<init>(FileInputStream.java:106)
    at java.io.FileReader.<init>(FileReader.java:55)
    at org.apache.nutch.db.WebDBInjector.injectURLFile(WebDBInjector.java:372)
    at org.apache.nutch.db.WebDBInjector.main(WebDBInjector.java:535)
    at org.apache.nutch.tools.CrawlTool.main(CrawlTool.java:134)


~~
TIA
--
Best Regards,
Florian Mettetal




