Hello,

I almost got it to work: it starts to crawl, but after some time it finishes. I have about 200,000 small HTML pages (2-3 KB each) under /nutch/data.
I only get 100 or 200 pages, and then it stops.

The command I am giving is:

# bin/nutch crawl urls -dir crawl -threads 10 -depth 2

The urls directory contains a urls file, which contains:

file:///nutch/data/

Any ideas?

-Thanks a bunch.

When I look at the logs, I see:

2006-09-03 18:27:04,231 INFO indexer.Indexer - Indexing [file:///nutch/data] with analyzer [EMAIL PROTECTED] (null)
2006-09-03 18:27:04,381 INFO indexer.Indexer - maxFieldLength 10000 reached, ignoring following tokens
2006-09-03 18:27:04,454 INFO  indexer.Indexer - Optimizing index.
2006-09-03 18:27:04,529 INFO indexer.Indexer - merging segments _0 (1 docs) into _1 (1 docs)
2006-09-03 18:27:05,442 INFO  indexer.Indexer - Indexer: done
2006-09-03 18:27:05,444 INFO  indexer.DeleteDuplicates - Dedup: starting
2006-09-03 18:27:05,459 INFO indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2006-09-03 18:27:08,042 INFO  indexer.DeleteDuplicates - Dedup: done
2006-09-03 18:27:08,043 INFO indexer.IndexMerger - Adding crawl/indexes/part-00000
2006-09-03 18:27:08,080 INFO  crawl.Crawl - crawl finished: crawl
2006-09-03 18:28:07,452 INFO crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb
2006-09-03 18:28:09,387 INFO crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb
2006-09-03 18:28:09,387 INFO  crawl.CrawlDbReader - TOTAL urls: 101
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - avg score:  1.008
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - max score:  1.009
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - min score:  1.0
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - retry 0:    1
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - retry 1:    100
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - status 1 (DB_unfetched): 100
2006-09-03 18:28:09,388 INFO crawl.CrawlDbReader - status 2 (DB_fetched): 1
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - CrawlDb statistics: done



Thomas Delnoij wrote:
Just add scoring-opic to your plugin.includes in nutch-site.xml.
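
A minimal sketch of what that change might look like, assuming the plugin.includes value quoted further down in this thread is otherwise left unchanged. The only addition here is scoring-opic at the end of the list; the rest of the value is copied from the original message and may differ in your own nutch-site.xml:

```xml
<!-- nutch-site.xml (sketch): the plugin list from the original message,
     with scoring-opic appended as suggested above -->
<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)|scoring-opic</value>
</property>
```

The value is a regular expression matched against plugin ids, so a new plugin is added as another alternative separated by "|".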

Rgrds, Thomas

On 9/1/06, Cam Bazz <[EMAIL PROTECTED]> wrote:
Hello,

I wanted to index my files so I followed the instructions at

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch

I get: Exception in thread "main" java.io.IOException: Job failed!

and looking at the log file:

2006-09-01 01:49:43,166 WARN  mapred.LocalJobRunner - job_p2pnnk
java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
        at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:84)
        at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)


My plugin.includes looks like this:

<property>
  <name>plugin.includes</name>
  <value>protocol-file|protocol-http|urlfilter-regex|parse-(xml|text|html|js|pdf)|index-basic|query-(basic|site|url)</value>
</property>

How can I add a scoring plugin? By default we don't have to add one,
so I don't know where to start.

Any ideas appreciated,

Best Regards,
-C.B.





