Re: nutch protocol-file

Cam Bazz Sun, 03 Sep 2006 08:26:32 -0700

Hello,

I almost got it to work: it starts to crawl, but after some time it finishes. I have something like 200,000 small HTML pages (2-3 KB each) under /nutch/data.

I am only getting around 100 or 200 pages, then it stops.


The command I am giving is:

# bin/nutch crawl urls -dir crawl -threads 10 -depth 2

The urls directory contains a urls file, which contains:

file:///nutch/data/

Any ideas?

-Thanks a bunch.

When I look at the logs, I see:

2006-09-03 18:27:04,231 INFO  indexer.Indexer - Indexing [file:///nutch/data] with analyzer [EMAIL PROTECTED] (null)
2006-09-03 18:27:04,381 INFO  indexer.Indexer - maxFieldLength 10000 reached, ignoring following tokens
2006-09-03 18:27:04,454 INFO  indexer.Indexer - Optimizing index.
2006-09-03 18:27:04,529 INFO  indexer.Indexer - merging segments _0 (1 docs) into _1 (1 docs)
2006-09-03 18:27:05,442 INFO  indexer.Indexer - Indexer: done
2006-09-03 18:27:05,444 INFO  indexer.DeleteDuplicates - Dedup: starting
2006-09-03 18:27:05,459 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl/indexes
2006-09-03 18:27:08,042 INFO  indexer.DeleteDuplicates - Dedup: done
2006-09-03 18:27:08,043 INFO  indexer.IndexMerger - Adding crawl/indexes/part-00000
2006-09-03 18:27:08,080 INFO  crawl.Crawl - crawl finished: crawl

2006-09-03 18:28:07,452 INFO  crawl.CrawlDbReader - CrawlDb statistics start: crawl/crawldb
2006-09-03 18:28:09,387 INFO  crawl.CrawlDbReader - Statistics for CrawlDb: crawl/crawldb
2006-09-03 18:28:09,387 INFO  crawl.CrawlDbReader - TOTAL urls: 101
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - avg score:  1.008
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - max score:  1.009
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - min score:  1.0
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - retry 0:    1
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - retry 1:    100
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - status 1 (DB_unfetched):    100
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - status 2 (DB_fetched):    1
2006-09-03 18:28:09,388 INFO  crawl.CrawlDbReader - CrawlDb statistics: done



Thomas Delnoij wrote:

Just add scoring-opic to your plugin.includes in nutch-site.xml.
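
A minimal sketch of what that change might look like in nutch-site.xml. The original value of plugin.includes did not survive in the quoted message, so the plugin list here is an assumption modelled on a typical local-filesystem setup; only the trailing scoring-opic entry comes from the advice above.

```xml
<property>
  <name>plugin.includes</name>
  <!-- Assumed plugin list for a file:// crawl; adjust to the plugins you actually use.
       scoring-opic at the end is the addition being suggested. -->
  <value>protocol-file|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|scoring-opic</value>
</property>
```

The value is a regular expression matched against plugin names, so scoring-opic is simply joined to the existing entries with a | separator.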

Rgrds, Thomas

On 9/1/06, Cam Bazz <[EMAIL PROTECTED]> wrote:

Hello,

I wanted to index my files so I followed the instructions at

http://www.folge2.de/tp/search/1/crawling-the-local-filesystem-with-nutch


I get: Exception in thread "main" java.io.IOException: Job failed!

and looking at the log file:

2006-09-01 01:49:43,166 WARN  mapred.LocalJobRunner - job_p2pnnk
java.lang.RuntimeException: No scoring plugins - at least one scoring plugin is required!
        at org.apache.nutch.scoring.ScoringFilters.<init>(ScoringFilters.java:84)
        at org.apache.nutch.crawl.Injector$InjectMapper.configure(Injector.java:57)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
        at org.apache.hadoop.mapred.MapRunner.configure(MapRunner.java:33)
        at org.apache.hadoop.mapred.JobConf.newInstance(JobConf.java:443)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:125)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:90)


My plugin.includes looks like:

<property>
  <name>plugin.includes</name>

</property>

How can I add a scoring plugin? By default, we don't have to add a
scoring plugin, so I don't know where to go.

Any ideas appreciated,

Best Regards,
-C.B.
