Uroš Gruber wrote:
Andrzej Bialecki wrote:
Uroš Gruber wrote:
Hi,

I've made some changes in CrawlDbReader so that it can read the fetchlist produced by the generate command. At first I thought I had a problem with this script, because some URLs from inject were missing. Then I tested with only 6 URLs. I manually checked the files produced by inject and by generate, and generate put only 3 URLs in the fetch list.

I don't quite understand this. As far as I understand the generate command, it collects URLs from the crawldb, sorts them by score, and puts them in the crawl_generate directory.
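
For reference, the expectation described above can be written down as a minimal sketch. This is not Nutch's Generator code: the class `GenerateSketch`, the `Entry` type, the `select` method, and the example.com URLs are all hypothetical, and the sketch assumes that selection is nothing more than "sort by score, keep the top N". Under that assumption, 6 injected URLs with a large enough limit should give 6 fetch-list entries, which is why a result of 3 looks wrong.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class GenerateSketch {
    // Hypothetical stand-in for a crawldb entry: a URL plus its score.
    static final class Entry {
        final String url;
        final float score;

        Entry(String url, float score) {
            this.url = url;
            this.score = score;
        }
    }

    // Assumed model of "generate": sort by descending score, keep at most topN.
    static List<String> select(List<Entry> crawlDb, int topN) {
        List<Entry> sorted = new ArrayList<>(crawlDb);
        sorted.sort(Comparator.comparingDouble((Entry e) -> e.score).reversed());
        List<String> fetchList = new ArrayList<>();
        for (Entry e : sorted) {
            if (fetchList.size() >= topN) {
                break;
            }
            fetchList.add(e.url);
        }
        return fetchList;
    }

    public static void main(String[] args) {
        List<Entry> crawlDb = new ArrayList<>();
        for (int i = 1; i <= 6; i++) {
            // Placeholder URLs with distinct scores.
            crawlDb.add(new Entry("http://example.com/page" + i, i * 0.1f));
        }
        // With a limit above 6, all 6 URLs are expected in the fetch list.
        System.out.println(select(crawlDb, 100).size());
    }
}
```

If the real job applies further steps beyond this model, such as filtering or partitioning of the output, that would be the place to look for the missing entries.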

Are you running in local mode, or in map-reduce mode with several tasktrackers? What is the number of reduce tasks in this "generate" job?

I'm running in local mode with the defaults: mapred.reduce.tasks (1) and map.tasks (2).

Debugging through the map and reduce jobs (Generator$Selector [line: 147] - reduce, Generator$Selector [line: 99] - map) looks OK, and it collects all URLs from the CrawlDB. I can't figure out why data is lost when it is moved from /tmp to crawl/segments/***/crawl_generate.

If anyone could point me in the right direction on where to look, I would appreciate it.

regards

Uros