Uroš Gruber wrote:
Andrzej Bialecki wrote:
Uroš Gruber wrote:
Hi,
I've made some changes in CrawlDbReader so that it can read the fetchlist
produced by the generate command. At first I thought the problem was in
my modified reader, because some urls from inject were missing. Then I
tested with only 6 urls. I manually checked the files produced by inject
and by generate, and generate put only 3 urls in the fetch list.
I don't quite understand this. As far as I understand the generate
command, it collects urls from the crawldb, sorts them by score and
puts them in the crawl_generate directory.
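
To make that expectation concrete, here is a minimal sketch of the behaviour described above: take every entry, sort by score, and emit a fetch list. It is not Nutch's Generator code. The class GenerateSketch, the Entry type, the select method and the topN parameter are all hypothetical names used for illustration, and the sketch ignores any other filtering the real Generator may apply.

```java
import java.util.ArrayList;
import java.util.List;

public class GenerateSketch {
    // Hypothetical stand-in for a crawldb entry: a URL plus its score.
    static final class Entry {
        final String url;
        final double score;

        Entry(String url, double score) {
            this.url = url;
            this.score = score;
        }
    }

    // Sort all entries by descending score and keep at most topN of them.
    static List<String> select(List<Entry> crawlDb, int topN) {
        List<Entry> sorted = new ArrayList<>(crawlDb);
        sorted.sort((a, b) -> Double.compare(b.score, a.score));
        List<String> fetchList = new ArrayList<>();
        for (Entry e : sorted) {
            if (fetchList.size() >= topN) {
                break;
            }
            fetchList.add(e.url);
        }
        return fetchList;
    }

    public static void main(String[] args) {
        // Six made-up URLs, mirroring the size of the test described above.
        List<Entry> crawlDb = new ArrayList<>();
        for (int i = 1; i <= 6; i++) {
            crawlDb.add(new Entry("http://example.com/page" + i, i * 0.5));
        }
        for (String url : select(crawlDb, 100)) {
            System.out.println(url);
        }
    }
}
```

Under this simplified model, six injected urls with a large enough topN give six fetch-list entries, so a fetch list of only 3 suggests that some step outside this model is dropping entries.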
Are you running in local mode, or in map-reduce mode with several
tasktrackers? What is the number of reduce tasks in this "generate" job?
I'm running in local mode with mapred.reduce.tasks at its default (1)
and 2 map tasks.
Debugging through the map and reduce jobs (Generator$Selector [line: 147] -
reduce, Generator$Selector [line: 99] - map) looks ok, and it collects
all urls from the CrawlDB. I can't figure out why data is lost when
moving it from /tmp to crawl/segments/***/crawl_generate.
If anyone could point me in the right direction on where to look,
I'd appreciate it.
regards
Uros