On 2010-08-02 22:59, Scott Gonyea wrote:
> By the way, can anyone tell me if there is a way to explicitly limit
> how many pages should be fetched, per fetcher task?
I believe that in the general case this is a very complex problem to
solve exactly. The reason is that Nutch doesn't use any global lock
manager, so the only way to ensure proper per-host locking is to assign
all URLs from any given host to the same map task. This may (and often
will) create an imbalance in the number of URLs allocated per task.
One way to mitigate this imbalance is to set generate.max.count (in
trunk; generate.max.per.host in 1.1) - this caps the number of URLs
taken from any given host, which helps mix the resulting smaller
per-host chunks more evenly across the map tasks.
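As a sketch of the relevant settings (property names as found in trunk's
nutch-default.xml - check your version; in 1.1 the property is
generate.max.per.host, and the values below are only illustrative), you
would add something like this to conf/nutch-site.xml:

```xml
<!-- Illustrative values; -1 (the default) means no limit. -->
<property>
  <name>generate.max.count</name>
  <value>100</value>
  <description>Maximum number of URLs from a single host (or domain,
  depending on generate.count.mode) in one fetchlist.</description>
</property>
<property>
  <name>generate.count.mode</name>
  <value>host</value>
  <description>Whether the limit above is counted per host or per
  domain.</description>
</property>
```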
> I think part of the problem is that Nutch seems to be generating some
> really unbalanced fetcher tasks. One task
> (task_201008021617_0026_m_000000) had 6859 pages to fetch, and each
> higher-numbered task had fewer. Task 000180 only had 44 pages to
> fetch.
There's no specific tool to examine the composition of fetchlist
parts... try running this in segments/2010*/crawl_generate/:

for i in part-00*
do
  echo "---- part $i -----"
  strings "$i" | grep 'http://'
done
to print the URLs assigned to each map task. Most likely you will see
that there was no other way to allocate the URLs per task while still
satisfying the per-host constraint explained above. If that's not the
case, then it's a bug. :)
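If you only want the per-part URL counts rather than the full listing, a
small variant of the loop above can be wrapped in a function (count_urls
is just my name for it, not a Nutch tool, and it makes the same
assumptions about the crawl_generate layout):

```shell
# Print "<part file> <tab> <URL count>" for each file given as argument,
# e.g.: count_urls part-00*
count_urls() {
  for i in "$@"
  do
    # strings extracts the printable text; grep -c counts matching lines
    printf '%s\t%d\n' "$i" "$(strings "$i" | grep -c 'http://')"
  done
}
```

Sorting that output numerically on the second column makes the imbalance
easy to eyeball.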
> This *huge* imbalance, I think, makes task runtimes unpredictable.
> All of my other nodes just sit idle, wasting resources, until the one
> task that grabbed some crazy number of sites finishes.
Again, generate.max.count is your friend - even though you won't be able
to get all pages from a big site in one go, at least your crawls will
finish quickly, and you will make fast progress breadth-wise, if not
depth-wise.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com