On 2010-04-30 20:09, Emmanuel de Castro Santana wrote:
> Hi All
> 
> We are using Nutch to crawl ~500K pages with a 3-node cluster; each node
> has a dual-core processor, 4 GB of RAM and roughly 100 GB of storage.
> All nodes run on CentOS.
> 
> These 500K pages are spread across several sites, each of which has
> from 5k up to 200k pages. For each site we start a separate crawl process
> (using bin/nutch crawl), but they are all started almost simultaneously.
> 
> We are trying to tune Hadoop's configuration in order to get a reliable
> daily crawling process. After a while of crawling, problems start to
> appear, mainly on the TaskTracker nodes; most of them are related to
> HDFS access. We often see "Bad response 1 for block" and "Filesystem
> closed", among others. When these errors become more frequent, the
> JobTracker gets stuck and we have to run stop-all. If we lower the
> maximum number of map and reduce tasks, the process takes longer to get
> stuck, but we haven't found an adequate configuration yet.
> 
> Given that setup, there are some questions we have been struggling to
> answer:
> 
> 1. What could be the most probable reason for the HDFS problems?

I suspect the following issues in this order:

* too small a number of open file handles allowed on your machines (run
ulimit -n; this should be set to 16k or more, while the default is only
1k) - see the snippet after this list.

* do you use a SAN or some other type of shared network storage (NAS) for
your HDFS data? HDFS performs best on locally attached disks.

* network equipment, such as a router, switch or network card: quite
often low-end equipment cannot handle a high volume of traffic, even
though it's equipped with all the right ports ;)
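
If the file handle limit turns out to be the problem, checking and
raising it on CentOS usually looks something like the sketch below (the
"hadoop" user name and the 16384 value are assumptions; adjust them to
the account that actually runs your Hadoop daemons):

  # check the limit for the user that runs the Hadoop daemons
  ulimit -n

  # raise it persistently by adding these two lines (as root) to
  # /etc/security/limits.conf, assuming the daemons run as user "hadoop":
  #   hadoop  soft  nofile  16384
  #   hadoop  hard  nofile  16384

  # log that user out and back in (or restart the daemons), then verify
  ulimit -n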

> 
> 2. Is it better to start a single crawl with all the sites in it, or to
> keep doing it the way we do now (i.e. start a separate crawl process for
> each site)?

It's much, much better to crawl all sites at the same time. This allows
you to benefit from parallel fetching across many hosts - otherwise your
fetcher will always be stuck waiting out the per-host politeness crawl
delay.
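
As a sketch (the directory names, thread count, depth and topN values
below are made up; adjust them to your setup), combining all sites into
one crawl can be as simple as merging the seed lists and running a single
bin/nutch crawl:

  # put the seed URLs of all sites into a single seed directory
  mkdir -p urls
  cat site1-seeds.txt site2-seeds.txt site3-seeds.txt > urls/seeds.txt

  # run one crawl over the combined seed list
  bin/nutch crawl urls -dir crawl -threads 50 -depth 10 -topN 100000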

> 
> 3. When it all goes down, is there a way to restart the crawl from where
> the process stopped?

Unfortunately, no. You should at least crawl without parsing, so that
once the content has been downloaded you can run the parsing as a
separate step and repeat just that step if it fails.
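
A sketch of such a fetch-then-parse cycle (assuming fetcher.parse is set
to false in nutch-site.xml; the crawldb/segments paths and the topN value
are just examples):

  # generate a fetch list and pick the newest segment
  bin/nutch generate crawl/crawldb crawl/segments -topN 100000
  s=`ls -d crawl/segments/* | tail -1`

  # download only, no parsing
  bin/nutch fetch $s

  # parse as a separate step; if it fails, re-run just this step
  bin/nutch parse $s

  # update the crawldb with the results
  bin/nutch updatedb crawl/crawldb $s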

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
