On 2010-04-30 20:09, Emmanuel de Castro Santana wrote:
> Hi All
>
> We are using Nutch to crawl ~500K pages with a 3-node cluster; each node
> features a dual-core processor with 4 GB RAM and circa 100 GB of storage.
> All nodes run CentOS.
>
> These 500K pages are scattered across several sites, each of them having
> from 5K up to 200K pages. For each site we start a different crawl process
> (using bin/nutch crawl), but they are all started almost simultaneously.
>
> We are trying to tune Hadoop's configuration in order to have a reliable
> daily crawling process. After a while of crawling we see some problems
> occurring, mainly on the TaskTracker nodes; most of them are related to
> access to HDFS. We often see "Bad response 1 for block" and "Filesystem
> closed", among others. When these errors start to get more frequent, the
> JobTracker gets stuck and we have to run stop-all. If we lower the maximum
> number of map and reduce tasks, the process takes longer to get stuck, but
> we haven't found an adequate configuration yet.
>
> Given that setup, there are some questions we have been struggling to
> answer:
>
> 1. What could be the most probable reason for the HDFS problems?
I suspect the following issues, in this order:

* Too small a number of file handles on your machines (run ulimit -n; this
  should be set to 16k or more, the default is 1k - see the P.S. below).
* Do you use a SAN or some other type of NAS as your storage?
* Network equipment, such as a router, switch, or network card: quite often
  low-end equipment cannot handle a high volume of traffic, even though it's
  equipped with all the right ports ;)

> 2. Is it better to start a unique crawl with all sites inside, or to just
> keep it the way we are doing (i.e. start a different crawl process for
> each site)?

It's much, much better to crawl all sites at the same time. This allows you
to benefit from parallel crawling - otherwise your fetcher will always be
stuck in the politeness crawl delay. (A rough example is in the P.S.)

> 3. When it all goes down, is there a way to restart crawling from where
> the process stopped?

Unfortunately, no. You should at least crawl without parsing, so that once
you have downloaded the content you can run the parsing separately, and
repeat it if it fails. The P.S. below sketches one way to set this up.

--
Best regards,
Andrzej Bialecki <><
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
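
P.S. To make the file-handle point concrete, a minimal sketch, assuming the
daemons run as a "hadoop" user - the user name and the 16k value are only
examples, adjust them to your own setup:

  # check the current per-process limit for the user running Hadoop/Nutch
  ulimit -n

  # add these two lines to /etc/security/limits.conf on every node
  hadoop  soft  nofile  16384
  hadoop  hard  nofile  16384

  # log back in (or restart the daemons) and verify
  ulimit -n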
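
The single-crawl setup from question 2 would look roughly like this; the
seed paths, depth, topN and thread count below are placeholders, not values
from this thread:

  # merge the per-site seed lists into one seed directory
  mkdir -p urls
  cat seeds/site-*.txt > urls/seed.txt

  # one crawl covering all sites at once
  bin/nutch crawl urls -dir crawl -depth 3 -topN 50000 -threads 40

With all hosts in a single fetch list, the fetcher can work on other hosts
while it waits out the per-host politeness delay.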
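
And for crawling without parsing, the usual way is to set fetcher.parse to
false in conf/nutch-site.xml and drive the steps yourself, roughly as below;
the segment name is just a placeholder for whatever generate creates:

  # generate a fetch list and fetch it, without parsing during the fetch
  bin/nutch generate crawl/crawldb crawl/segments -topN 50000
  bin/nutch fetch crawl/segments/<segment>

  # parsing is now a separate job that can be re-run if it fails
  bin/nutch parse crawl/segments/<segment>

  # update the crawldb only after fetch and parse have succeeded
  bin/nutch updatedb crawl/crawldb crawl/segments/<segment>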