Thank you for answering, these tips are gonna be valuable
"too small number of file handles on your machines (run ulimit -n, this should be set to 16k or more, the default is 1k)." Set it to 16k. The process has been runing for a day now and it did not get stuck yet. I still often see errors related to the dfs though. Anyways, are there other properties I should pay attention to ? "do you use a SAN or other type of NAS as your storage?" No, this is not the case "network equipment, such as router or switch, or network card: quite often low-end equipment cannot handle high volume of traffic, even though it's equipped with all the right ports ;)" This should be one of the reasons "It's much much better to crawl all sites at the same time. This allows you to benefit from parallel crawling - otherwise your fetcher will be always stuck in the politeness crawl delay." Good, that means we are on the right track "Unfortunately, no. You should at least crawl without parsing, so that when you download the content you can run the parsing separately, and repeat it if it fails." I've just found this in the FAQ, can it be done ? http://wiki.apache.org/nutch/FAQ#How_can_I_recover_an_aborted_fetch_process.3F By the way, about not parsing, isn't necessary to parse the content anyway in order to generate links for the next segment ? If this is true, one would have to run parse separatedly, which would result the same. 2010/4/30 Andrzej Bialecki <a...@getopt.org> > On 2010-04-30 20:09, Emmanuel de Castro Santana wrote: > > Hi All > > > > We are using Nutch to crawl ~500K pages with a 3 node cluster, each node > > features a dual core processor running with 4Gb RAM and circa 100Gb > storage. > > All nodes run on CentOS. > > > > These 500K pages are scattered into several sites, each one of them > having > > from 5k up to 200k pages. For each site we start a different crawl > process > > (using bin/nutch crawl), but they are all almost simultaneously started. > > > > We are trying to tune Hadoop's configurations in order to have a reliable > > daily crawling process. After a while of crawling we see some problems > > occurring, mainly on the TaskTracker nodes, most of them are related to > > access to the HDFS. We often see "Bad response 1 for block" and > "Filesystem > > closed", among others. When these errors start to get more frequent, the > > JobTracker gets stuck and we have to run stop-all. If we adjust the > maximum > > of map and reduce tasks to lower values, the process takes longer to get > > stuck, but we haven't found the adequate configuration yet. > > > > Given that setup, there are some question we have been struggling to find > an > > answer > > > > 1. What could be the most probable reason for the hdfs problems ? > > I suspect the following issues in this order: > > * too small number of file handles on your machines (run ulimit -n, this > should be set to 16k or more, the default is 1k). > > * do you use a SAN or other type of NAS as your storage? > > * network equipment, such as router or switch, or network card: quite > often low-end equipment cannot handle high volume of traffic, even > though it's equipped with all the right ports ;) > > > > > 2. Is it better to start a unique crawl with all sites inside or to just > > keep it the way we are doing (i.e start a different crawl process for > each > > site) ? > > It's much much better to crawl all sites at the same time. This allows > you to benefit from parallel crawling - otherwise your fetcher will be > always stuck in the politeness crawl delay. > > > > > 3. When it all goes down, is there a way to restart crawling from where > the > > process stopped ? > > Unfortunately, no. You should at least crawl without parsing, so that > when you download the content you can run the parsing separately, and > repeat it if it fails. > > -- > Best regards, > Andrzej Bialecki <>< > ___. ___ ___ ___ _ _ __________________________________ > [__ || __|__/|__||\/| Information Retrieval, Semantic Web > ___|||__|| \| || | Embedded Unix, System Integration > http://www.sigram.com Contact: info at sigram dot com > > -- Emmanuel de Castro Santana