Hi all,

We are using Nutch to crawl ~500K pages with a 3-node cluster. Each node has a dual-core processor, 4 GB of RAM and circa 100 GB of storage. All nodes run CentOS.
These 500K pages are spread across several sites, each of which has between 5K and 200K pages. For each site we start a separate crawl process (using bin/nutch crawl), but all of them are started almost simultaneously.

We are trying to tune Hadoop's configuration in order to have a reliable daily crawling process. After crawling for a while we see problems occurring, mainly on the TaskTracker nodes, and most of them are related to HDFS access. We often see "Bad response 1 for block" and "Filesystem closed", among others. When these errors become more frequent, the JobTracker gets stuck and we have to run stop-all. If we lower the maximum number of map and reduce tasks, the process takes longer to get stuck, but we haven't found an adequate configuration yet.

Given that setup, there are some questions we have been struggling to answer:

1. What is the most probable cause of the HDFS problems?
2. Is it better to start a single crawl covering all sites, or to keep doing it the way we are (i.e. start a separate crawl process for each site)?
3. When it all goes down, is there a way to restart crawling from where the process stopped?

Thanks in advance,

Emmanuel de Castro Santana
