Thank you very much, Andrzej. I'm hoping some people can share some non-sensitive details of their setups. I'm curious about the following:
- The ratio of Maps to Reduces for their Nutch jobs?
- The amount of memory that they allocate to each job task?
- The number of simultaneous Maps/Reduces on any given host?
- The number of fetcher threads they execute?

Any config setup people can share would be great, so I can get a different perspective on how people set up their nutch-site and mapred-site files.

At the moment I'm experimenting with the following configs: http://gist.github.com/505065

I'm giving each task 2048m of memory. Up to 5 Maps and 2 Reduces run at any given time on a host. I have Nutch firing off 181 Maps and 41 Reduces. Those are both prime numbers, but I don't know if that really matters; I've seen Hadoop say that the number of reducers should be around the number of nodes you have (the nearest prime). I've also seen, somewhere, suggestions that the Nutch map:reduce ratio be anywhere from 1:0.93 to 1:1.25. Does anyone have insight to share on that?
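In condensed form, the settings above boil down to roughly the following (the full nutch-site and mapred-site files are in the gist; property names are the stock Hadoop 0.20 / Nutch ones, and the fetcher thread values below are placeholders since I'm still tuning them):

  <!-- mapred-site.xml, inside <configuration> -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>5</value>          <!-- simultaneous maps per host -->
  </property>
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>          <!-- simultaneous reduces per host -->
  </property>
  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx2048m</value>  <!-- heap per task JVM -->
  </property>
  <property>
    <name>mapred.map.tasks</name>
    <value>181</value>        <!-- only a hint; the actual map count follows the input splits -->
  </property>
  <property>
    <name>mapred.reduce.tasks</name>
    <value>41</value>
  </property>

  <!-- nutch-site.xml: fetcher parallelism (illustrative values, not my actual numbers) -->
  <property>
    <name>fetcher.threads.fetch</name>
    <value>10</value>
  </property>
  <property>
    <name>fetcher.threads.per.host</name>
    <value>1</value>
  </property>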
Thank you, Andrzej, for the SIGQUIT suggestion. I forgot about that. I'm waiting for it to return to the 4th fetch step, so I can see why Nutch hates me so much.

sg

On Mon, Aug 2, 2010 at 3:47 AM, Andrzej Bialecki <[email protected]> wrote:
> On 2010-08-02 10:17, Scott Gonyea wrote:
>
>> The big problem that I am facing, thus far, occurs on the 4th fetch.
>> All but 1 or 2 maps complete. All of the running reduces stall (0.00
>> MB/s), presumably because they are waiting on that map to finish? I
>> really don't know and it's frustrating.
>
> Yes, all map tasks need to finish before reduce tasks are able to proceed.
> The reason is that each reduce task receives a portion of the keyspace (and
> values) according to the Partitioner, and in order to prepare a nice <key,
> list(value)> in your reducer it needs to, well, get all the values under
> this key first, whichever map task produced the tuples, and then sort them.
>
> The failing tasks probably fail due to some other factor, and very likely
> (based on my experience) the failure is related to some particular URLs.
> E.g. regex URL filtering can choke on some pathological URLs, like URLs
> 20kB long, or containing '\0', etc. In my experience, it's best to keep
> regex filtering to a minimum if you can, and use other urlfilters (prefix,
> domain, suffix, custom) to limit your crawling frontier. There are simply
> too many ways where a regex engine can lock up.
>
> Please check the logs of the failing tasks. If you see that a task is
> stalled you could also log in to the node, and generate a thread dump a few
> times in a row (kill -SIGQUIT <pid>) - if each thread dump shows the regex
> processing then it's likely this is your problem.
>
>> My scenario:
>>   # Sites:   10,000-30,000 per crawl
>>   Depth:     ~5
>>   Content:   Text is all that I care for. (HTML/RSS/XML)
>>   Nodes:     Amazon EC2 (ugh)
>>   Storage:   I've performed crawls with HDFS and with Amazon S3. I thought
>>              S3 would be more performant, yet it doesn't appear to affect
>>              matters.
>>   Cost vs Speed: I don't mind throwing EC2 instances at this to get it done
>>              quickly... But I can't imagine I need much more than 10-20
>>              mid-size instances for this.
>
> That's correct - with this number of unique sites the max. throughput of
> your crawl will be ultimately limited by the politeness limits (# of
> requests/site/sec).
>
>> Can anyone share their own experiences in the performance they've seen?
>
> There is a very simple benchmark in trunk/ that you could use to measure
> the raw performance (data processing throughput) of your EC2 cluster. The
> real-life performance, though, will depend on many other factors, such as
> the number of unique sites, their individual speed, and (rarely) the total
> bandwidth at your end.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
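P.S. If the thread dumps do show the regex filter spinning, my plan (untested; just a sketch of the prefix/domain approach Andrzej describes above) is to drop urlfilter-regex from plugin.includes in nutch-site.xml and rely on the cheaper filters instead, something like:

  <!-- nutch-site.xml: sketch only - the plugin list is trimmed here, and the rest
       of my plugin.includes would stay as in the gist -->
  <property>
    <name>plugin.includes</name>
    <value>protocol-http|urlfilter-(prefix|domain)|parse-(text|html)|index-basic|scoring-opic</value>
  </property>

with the allowed URL prefixes listed one per line in conf/prefix-urlfilter.txt (if I'm remembering the file name correctly). I'll report back once the crawl reaches the 4th fetch again.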

