Hello, I still have limited knowledge of Nutch, but I can share some of my experience, as I am crawling a similar number of URLs.

On Nov 8, 2007 7:46 PM, Daniel Clark <[EMAIL PROTECTED]> wrote:

> I have a nine box cluster using hadoop and I want to get the optimum
> performance. I'm crawling 5 million sites. There are three settings in
> the hadoop-site.xml that I'm not clear on. Please, help.
>
> Mapred Map & Reduce Tasks
> ========================
>
> I have the following based on the description note, but the wiki said to
> use multiples of the number of slave hosts. Can I up this to 36 or even
> more to speed up my crawl? What is recommended?
>
> <property>
>   <name>mapred.map.tasks</name>
>   <value>9</value>
>   <description>
>     define mapred.map tasks to be number of slave hosts
>   </description>
> </property>

After some trial and error I found that setting this to a high number helps a lot when fetching a large number of URLs. Map tasks fail quite often, and when one does, only a small part of the data needs to be refetched. It also seems that the memory limit can become a problem when there are only a few map tasks. I've set it to 99 on a cluster of 3 machines.

> <property>
>   <name>mapred.reduce.tasks</name>
>   <value>9</value>
>   <description>
>     define mapred.reduce tasks to be number of slave hosts
>   </description>
> </property>

I've set it to 2 * the number of nodes, because all my machines have dual-core processors and each task runs on a single thread.

> Replication
> ===========
>
> The wiki said to use 2 or 3. Why? What is recommended for the best
> performance?
>
> <property>
>   <name>dfs.replication</name>
>   <value>2</value>
> </property>

This parameter defines how many copies of each data segment are stored on DFS. So for performance, 1 would be best, but you risk losing your index if one of the nodes fails.

> ~~~~~~~~~~~~~~~~~~~~~
> Daniel Clark, President
> DAC Systems, Inc.
> (703) 403-0340
> ~~~~~~~~~~~~~~~~~~~~~

--
Karol Rybak
Programista / Programmer
Sekcja aplikacji / Applications section
Wyższa Szkoła Informatyki i Zarządzania / University of Internet Technology and Management
+48(17)8661277
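
P.S. As a rough sketch only, the settings described in the reply for the 3-machine cluster could be collected in hadoop-site.xml along these lines. The reduce value of 6 is simply 2 * 3 nodes, and the replication value of 2 is carried over from the question rather than anything measured; treat all three as starting points to tune, not as recommendations.

<!-- hadoop-site.xml fragment: illustrative values for a 3-node, dual-core cluster -->
<property>
  <name>mapred.map.tasks</name>
  <!-- many small map tasks: a failed task only forces a small refetch -->
  <value>99</value>
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <!-- assumption: 2 tasks per node * 3 nodes -->
  <value>6</value>
</property>
<property>
  <name>dfs.replication</name>
  <!-- 1 is fastest but loses data if a node fails; 2 keeps one spare copy -->
  <value>2</value>
</property>

For a nine-node cluster the same reasoning would scale the counts with the number of nodes and cores, but the right multiple has to be found by trial and error.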
