Hi!
I've been running Nutch for a while on a 4-node cluster, and I'm quite
disappointed with my results... I'm pretty sure I'm doing something
wrong, but I've re-read and re-tested tons of related documentation to
no avail :_(
The problem is that crawling in a single-node setup is actually more
efficient than using clustered Nutch+Hadoop. For instance, given the
same URL input set:
- standalone Nutch+Hadoop install (single node): the dumped parsed_text
is 425 MB after 2 days.
- 4-node cluster: 55 MB after 2 days :_/
I'm attaching my {hadoop|nutch}-site.xml files... if you are able to
pinpoint the problem, that would be really helpful. What really
annoys me is the time some of the tasks take: the crawldb takes
3+ hours, while in standalone it was a matter of minutes :/
More details:
/state/partition1/hdfs is present on all nodes with actual data on it:
[EMAIL PROTECTED] ~]$ cluster-fork du -hs /state/partition1/hdfs
compute-0-1:
197M /state/partition1/hdfs
compute-0-2:
156M /state/partition1/hdfs
compute-0-3:
288M /state/partition1/hdfs
The Nutch+Hadoop trunk is checked out in /home/hadoop and exported via
NFS to all nodes (note that the DFS data is on separate *local* space,
not exported (/state...)).
Thanks in advance
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/state/partition1/hdfs/${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:50010</value>
<description>The port number that the dfs datanode server uses as a
starting point to look for a free port to listen on.
</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://cluster.local:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
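<!-- Illustration of the fs.SCHEME.impl naming described above: with an
hdfs:// URI like the one here, the FileSystem implementation class is
looked up under fs.hdfs.impl, and cluster.local:9000 supplies the host
and port. -->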
<property>
<name>mapred.job.tracker</name>
<value>cluster.local:9001</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>2</value>
<description>
The maximum number of tasks that will be run simultaneously by
a task tracker. This should be adjusted according to the heap size
per task, the amount of RAM available, and CPU consumption of each task.
</description>
</property>
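<!-- For illustration (my own arithmetic, not from the docs): with 2 map
slots per tasktracker and 4 nodes, at most 2 x 4 = 8 map tasks can run
in the cluster at the same time. -->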
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time.
</description>
</property>
<!-- do NOT put these properties in hadoop-site.xml (this file)-->
<!--
<property>
<name>mapred.map.tasks</name>
<value>1</value>
<description>The default number of map tasks per job. Typically set
to a prime several times greater than the number of available hosts.
Ignored when mapred.job.tracker is "local".
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
<description>The default number of reduce tasks per job. Typically set
to a prime close to the number of available hosts. Ignored when
mapred.job.tracker is "local".
</description>
</property>
-->
</configuration>
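(In case it matters: the mapred.map.tasks / mapred.reduce.tasks block
above is the one the comment says not to put in hadoop-site.xml. If
I've understood the docs correctly, those defaults go into a separate
mapred-default.xml or get set per job. The snippet below is only a
sketch of what I'd try there for 4 nodes, following the "prime several
times / close to the number of hosts" advice in the descriptions; it's
not what I'm actually running.)
<?xml version="1.0"?>
<configuration>
<property>
<name>mapred.map.tasks</name>
<value>13</value>
<description>Illustrative only: a prime a few times the number of
hosts in a 4-node cluster.</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>5</value>
<description>Illustrative only: a prime close to the number of
hosts in a 4-node cluster.</description>
</property>
</configuration>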
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- ECXI requirement: 1500 outlinks per page -->
<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
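<!-- Worked example (my reading of the description above): with the
value 100, a page exposing 1500 outlinks (the ECXI figure mentioned
above) only has its first 100 processed; a negative value, e.g. -1,
would keep all of them. -->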
<property>
<name>fetcher.threads.per.host</name>
<value>10</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
<property>
<name>db.fetch.retry.max</name>
<value>2</value>
<description>The maximum number of times a url that has encountered
recoverable errors is generated for fetch.</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>1.5</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>http.max.delays</name>
<value>10</value>
<description>The number of times a thread will delay when trying to
fetch a page. Each time it finds that a host is busy, it will wait
fetcher.server.delay. After http.max.delays attempts, it will give
up on the page for now.</description>
</property>
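<!-- Rough arithmetic from the two values above: each "host busy"
attempt waits fetcher.server.delay = 1.5 s, so with http.max.delays =
10 a fetcher thread spends up to about 15 s on a busy host before
giving up on that page. -->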
<property>
<name>http.robots.agents</name>
<value>ecxi,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.agent.name</name>
<value>ecxi</value>
<description>esCERT-UPC web crawling project</description>
</property>
<property>
<name>http.agent.description</name>
<value>esCERT-UPC-ecxi</value>
<description>Searching malware-infected websites...</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://escert.upc.edu/</value>
<description>http://escert.upc.edu/</description>
</property>
<property>
<name>http.agent.email</name>
<value>admin escert edu</value>
<description>admin escert edu</description>
</property>
</configuration>