Hi!
I've been running Nutch for a while on a 4-node cluster, and I'm quite
disappointed with my results... I'm pretty sure I'm doing something
wrong, but I've re-read and re-tested tons of related documentation to
no avail :_(
The problem is that crawling in a single-node setup is actually more
efficient than using clustered Nutch+Hadoop. For instance, given the
same URL input set:
- standalone Nutch+Hadoop install (single node): the dumped parsed_text
is 425 MB after 2 days.
- 4-node cluster: 55 MB after 2 days :_/
I'm attaching my {hadoop|nutch}-site.xml files... if you are able to
pinpoint the problem, that would be really helpful. What really
annoys me is the time some of the tasks take: the crawldb takes
3+ hours, while in standalone it was a matter of minutes :/
More details:
/state/partition1/hdfs is present on all nodes with actual data on it:
[EMAIL PROTECTED] ~]$ cluster-fork du -hs /state/partition1/hdfs
compute-0-1:
197M /state/partition1/hdfs
compute-0-2:
156M /state/partition1/hdfs
compute-0-3:
288M /state/partition1/hdfs
The Nutch+Hadoop trunk is checked out in /home/hadoop and exported via
NFS to all nodes (note that the DFS data is on separate *local* space,
not exported (/state...)).
Thanks in advance
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>/state/partition1/hdfs/${user.name}</value>
<description>A base for other temporary directories.</description>
</property>
<property>
<name>dfs.datanode.address</name>
<value>0.0.0.0:50010</value>
<description>The port number that the dfs datanode server uses as a
starting point to look for a free port to listen on.
</description>
</property>
<property>
<name>fs.default.name</name>
<value>hdfs://cluster.local:9000</value>
<description>The name of the default file system. A URI whose
scheme and authority determine the FileSystem implementation. The
uri's scheme determines the config property (fs.SCHEME.impl) naming
the FileSystem implementation class. The uri's authority is used to
determine the host, port, etc. for a filesystem.</description>
</property>
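<!-- Illustration of the fs.SCHEME.impl naming described above: with an
hdfs:// URI like the one here, the FileSystem implementation class is
looked up under fs.hdfs.impl, and cluster.local:9000 supplies the host
and port. -->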
<property>
<name>mapred.job.tracker</name>
<value>cluster.local:9001</value>
<description>The host and port that the MapReduce job tracker runs
at. If "local", then jobs are run in-process as a single map
and reduce task.
</description>
</property>
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>2</value>
<description>
The maximum number of tasks that will be run simultaneously by
a task tracker. This should be adjusted according to the heap size
per task, the amount of RAM available, and CPU consumption of each task.
</description>
</property>
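<!-- For illustration (my own arithmetic, not from the docs): with 2 map
slots per tasktracker and 4 nodes, at most 2 x 4 = 8 map tasks can run
in the cluster at the same time. -->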
<property>
<name>dfs.replication</name>
<value>1</value>
<description>Default block replication.
The actual number of replications can be specified when the file is created.
The default is used if replication is not specified at create time.
</description>
</property>
<!-- do NOT put these properties in hadoop-site.xml (this file)-->
<!--
<property>
<name>mapred.map.tasks</name>
<value>1</value>
<description>The default number of map tasks per job. Typically set
to a prime several times greater than the number of available hosts.
Ignored when mapred.job.tracker is "local".
</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>1</value>
<description>The default number of reduce tasks per job. Typically set
to a prime close to the number of available hosts. Ignored when
mapred.job.tracker is "local".
</description>
</property>
-->
</configuration>
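(In case it matters: the mapred.map.tasks / mapred.reduce.tasks block
above is the one the comment says not to put in hadoop-site.xml. If
I've understood the docs correctly, those defaults go into a separate
mapred-default.xml or get set per job. The snippet below is only a
sketch of what I'd try there for 4 nodes, following the "prime several
times / close to the number of hosts" advice in the descriptions; it's
not what I'm actually running.)
<?xml version="1.0"?>
<configuration>
<property>
<name>mapred.map.tasks</name>
<value>13</value>
<description>Illustrative only: a prime a few times the number of
hosts in a 4-node cluster.</description>
</property>
<property>
<name>mapred.reduce.tasks</name>
<value>5</value>
<description>Illustrative only: a prime close to the number of
hosts in a 4-node cluster.</description>
</property>
</configuration>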
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<!-- ECXI requirement: 1500 outlinks per page -->
<property>
<name>db.max.outlinks.per.page</name>
<value>100</value>
<description>The maximum number of outlinks that we'll process for a page.
If this value is nonnegative (>=0), at most db.max.outlinks.per.page outlinks
will be processed for a page; otherwise, all outlinks will be processed.
</description>
</property>
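<!-- Worked example (my reading of the description above): with the
value 100, a page exposing 1500 outlinks (the ECXI figure mentioned
above) only has its first 100 processed; a negative value, e.g. -1,
would keep all of them. -->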
<property>
<name>fetcher.threads.per.host</name>
<value>10</value>
<description>This number is the maximum number of threads that
should be allowed to access a host at one time.</description>
</property>
<property>
<name>db.fetch.retry.max</name>
<value>2</value>
<description>The maximum number of times a url that has encountered
recoverable errors is generated for fetch.</description>
</property>
<property>
<name>fetcher.server.delay</name>
<value>1.5</value>
<description>The number of seconds the fetcher will delay between
successive requests to the same server.</description>
</property>
<property>
<name>http.max.delays</name>
<value>10</value>
<description>The number of times a thread will delay when trying to
fetch a page. Each time it finds that a host is busy, it will wait
fetcher.server.delay. After http.max.delays attempts, it will give
up on the page for now.</description>
</property>
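<!-- Rough arithmetic from the two values above: each "host busy"
attempt waits fetcher.server.delay = 1.5 s, so with http.max.delays =
10 a fetcher thread spends up to about 15 s on a busy host before
giving up on that page. -->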
<property>
<name>http.robots.agents</name>
<value>ecxi,*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.agent.name</name>
<value>ecxi</value>
<description>esCERT-UPC web crawling project</description>
</property>
<property>
<name>http.agent.description</name>
<value>esCERT-UPC-ecxi</value>
<description>Searching malware-infected websites...</description>
</property>
<property>
<name>http.agent.url</name>
<value>http://escert.upc.edu/</value>
<description>http://escert.upc.edu/</description>
</property>
<property>
<name>http.agent.email</name>
<value>admin escert edu</value>
<description>admin escert edu</description>
</property>
</configuration>