/state/partition1/hdfs/{mapred|temp} is also being created
automatically on each new crawl on DFS... is that OK? Seems weird to me :/
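In case it helps, here's roughly how I understand the directory layout (a sketch on my side, assuming the stock hadoop-default.xml settings; the value below is just my local DFS path, and the defaults listed in the comments are what I believe Hadoop derives from hadoop.tmp.dir):

```xml
<!-- hadoop-site.xml fragment (sketch): hadoop.tmp.dir is the parent of
     Hadoop's default DFS and MapReduce scratch directories, so mapred/
     and temp/ subdirs appearing under it on each job may just be normal
     working dirs rather than a misconfiguration. -->
<property>
  <name>hadoop.tmp.dir</name>
  <value>/state/partition1/hdfs</value>
</property>
<!-- unless overridden, scratch paths default to locations under
     hadoop.tmp.dir, e.g.:
       dfs.data.dir     -> ${hadoop.tmp.dir}/dfs/data
       mapred.local.dir -> ${hadoop.tmp.dir}/mapred/local   -->
```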

On Thu, Jul 17, 2008 at 5:44 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> Hi !
>
> I've been running nutch for a while in a 4-node cluster, and I'm quite
> disappointed with my results... I'm quite sure that I'm doing
> something wrong, but I've re-read/tested tons of related
> documentation to no avail :_(
>
> The problem is that crawling in a single-node setup is actually more
> efficient than using clustered nutch+hadoop, for instance, given the
> same URL input set:
>
> standalone nutch+hadoop install (single node): the dumped parsed_text
> is 425 MB, after 2 days.
> 4-node cluster: 55 MB, after the same 2 days :_/
>
> I'm attaching my {hadoop|nutch}-site.xml files... if you can pinpoint
> the problem, that would be really useful to me. What really annoys me
> is the time some of the tasks take: the crawldb update takes 3+
> hours, while in standalone mode it was a matter of minutes :/
>
> More details:
>
> /state/partition1/hdfs is present on all nodes with actual data on it:
>
> [EMAIL PROTECTED] ~]$ cluster-fork du -hs /state/partition1/hdfs
> compute-0-1:
> 197M    /state/partition1/hdfs
> compute-0-2:
> 156M    /state/partition1/hdfs
> compute-0-3:
> 288M    /state/partition1/hdfs
>
> Nutch+hadoop trunk is checked out in /home/hadoop and exported via
> NFS to all nodes (note that the DFS lives on separate *local* space,
> /state..., which is not exported).
>
> Thanks in advance
>
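I'm also starting to wonder about the task parallelism settings (the property names below are the stock Hadoop ones, and the values are just a guess of mine, not what's currently in my hadoop-site.xml): if mapred.map.tasks / mapred.reduce.tasks stay at their defaults, the crawl jobs might barely be split across the 4 nodes, which would match the timings I'm seeing. Something like:

```xml
<!-- hadoop-site.xml sketch: spread work over a 4-node cluster
     (illustrative values, not taken from my attached configs) -->
<property>
  <name>mapred.map.tasks</name>
  <value>8</value>   <!-- roughly 2x the number of task trackers -->
</property>
<property>
  <name>mapred.reduce.tasks</name>
  <value>4</value>   <!-- one per node -->
</property>
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>2</value>   <!-- concurrent map tasks per node -->
</property>
```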
