Problem solved!
It was a static /etc/hosts file with wrong IP addresses pointing to the
frontend node :-! Sorry for crossposting, but this is what I sent to the
hadoop mailing list in response to a related email:
Got this problem too, and fixed it just 5 minutes ago... there were
wrong IP entries on the nodes pointing to the frontend, and it was
slowing down the reduce phase *a lot*... in numbers:
Wrong hosts file, using the wordcount example: 3 hrs, 45 mins, 41 secs
(4 minutes map, the rest reduce)
Right hosts file, using the wordcount example: 6 mins, 26 secs
Moral of the story: AVOID static hosts files, always use DNS.
PS: Static hosts files were replicated by rocksclusters to all compute
nodes at install (kickstart) time, but not refreshed afterwards when
running "rocks sync dns" or "rocks sync config".
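For anyone hitting the same thing, a quick sanity check along these lines can catch it: compare each worker's static hosts entry with what DNS actually returns. This is only a sketch; the compute-0-* node names match my cluster and the use of getent is an assumption about your resolver setup.

```shell
#!/bin/sh
# Print the static address recorded for hostname $2 in hosts file $1
# (skips comment lines; matches any alias column).
hosts_entry() {
  awk -v h="$2" '$0 !~ /^#/ { for (i = 2; i <= NF; i++) if ($i == h) { print $1; exit } }' "$1"
}

# Compare /etc/hosts against DNS for each compute node (names assumed).
for h in compute-0-1 compute-0-2 compute-0-3; do
  static=$(hosts_entry /etc/hosts "$h")
  dns=$(getent ahostsv4 "$h" | awk 'NR == 1 { print $1 }')
  [ "$static" = "$dns" ] || echo "MISMATCH on $h: hosts=$static dns=$dns"
done
```

Run it on every node (e.g. via cluster-fork); any MISMATCH line means reducers may be fetching map output from the wrong machine.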
On Thu, Jul 17, 2008 at 6:37 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> Sorry, it's actually:
>
> /state/partition1/hdfs/hadoop/mapred/system (not accessible):
>
> org.apache.hadoop.fs.permission.AccessControlException: Permission
> denied: user=webuser, access=READ_EXECUTE,
> inode="system":hadoop:supergroup:rwx-wx-wx
>
> and:
>
> /state/partition1/hdfs/hadoop/mapred/temp/inject-temp-1098577375/part-00000
> (168.92 MB in size)
>
> Is it OK for nutch+hadoop to generate temp files on DFS? I thought
> that temp files were supposed to be generated on each node's *local*
> filesystem :/
>
> Do I have a wrong directive in my hadoop-site.xml causing this?
>
> Thanks in advance!
>
> On Thu, Jul 17, 2008 at 6:05 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> /state/partition1/hdfs/{mapred|temp} is also being created
>> automatically on DFS with each new crawl... is that OK? Seems weird to me :/
>>
>> On Thu, Jul 17, 2008 at 5:44 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>>> Hi!
>>>
>>> I've been running nutch for a while on a 4-node cluster, and I'm quite
>>> disappointed with my results... I'm quite sure that I'm doing
>>> something wrong, but I've re-read/tested tons of related
>>> documentation to no avail :_(
>>>
>>> The problem is that crawling in a single-node setup is actually more
>>> efficient than using clustered nutch+hadoop. For instance, given the
>>> same URL input set:
>>>
>>> standalone nutch+hadoop install (single node): dumped parsed_text is
>>> 425 MB, 2 days.
>>> 4-node cluster: 55 MB, 2 days :_/
>>>
>>> I'm attaching my {hadoop|nutch}-site.xml files... if you can
>>> pinpoint the problem, that would be really useful to me. What really
>>> annoys me is the time some of the tasks take: the crawldb update takes
>>> 3+ hours, while standalone it was a matter of minutes :/
>>>
>>> More details:
>>>
>>> /state/partition1/hdfs is present on all nodes with actual data on it:
>>>
>>> [EMAIL PROTECTED] ~]$ cluster-fork du -hs /state/partition1/hdfs
>>> compute-0-1:
>>> 197M /state/partition1/hdfs
>>> compute-0-2:
>>> 156M /state/partition1/hdfs
>>> compute-0-3:
>>> 288M /state/partition1/hdfs
>>>
>>> Nutch+hadoop trunk is checked out at /home/hadoop and exported via NFS
>>> to all nodes (note that the DFS is on separate *local* storage, not
>>> exported (/state...)).
>>>
>>> Thanks in advance
>>>
>>
>
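
On the temp-file question in the quoted thread: as far as I can tell, in Hadoop of that era mapred.system.dir and mapred.temp.dir default to paths under hadoop.tmp.dir on the *default* (distributed) filesystem, so seeing mapred/system and mapred/temp appear on DFS with each job is expected; only the intermediate map output under mapred.local.dir has to live on each node's local disk. A hedged hadoop-site.xml sketch (the property names are the Hadoop 0.17-era ones; the paths are assumptions, not my actual config):

```xml
<!-- Sketch only: adjust paths to your cluster layout. -->
<property>
  <name>mapred.local.dir</name>
  <!-- per-node local scratch for intermediate map output; never on DFS -->
  <value>/state/partition1/hadoop/mapred/local</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <!-- shared job bookkeeping; lives on the default (distributed) filesystem -->
  <value>/hadoop/mapred/system</value>
</property>
```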