Sorry, it's actually:

/state/partition1/hdfs/hadoop/mapred/system (not accessible):

org.apache.hadoop.fs.permission.AccessControlException: Permission
denied: user=webuser, access=READ_EXECUTE,
inode="system":hadoop:supergroup:rwx-wx-wx

and:

/state/partition1/hdfs/hadoop/mapred/temp/inject-temp-1098577375/part-00000
(168.92 MB in size)
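
In case it helps, this is the kind of thing I've been poking at with
the dfs shell to look at that "system" inode (just a sketch; the chmod
at the end is a hypothetical workaround, not something I've confirmed
is the right fix):

  # confirm owner/group/mode of the offending inode
  bin/hadoop dfs -ls /state/partition1/hdfs/hadoop/mapred/system

  # hypothetical workaround: give other users (e.g. webuser) read/execute
  bin/hadoop dfs -chmod -R 755 /state/partition1/hdfs/hadoop/mapred/system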

Is it ok for nutch+hadoop to generate temp files on DFS? I thought
temp files were supposed to be generated on each individual *local*
node filesystem :/

Do I have a wrong directive in my hadoop-site.xml that's causing this?
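
For reference, these are the kinds of directives I mean (a minimal
hadoop-site.xml sketch; the property names come from the stock
hadoop-default.xml, but the values here are placeholders, not
necessarily what I actually have):

  <property>
    <name>hadoop.tmp.dir</name>
    <value>/state/partition1/hdfs/hadoop</value>
    <!-- placeholder: base path for the temp dirs below -->
  </property>
  <property>
    <name>mapred.local.dir</name>
    <value>${hadoop.tmp.dir}/mapred/local</value>
    <!-- per-node local disk for intermediate map output -->
  </property>
  <property>
    <name>mapred.temp.dir</name>
    <value>${hadoop.tmp.dir}/mapred/temp</value>
    <!-- shared temp dir, lives on the default (distributed) filesystem -->
  </property>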

Thanks in advance!

On Thu, Jul 17, 2008 at 6:05 PM, brainstorm <[EMAIL PROTECTED]> wrote:
> /state/partition1/hdfs/{mapred|temp} is also being created
> automatically on DFS with each new crawl... is that ok? Seems weird to me :/
>
> On Thu, Jul 17, 2008 at 5:44 PM, brainstorm <[EMAIL PROTECTED]> wrote:
>> Hi!
>>
>> I've been running nutch for a while on a 4-node cluster, and I'm quite
>> disappointed with my results... I'm pretty sure I'm doing
>> something wrong, but I've re-read/tested tons of related
>> documentation to no avail :_(
>>
>> The problem is that crawling in a single-node setup is actually more
>> efficient than using clustered nutch+hadoop. For instance, given the
>> same URL input set:
>>
>> standalone nutch+hadoop install (single node): dumped parsed_text is
>> 425 MB, 2 days.
>> 4-node cluster: 55 MB, 2 days :_/
>>
>> I'm attaching my {hadoop|nutch}-site.xml files... if you're able to
>> pinpoint the problem, that would be really helpful. What really
>> annoys me is how long some of the tasks take: updating the crawldb takes
>> 3+ hours, while in standalone mode it was a matter of minutes :/
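>>
>> To give an idea of the parallelism knobs I've been touching, this is
>> a sketch of the relevant bits (the values are illustrative, not
>> necessarily what's in the attached file):
>>
>>   <property>
>>     <name>mapred.map.tasks</name>
>>     <value>8</value>
>>     <!-- illustrative value, roughly 2x the number of nodes -->
>>   </property>
>>   <property>
>>     <name>mapred.reduce.tasks</name>
>>     <value>4</value>
>>     <!-- illustrative value, one per node -->
>>   </property>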
>>
>> More details:
>>
>> /state/partition1/hdfs is present on all nodes with actual data on it:
>>
>> [EMAIL PROTECTED] ~]$ cluster-fork du -hs /state/partition1/hdfs
>> compute-0-1:
>> 197M    /state/partition1/hdfs
>> compute-0-2:
>> 156M    /state/partition1/hdfs
>> compute-0-3:
>> 288M    /state/partition1/hdfs
>>
>> Nutch+hadoop trunk is checked out in /home/hadoop and exported via NFS
>> to all nodes (note that the DFS data is on separate *local* space, not
>> exported (/state...)).
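>>
>> (The DFS directories point at that local space, roughly like this;
>> the exact subdirectory names here are guesses/placeholders:)
>>
>>   <property>
>>     <name>dfs.name.dir</name>
>>     <value>/state/partition1/hdfs/name</value>
>>     <!-- placeholder subdir on the master's local disk -->
>>   </property>
>>   <property>
>>     <name>dfs.data.dir</name>
>>     <value>/state/partition1/hdfs/data</value>
>>     <!-- placeholder subdir on each node's local disk -->
>>   </property>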
>>
>> Thanks in advance
>>
>
