I simply followed the wiki: "The right level of parallelism for maps
seems to be around 10-100 maps/node"
http://wiki.apache.org/lucene-hadoop/HowManyMapsAndReduces
We have 8 cores in each machine, so perhaps 100 mappers ought to be
about right. It's set to 157 in the config, but Hadoop used ~200 for
Thanks Runping. It seems the bug is still open. However, in my case,
there was plenty of disk space available.
On Jan 16, 2008 2:44 AM, Runping Qi [EMAIL PROTECTED] wrote:
I encountered a similar case.
Here is the Jira: https://issues.apache.org/jira/browse/HADOOP-2164
Runping
I used Norton Ghost 8.0 to ghost a whole Ubuntu hard disk to an image, and
restored another hard disk from the image, but the restored hard disk cannot
start up Ubuntu successfully.
GRUB reports error 22.
Does anybody know how to fix this problem?
Thanks.
Bin YANG
On Jan 16, 2008 4:54 AM,
There is some considerable and very understandable confusion about map
tasks, mappers and input splits.
It is true that for large inputs the input should ultimately be split into
chunks so that each core that you have has to process 10-100 pieces of data.
To do that, however, you only need one
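For example, here is a minimal driver sketch against the old
org.apache.hadoop.mapred API (the class name, paths, and node count are made
up; setNumMapTasks is only a hint, and the InputFormat decides the actual
number of splits):

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SplitHintDriver {
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(SplitHintDriver.class);
    conf.setJobName("split-hint-demo");
    // Placeholder paths; map and reduce default to the identity classes.
    conf.setInputPath(new Path("/user/bear/input"));
    conf.setOutputPath(new Path("/user/bear/output"));
    // Aim for roughly 10 map tasks per core: 8 cores/node * 12 nodes * 10.
    conf.setNumMapTasks(8 * 12 * 10);
    JobClient.runJob(conf);
  }
}

Newer releases deprecate setInputPath/setOutputPath in favour of the
FileInputFormat/FileOutputFormat helpers, so adjust to whatever your version
provides.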
This isn't really a question about Hadoop, but is about system
administration basics.
You are probably missing a master boot record (MBR) on the disk. Ask a
local linux expert to help you or look at the Norton documentation.
On 1/16/08 4:59 AM, Bin YANG [EMAIL PROTECTED] wrote:
I use the
Hi,
How do I make Hadoop split its output? The program I am writing
crawls a catalog tree from a single URL, so initially the input
contains only one entry. After a few iterations, it will have tens of
thousands of URLs. But what I noticed is that the output is always in
one block (part-0).
Parallelizing the processing of data occurs at two steps. The first is
during the map phase where the input data file is (hopefully) split across
multiple tasks. This should happen transparently most of the time unless
you have a perverse data format or use unsplittable compression on your
Thanks Ted. I just didn't ask it right. Here is a stupid 101
question whose answer I am sure lies in the documentation
somewhere; I was just having some difficulties finding it...
When I do an ls on the DFS, I see this:
/user/bear/output/part-0 &lt;r 4&gt;
I probably got
The part nomenclature does not refer to splits. It refers to how many
reduce processes were involved in actually writing the output file. Files
are split at read-time as necessary.
You will get more of them if you have more reducers.
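For example, in a JobConf-based driver (a sketch; the count of 4 is
arbitrary):

// Four reduce tasks give four output files, part-00000 through part-00003;
// a single reduce task puts everything into one part file.
conf.setNumReduceTasks(4);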
On 1/16/08 8:25 AM, Jim the Standing Bear [EMAIL
hmm.. interesting... these are supposed to be the output from mappers
(and default reducers since I didn't specify any for those jobs)...
but shouldn't the number of reducers match the number of mappers? If
there was only one reducer, it would mean I only had one mapper task
running?? That is
On 1/15/08 12:54 PM, Miles Osborne [EMAIL PROTECTED] wrote:
surely the clean way (in a streaming environment) would be to define a
representation of some kind which serialises the output.
http://en.wikipedia.org/wiki/Serialization
after your mappers and reducers have completed, you would
The number of reduces should be a function of the amount of data needing
reducing, not the number of mappers.
For example, your mappers might delete 90% of the input data, in which
case you should only need 1/10 of the number of reducers as mappers.
Miles
On 16/01/2008, Jim the Standing Bear
I've been running Hadoop 0.14.4 and, more recently, 0.15.2 on a dozen
machines in our CUBiT array for the last month. During this time I have
experienced two major data corruption losses on relatively small amounts
of data (50gb) that make me wonder about the suitability of this
platform for
Thanks, Miles.
On Jan 16, 2008 11:51 AM, Miles Osborne [EMAIL PROTECTED] wrote:
The number of reduces should be a function of the amount of data needing
reducing, not the number of mappers.
For example, your mappers might delete 90% of the input data, in which
case you should only need
Yahoo plans to host a summit / workshop on Apache Hadoop at our
Sunnyvale campus on March 25th. Given the interest we are seeing from
developers in a broad range of organizations, this seems like a good
time to get together and brief each other on the progress that is being
made.
We would like
The DFS is stored in /tmp on each box.
The developers who own the machines occasionally reboot and reprofile them.
Won't you lose your blocks after a reboot, since /tmp gets cleaned up? Could
this be the reason you see data corruption?
A good idea is to configure the DFS to live anywhere other than /tmp.
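For example, in hadoop-site.xml (a sketch; the /data paths are just
placeholders for local directories that survive a reboot):

<property>
  <name>hadoop.tmp.dir</name>
  <value>/data/hadoop/tmp</value>
</property>
<property>
  <name>dfs.name.dir</name>
  <value>/data/hadoop/dfs/name</value>
</property>
<property>
  <name>dfs.data.dir</name>
  <value>/data/hadoop/dfs/data</value>
</property>

By default dfs.name.dir and dfs.data.dir live under hadoop.tmp.dir, which
itself defaults to /tmp/hadoop-${user.name}, so relocating these keeps the
blocks across reboots.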
The /tmp default has caught us once or twice too. Now we put the files
elsewhere.
[EMAIL PROTECTED] wrote:
The DFS is stored in /tmp on each box.
The developers who own the machines occasionally reboot and reprofile them.
Won't you lose your blocks after a reboot, since /tmp gets cleaned up?
Thanks, I will try a safer place for the DFS.
Jeff
-Original Message-
From: Jason Venner [mailto:[EMAIL PROTECTED]
Sent: Wednesday, January 16, 2008 10:04 AM
To: hadoop-user@lucene.apache.org
Subject: Re: Platform reliability with Hadoop
The /tmp default has caught us once or twice too.
Apache's board this morning voted to make Hadoop a top-level project
(TLP). The initial project management committee (PMC) for Hadoop will
be composed of the following Hadoop committers:
* Andrzej Bialecki [EMAIL PROTECTED]
* Doug Cutting [EMAIL PROTECTED]
*
I was trying to copy a bunch of data over to my Hadoop installation
and got the error below:
[EMAIL PROTECTED] ~]$ hadoop/bin/hadoop dfs -copyFromLocal /export/gpfs/en
wikipedia
copyFromLocal: file:/export/gpfs/en/a/g/e/Ageha100%25.html: No such
file or directory
However, there is no
It seems to be a bug in the RawLocalFileSystem class.
I'll file an issue.
B. Regards,
Edward yoon @ NHN, corp.
Date: Wed, 16 Jan 2008 23:03:03 -0500
From: [EMAIL PROTECTED]
To: hadoop-user@lucene.apache.org
Subject: copyFromLocal bug?
I was trying to copy a bunch of data over to my hadoop
I filed an issue for this problem.
https://issues.apache.org/jira/browse/HADOOP-2635
Thanks.
B. Regards,
Edward yoon @ NHN, corp.
From: [EMAIL PROTECTED]
To: hadoop-user@lucene.apache.org
Subject: RE: copyFromLocal bug?
Date: Thu, 17 Jan 2008 04:55:20 +
It seems to be a bug in
Hi,
Why couldn't you just write this logic in your reducer class? The reduce
[reduceClass.reduce()] method is invoked with a key and an iterator over
the values associated with that key. You can simply dump the values into
a file. Since the input to the reducer is sorted, you can simply dump the
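Roughly, the reducer could look like the sketch below (untested, against the
old org.apache.hadoop.mapred API; the output directory is made up and the
generics may need adjusting for your Hadoop version). It writes one DFS file
per key instead of collecting to the usual part files:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class PerKeyFileReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private JobConf conf;

  public void configure(JobConf job) {
    this.conf = job;
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    FileSystem fs = FileSystem.get(conf);
    // One file per key; the directory name is a placeholder.
    Path file = new Path("/user/bear/per-key/" + key.toString());
    FSDataOutputStream out = fs.create(file);
    while (values.hasNext()) {
      out.write(values.next().toString().getBytes());
      out.write('\n');
    }
    out.close();
  }
}

One caveat: with speculative execution on, two attempts of the same reduce
can race on the same file, so either turn it off or write to an
attempt-specific path and rename at the end.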
I would like the values for a key to exist in a single file, and only
the values for that key. Each reduced key/value would get its own
file. If I understand correctly, all output of the reducers is
written to a single file.
-Myles
On Jan 16, 2008, at 9:29 PM, Amar Kamat wrote:
Hi,
Dear colleagues,
Now I have to use HBase in my map and reduce functions, but I
don't know how to use it. I have seen the examples in the FAQ and
org.apache.hadoop.hbase, but I can't run them successfully. Can you
give me some simple examples that show how to manipulate HBase using
the Java API in my map
Please copy the hadoop-0.16.*-hbase.jar to the ${hadoop_home}/lib folder.
And here's an example hadoop-site.xml entry:
<property>
  <name>hbase.master</name>
  <value>a51066.nhncorp.com:6</value>
  <description>The port for the hbase master web UI.
  Set to -1 if you do not want the info server to run.</description>
</property>