Dear all:
Does anyone know whether Mapper's map method is thread safe?
Thank you!
imcaptor
Each mapper instance will be executed in a separate JVM.
On Thu, May 14, 2009 at 2:04 PM, imcaptor imcap...@gmail.com wrote:
Dear all:
Does anyone know whether Mapper's map method is thread safe?
Thank you!
imcaptor
--
朱盛凯
Jash Zhu
复旦大学软件学院
Software School, Fudan University
Ultimately it depends on how you write the Mapper.map method.
The framework supports a MultithreadedMapRunner which lets you set the
number of threads running your map method simultaneously.
Chapter 5 of my book covers this.
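As a minimal sketch (old org.apache.hadoop.mapred API; the thread-count property name is what I recall from the 0.19/0.20 era, so double-check it against your version), enabling it looks roughly like this:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.MultithreadedMapRunner;

public class MultithreadedJobSetup {
  public static void enable(JobConf conf) {
    // Run each map task's map() calls on a pool of threads instead of one thread;
    // your Mapper.map must then be thread safe.
    conf.setMapRunnerClass(MultithreadedMapRunner.class);
    // Assumed property name for threads per map task; verify for your release.
    conf.setInt("mapred.map.multithreadedrunner.threads", 10);
  }
}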
On Wed, May 13, 2009 at 11:10 PM, Shengkai Zhu geniusj...@gmail.com
The customary practice is to have your Reducer.reduce method handle the
filtering if you are reducing your output, or the Mapper.map method if you
are not.
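For example, a minimal sketch of map-side filtering (old mapred API; the keep/drop condition is hypothetical) - records you don't collect simply never reach the output:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class FilterMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, NullWritable> {
  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, NullWritable> out, Reporter reporter)
      throws IOException {
    // Emit only the records that match the condition; all others are dropped.
    if (line.toString().contains("ERROR")) {  // hypothetical selection rule
      out.collect(line, NullWritable.get());
    }
  }
}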
On Wed, May 13, 2009 at 1:57 PM, Asim linka...@gmail.com wrote:
Hi,
I wish to output only selected records to the output files based on
Why don't you use it with localhost? Does that have a disadvantage?
As far as I know, there were several host-to-IP mapping problems in Hadoop, but
that was a while ago; I think these should have been solved by now.
It can also be about the order of IP-to-hostname mappings in the hosts file.
2009/5/14 andy2005cst
I am seeing the same problem posted on the list on the 11th and have not seen
any reply.
Billy
- Original Message -
From: Manish Katyal
manish.katyal-re5jqeeqqe8avxtiumw...@public.gmane.org
Newsgroups: gmane.comp.jakarta.lucene.hadoop.user
To:
Hi,
I just wonder whether we can use Moab Cluster Suite for managing a Hadoop cluster.
Vishal S. Ghawate
We find that disk I/O is the major bottleneck.
Device:  rrqm/s   wrqm/s    r/s     w/s    rsec/s   wsec/s  avgrq-sz  avgqu-sz   await  svctm   %util
sda        1.00     0.00  85.21    0.00  20926.32     0.00    245.58     31.59  364.49  11.77  100.28
sdb        5.76  4752.88  53.13  131.08
Hi,
Can someone tell me about the append functionality in Hadoop? Is it available now
in 0.20?
Regards,
Wasim
it's available, although not suitable for production purposes as i've
found / been told.
put the following in your $HADOOP_HOME/conf/hadoop-site.xml
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
-sd
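Once that flag is set, a rough client-side sketch (the path is a placeholder and the file must already exist) looks like:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Reopen an existing HDFS file and add bytes at its end.
    FSDataOutputStream out = fs.append(new Path("/user/wasim/events.log"));
    out.writeBytes("another record\n");
    out.close();
  }
}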
On Thu, May 14, 2009 at 1:27 PM, Wasim Bari wasimb...@msn.com
Is this property available in 0.20.0?
I don't think it is there in prior versions.
Vishal S. Ghawate
From: Sasha Dolgy [sdo...@gmail.com]
Sent: Thursday, May 14, 2009 6:03 PM
To: core-user@hadoop.apache.org
Subject: Re: Append in Hadoop
it's available,
I changed the ec2 scripts to have fs.default.name assigned to the public
hostname (instead of the private hostname).
Now I can submit jobs remotely via the socks proxy (the problem below is
resolved) - but the map tasks fail with an exception:
2009-05-14 07:30:34,913 INFO
yep, i'm using it in 0.19.1 and have used it in 0.20.0
-sasha
On Thu, May 14, 2009 at 1:35 PM, Vishal Ghawate
vishal_ghaw...@persistent.co.in wrote:
Is this property available in 0.20.0?
I don't think it is there in prior versions.
Vishal S. Ghawate
Has nobody encountered these problems: "Error
register getProtocolVersion" and "Error
register getBuildVersion"?
Starry
/* Tomorrow is another day. So is today. */
On Tue, May 12, 2009 at 13:27, Starry SHI starr...@gmail.com wrote:
Hi, all. Today I noticed that my hadoop cluster
Hi Joydeep,
The problem you are hitting may be because port 50001 isn't open,
whereas from within the cluster any node may talk to any other node
(because the security groups are set up to do this).
However I'm not sure this is a good approach. Configuring Hadoop to
use public IP addresses
Hi,
I want to set up a cluster of 5 nodes in such a way that
node1 - master
node2 - secondary namenode
node3 - slave
node4 - slave
node5 - slave
How do we go about that?
There is no property in hadoop-env where I can set the IP address for the
secondary namenode.
If I set node-1 and node-2 in
Where did you find that property?
Vishal S. Ghawate
From: Sasha Dolgy [sdo...@gmail.com]
Sent: Thursday, May 14, 2009 6:09 PM
To: core-user@hadoop.apache.org
Subject: Re: Append in Hadoop
yep, i'm using it in 0.19.1 and have used it in 0.20.0
-sasha
On
First of all, the secondary namenode is not what you might think a
secondary is - it's not a failover device. It does make a copy of the
filesystem metadata periodically, and it integrates the edits into the
image. It does *not* provide failover.
Second, you specify its IP address in
Search this list for that variable name. I made a post last week
inquiring about appends() and was given enough information to go hunt
down the info on Google and Jira.
On Thu, May 14, 2009 at 2:01 PM, Vishal Ghawate
vishal_ghaw...@persistent.co.in wrote:
Where did you find that property?
Hi Starry,
I noticed the same problem when I copied hadoop-metrics.properties from my
old hadoop-0.19 conf along with the other files. Make sure you are using the
right version of the conf files.
Hope that helps.
-Abhishek.
On Thu, May 14, 2009 at 7:48 AM, Starry SHI starr...@gmail.com wrote:
The EC2 documentation points to the use of public 'ip' addresses - whereas using
public 'hostnames' seems safe since they resolve to internal addresses from
within the cluster (and resolve to public IP addresses from outside).
The only data transfer that I would incur while submitting jobs from
Hi,
My company has about 50GB of pdfs and docs, and we would like to be able to
do some text search over a web interface.
Is there any good tutorial that specifies hardware requirements and software
specs to do this?
Regards
Yes, you're absolutely right.
Tom
On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma jssa...@facebook.com wrote:
The EC2 documentation points to the use of public 'ip' addresses - whereas
using public 'hostnames' seems safe since they resolve to internal addresses
from within the cluster (and
Hi,
I want to test how Hadoop and HBase are performing. I have a cluster with 1
namenode and 4 datanodes. I use Hadoop 0.19.1 and HBase 0.19.2.
I first ran a few tests when the 4 datanodes use local storage specified in
dfs.data.dir.
Now, I want to see what is the tradeoff if I switch from
Any machine put in the conf/masters file becomes a secondary namenode.
At some point there was confusion about the safety of running more than one,
which I believe was settled: several are safe.
The secondary namenode takes a snapshot at 5 minute (configurable)
intervals, rebuilds the fsimage and
Hi
First of all, you should probably know what you want to do exactly. Without
this, it is hard to estimate any hardware requirements.
I assume you want to use Hadoop for some kind of offline calculations used
for web-based search later?
In your place I would start with reading about how such
I'm implementing a map-side join as described in chapter 8 of Pro
Hadoop. I have two files that have been partitioned using the
TotalOrderPartitioner on the same key into the same number of
partitions. I've set mapred.min.split.size to Long.MAX_VALUE so that
one Mapper will handle an entire
Another possibility I am thinking about now, which is suitable for me as I do
not actually have much data stored in the cluster when I want to perform
this switch, is to set the replication level really high and then simply
remove the local storage locations and restart the cluster. With a bit of
Manish,
The pre-emption code in the capacity scheduler was found to require a good
relook, and due to the inherent complexity of the problem it is likely to
have issues of the type you have noticed. We have decided to relook at
the pre-emption code from scratch and to this effect removed it from the
Sort order is preserved if your Mapper doesn't change the key ordering in
output. Partition name is not preserved.
What I have done is to manually work out what the partition number of the
output file should be for each map task, by calling the partitioner on an
input key, and then renaming the
You can decommission the datanode, and then un-decommission it.
On Thu, May 14, 2009 at 7:44 AM, Alexandra Alecu
alexandra.al...@gmail.comwrote:
Hi,
I want to test how Hadoop and HBase are performing. I have a cluster with 1
namenode and 4 datanodes. I use Hadoop 0.19.1 and HBase 0.19.2.
Just had another cluster crash with the same issue. This is still a huge
issue for us - still crashing our cluster every other night (actually almost
every night now).
Should we move to .20? Is there more information I can provide? Is this
related to my other email, Constantly getting
Here is the point in the logs where the infinite loop begins - see time
stamp 2009-05-14 04:03:56,348 : (JobTracker)
2009-05-14 04:03:56,324 INFO org.apache.hadoop.mapred.JobTracker: Removed
completed task 'attempt_200905122015_1168_m_29_0' from
All,
I have read some recommendations regarding image (binary input) processing
using Hadoop Streaming, which only accepts text out of the box for now.
http://hadoop.apache.org/core/docs/current/streaming.html
https://issues.apache.org/jira/browse/HADOOP-1722
Hi Qiming,
You might consider using Dumbo, which is a Python wrapper for Hadoop
Streaming. The associated typedbytes module makes it easy for
streaming programs to work with binary data:
http://wiki.github.com/klbostee/dumbo
http://wiki.github.com/klbostee/typedbytes
Hi
If you want to read the files from HDFS and cannot pass the binary data,
you can do some encoding of it (Base64 for example, but you can think about
something more efficient since the range of characters acceptable in the input
string is wider than that used by Base64). It should solve the problem
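As a tiny illustration of the encoding idea (plain Java; the file name is hypothetical), each binary blob becomes one safe text line for streaming:

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;

public class EncodeForStreaming {
  public static void main(String[] args) throws Exception {
    // Read a binary file and emit it as a single "name<TAB>base64" text line,
    // which Hadoop Streaming can pass around without corrupting the payload.
    byte[] raw = Files.readAllBytes(Paths.get("image0001.png"));
    System.out.println("image0001.png\t" + Base64.getEncoder().encodeToString(raw));
  }
}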
Sorry, had missed that Todd had created a Jira -
HADOOP-5761: https://issues.apache.org/jira/browse/HADOOP-5761
Any progress there?
Thanks,
Lance
On Thu, May 14, 2009 at 8:52 AM, Lance Riedel la...@dotspots.com wrote:
Here is the point in the logs where the infinite loop begins - see time
stamp
Just in addition to my previous post...
You don't have to store the encoded files in a file system of course, since
you can write your own InputFormat which will do this on the fly... the
overhead should not be that big.
Piotr
2009/5/14 Piotr Praczyk piotr.prac...@gmail.com
Hi
If you want to
You can also get the input file name with conf.get("map.input.file") and
reuse the last part of the filename (i.e. part-0) with the
OutputCommitter.
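A small sketch of that (old mapred API; the class name and the way the file name is reused are just illustrative):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class InputFileAwareMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private Text fileName;

  public void configure(JobConf conf) {
    // Full path of the file backing this map task's split, e.g. ".../part-0"
    String inputFile = conf.get("map.input.file");
    fileName = new Text(inputFile.substring(inputFile.lastIndexOf('/') + 1));
  }

  public void map(LongWritable offset, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    // Tag every record with the name of the input file it came from.
    out.collect(fileName, line);
  }
}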
-Original Message-
From: jason hadoop [mailto:jason.had...@gmail.com]
Sent: 14 May 2009 16:25
To: core-user@hadoop.apache.org
Subject:
The secondary namenode takes a snapshot
at 5 minute (configurable) intervals,
This is a bit too aggressive.
Checkpointing is still an expensive operation.
I'd say every hour or even every day.
Isn't the default 3600 seconds?
Koji
-Original Message-
From: jason hadoop
Hey Koji,
It's an expensive operation - for the secondary namenode, not the
namenode itself, right? I don't particularly care if I stress out a
dedicated node that doesn't have to respond to queries ;)
Locally we checkpoint+backup fairly frequently (not 5 minutes ...
maybe less than the
Before 0.19, fsimage/edits were in the same directory.
So whenever the secondary finishes checkpointing, it copies back the fsimage
while the namenode still keeps writing to the edits file.
Usually we observed some latency on the namenode side during that time.
HADOOP-3948 would probably help after
Along these lines, an even simpler approach, I would think, is:
1) set data.dir to local and create the data.
2) stop the datanode.
3) rsync local_dir network_dir.
4) start the datanode with data.dir set to network_dir.
There is no need to format or rebalance.
This way you can switch between local and
Philip Zeyliger wrote:
You could use ssh to set up a SOCKS proxy between your machine and
ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the
socket factory.
http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/
has more information.
very useful
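For reference, the client-side settings that post describes boil down to something like this (proxy host/port are placeholders; property names are as I recall them for that era, so verify against your release):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class SocksClient {
  public static FileSystem connect() throws Exception {
    Configuration conf = new Configuration();
    // Route Hadoop RPC through a local SOCKS proxy, e.g. one opened with "ssh -D 6666".
    conf.set("hadoop.rpc.socket.factory.class.default",
             "org.apache.hadoop.net.SocksSocketFactory");
    conf.set("hadoop.socks.server", "localhost:6666");
    return FileSystem.get(conf);
  }
}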
On Thu, May 14, 2009 at 10:25 AM, jason hadoop jason.had...@gmail.com wrote:
If you put up a discussion question on www.prohadoopbook.com, I will fill in
the example on how to do this.
Done. Thanks!
http://www.prohadoopbook.com/forum/topics/preserving-partition-file
Does anyone have upload performance numbers to share or suggested utilities
for uploading Hadoop input data to S3 for an EC2 cluster?
I'm finding EBS volume transfer to HDFS via put to be extremely slow...
--
Peter N. Skomoroch
617.285.8348
http://www.datawrangling.com
Hey All,
I am running Hadoop 0.19.1. One of my Mapper tasks was failing and the
problem that was reported was:
Task process exit with nonzero status of 1...
Looking through the mailing list archives, I got the impression that this
was only caused by a JVM crash.
After much hair pulling, I
http://www.freedomoss.com/clouddataingestion?
On Thu, May 14, 2009 at 1:23 PM, Peter Skomoroch
peter.skomor...@gmail.comwrote:
Does anyone have upload performance numbers to share or suggested utilities
for uploading Hadoop input data to S3 for an EC2 cluster?
I'm finding EBS volume transfer
Just thinking out loud here to see if anything strikes a chord.
Since you're talking about an access log, I imagine the data is pretty
skewed, i.e., a good percentage of the accesses are for one resource. If you
use the resource id as key, that means a good percentage of the intermediate
data is shuffled
Hello Everyone,
Actually I had a cluster which was up.
But I stopped the cluster as I wanted to format it, and now I can't start it back.
1) When I run start-dfs.sh I get the following on screen:
starting namenode, logging to
/Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out
In the map-side join, the input file name is not visible, as the input is
actually a composite of a large number of files.
I have started answering in www.prohadoopbook.com
On Thu, May 14, 2009 at 1:19 PM, Stuart White stuart.whi...@gmail.comwrote:
On Thu, May 14, 2009 at 10:25 AM, jason hadoop
You can have separate configuration files for the different datanodes.
If you are willing to deal with the complexity you can manually start them
with altered properties from the command line.
rsync or other means of sharing identical configs is simple and common.
Raghu, your technique will
A downside of this approach is that you will not likely have data locality
for the data on shared file systems, compared with data coming from an input
split.
That being said,
from your script, *hadoop dfs -get FILE -* will write the file to standard
out.
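The programmatic equivalent, if you'd rather do it from Java (a minimal sketch; pass the HDFS path as the first argument):

import java.io.InputStream;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class CatToStdout {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    // Stream the HDFS file to standard out, like `hadoop dfs -get FILE -`.
    InputStream in = fs.open(new Path(args[0]));
    IOUtils.copyBytes(in, System.out, 4096, true);
  }
}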
On Thu, May 14, 2009 at 10:01 AM, Piotr
You have to examine the datanode log files.
The namenode does not start the datanodes; the start script does.
The namenode passively waits for the datanodes to connect to it.
On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi forpan...@gmail.com wrote:
Hello Everyone,
Actually I had a cluster
Can you guide me on where I can find the datanode log files? I cannot find
them in $hadoop/logs.
I can only find the following files in the logs folder:
hadoop-hadoop-namenode-hadoopmaster.log
hadoop-hadoop-namenode-hadoopmaster.out
hadoop-hadoop-namenode-hadoopmaster.out.1
The datanode logs are on the datanode machines in the log directory.
You may wish to buy my book and read chapter 4 on HDFS management.
On Thu, May 14, 2009 at 9:39 PM, Pankil Doshi forpan...@gmail.com wrote:
Can u guide me where can I find datanode log files? As I cannot find it in
This is the log from the datanode.
2009-05-14 00:36:14,559 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 12 msecs
2009-05-14 01:36:15,768 INFO org.apache.hadoop.dfs.DataNode: BlockReport of
82 blocks got processed in 8 msecs
2009-05-14 02:36:13,975 INFO