Is Mapper's map method thread safe?

2009-05-14 Thread imcaptor
Dear all: Does anyone know whether Mapper's map method is thread safe? Thank you! imcaptor

Re: Is Mapper's map method thread safe?

2009-05-14 Thread Shengkai Zhu
Each mapper instance will be executed in a separate JVM. On Thu, May 14, 2009 at 2:04 PM, imcaptor imcap...@gmail.com wrote: Dear all: Any one knows Is Mapper's map method thread safe? Thank you! imcaptor -- 朱盛凯 Jash Zhu 复旦大学软件学院 Software School, Fudan University

Re: Is Mapper's map method thread safe?

2009-05-14 Thread jason hadoop
Ultimately it depends on how you write the Mapper.map method. The framework supports a MultithreadedMapRunner which lets you set the number of threads running your map method simultaneously. Chapter 5 of my book covers this. On Wed, May 13, 2009 at 11:10 PM, Shengkai Zhu geniusj...@gmail.com
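For reference, selecting the multithreaded runner in the old org.apache.hadoop.mapred API comes down to two job properties. A sketch for hadoop-site.xml or a per-job configuration, assuming the 0.19-era property names (the thread count of 10 is just an example and matches the runner's usual default):

```xml
<property>
  <name>mapred.map.runner.class</name>
  <value>org.apache.hadoop.mapred.lib.MultithreadedMapRunner</value>
</property>
<property>
  <name>mapred.map.multithreadedrunner.threads</name>
  <value>10</value>
</property>
```

The same can be set programmatically on a JobConf via setMapRunnerClass. Note the threads share one Mapper instance, which is why map itself must then be thread safe.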

Re: Selective output based on keys

2009-05-14 Thread jason hadoop
The customary practice is to have your Reducer.reduce method handle the filtering if you are reducing your output, or the Mapper.map method if you are not. On Wed, May 13, 2009 at 1:57 PM, Asim linka...@gmail.com wrote: Hi, I wish to output only selective records to the output files based on
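The filtering itself is plain conditional logic inside reduce (or map). A minimal sketch of the idea in Python, streaming-style, with a made-up predicate and tab-separated key/value records; in a Java Reducer.reduce the same test would simply guard the output.collect(...) call:

```python
# Sketch of selective output: pass through only records whose key matches
# a predicate. The predicate below (keys starting with "ERROR") is a
# made-up example; records are tab-separated "key\tvalue" lines.

def wanted(key):
    """Hypothetical filter: keep only keys that start with 'ERROR'."""
    return key.startswith("ERROR")

def filter_records(lines):
    """Yield only the key\tvalue records whose key passes the filter."""
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if wanted(key):
            yield key + "\t" + value
```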

Re: how to connect to remote hadoop dfs by eclipse plugin?

2009-05-14 Thread Rasit OZDAS
Why don't you use it with localhost? Does it have a disadvantage? As far as I know, there were several host-to-IP problems in Hadoop, but that was a while ago; I think these should have been solved. It can also be about the order of IP conversions in the IP table file. 2009/5/14 andy2005cst

Re: Regarding Capacity Scheduler

2009-05-14 Thread Billy Pearson
I am seeing the same problem posted on the list on the 11th, and it has not had any reply. Billy - Original Message - From: Manish Katyal manish.katyal-re5jqeeqqe8avxtiumw...@public.gmane.org Newsgroups: gmane.comp.jakarta.lucene.hadoop.user To:

managing hadoop using moab

2009-05-14 Thread Vishal Ghawate
Hi, I just wonder if we can use the Moab cluster suite for managing a Hadoop cluster. Vishal S. Ghawate

Re: How to do load control of MapReduce

2009-05-14 Thread zsongbo
We find the disk I/O is the major bottleneck.
Device: rrqm/s wrqm/s r/s w/s rsec/s wsec/s avgrq-sz avgqu-sz await svctm %util
sda 1.00 0.00 85.21 0.00 20926.32 0.00 245.58 31.59 364.49 11.77 100.28
sdb 5.76 4752.88 53.13 131.08

Append in Hadoop

2009-05-14 Thread Wasim Bari
Hi, Can someone tell me about the append functionality in Hadoop? Is it available now in 0.20? Regards, Wasim

Re: Append in Hadoop

2009-05-14 Thread Sasha Dolgy
It's available, although not suitable for production purposes as I've found / been told. Put the following in your $HADOOP_HOME/conf/hadoop-site.xml:
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>
-sd On Thu, May 14, 2009 at 1:27 PM, Wasim Bari wasimb...@msn.com

RE: Append in Hadoop

2009-05-14 Thread Vishal Ghawate
Is this property available in 0.20.0? I don't think it is there in prior versions. Vishal S. Ghawate From: Sasha Dolgy [sdo...@gmail.com] Sent: Thursday, May 14, 2009 6:03 PM To: core-user@hadoop.apache.org Subject: Re: Append in Hadoop it's available,

RE: public IP for datanode on EC2

2009-05-14 Thread Joydeep Sen Sarma
I changed the ec2 scripts to have fs.default.name assigned to the public hostname (instead of the private hostname). Now I can submit jobs remotely via the socks proxy (the problem below is resolved) - but the map tasks fail with an exception: 2009-05-14 07:30:34,913 INFO

Re: Append in Hadoop

2009-05-14 Thread Sasha Dolgy
Yep, I'm using it in 0.19.1 and have used it in 0.20.0. -sasha On Thu, May 14, 2009 at 1:35 PM, Vishal Ghawate vishal_ghaw...@persistent.co.in wrote: is this property available in 0.20.0 since I don't think it is there in prior versions Vishal S. Ghawate

Re: hadoop getProtocolVersion and getBuildVersion error

2009-05-14 Thread Starry SHI
Has nobody encountered these problems: Error register getProtocolVersion and Error register getBuildVersion? Starry /* Tomorrow is another day. So is today. */ On Tue, May 12, 2009 at 13:27, Starry SHI starr...@gmail.com wrote: Hi, all. Today I noticed that my hadoop cluster

Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
Hi Joydeep, The problem you are hitting may be because port 50001 isn't open, whereas from within the cluster any node may talk to any other node (because the security groups are set up to do this). However I'm not sure this is a good approach. Configuring Hadoop to use public IP addresses

Setting up another machine as secondary node

2009-05-14 Thread Rakhi Khatwani
Hi, I want to set up a cluster of 5 nodes in such a way that
node1 - master
node2 - secondary namenode
node3 - slave
node4 - slave
node5 - slave
How do we go about that? There is no property in hadoop-env where I can set the IP address for the secondary namenode. If I set node-1 and node-2 in

RE: Append in Hadoop

2009-05-14 Thread Vishal Ghawate
Where did you find that property? Vishal S. Ghawate From: Sasha Dolgy [sdo...@gmail.com] Sent: Thursday, May 14, 2009 6:09 PM To: core-user@hadoop.apache.org Subject: Re: Append in Hadoop yep, i'm using it in 0.19.1 and have used it in 0.20.0 -sasha On

Re: Setting up another machine as secondary node

2009-05-14 Thread David Ritch
First of all, the secondary namenode is not what you might think a secondary is - it's not a failover device. It does make a copy of the filesystem metadata periodically, and it integrates the edits into the image. It does *not* provide failover. Second, you specify its IP address in

Re: Append in Hadoop

2009-05-14 Thread Sasha Dolgy
search this list for that variable name. i made a post last week inquiring about appends() and was given enough information to go hunt down the info on google and jira On Thu, May 14, 2009 at 2:01 PM, Vishal Ghawate vishal_ghaw...@persistent.co.in wrote: where did you find that property

Re: hadoop getProtocolVersion and getBuildVersion error

2009-05-14 Thread Abhishek Verma
Hi Starry, I noticed the same problem when I copied hadoop-metrics.properties from my old hadoop-0.19 conf along with the other files. Make sure you are using the right version of the conf files. Hope that helps. -Abhishek. On Thu, May 14, 2009 at 7:48 AM, Starry SHI starr...@gmail.com wrote:

RE: public IP for datanode on EC2

2009-05-14 Thread Joydeep Sen Sarma
The ec2 documentation points to the use of public 'ip' addresses - whereas using public 'hostnames' seems safe, since they resolve to internal addresses from within the cluster (and to public IP addresses from outside). The only data transfer that I would incur while submitting jobs from

Indexing pdfs and docs

2009-05-14 Thread PORTO aLET
Hi, My company has about 50GB of pdfs and docs, and we would like to be able to do some text search over a web interface. Is there any good tutorial that specifies hardware requirements and software specs to do this? Regards

Re: public IP for datanode on EC2

2009-05-14 Thread Tom White
Yes, you're absolutely right. Tom On Thu, May 14, 2009 at 2:19 PM, Joydeep Sen Sarma jssa...@facebook.com wrote: The ec2 documentation point to the use of public 'ip' addresses - whereas using public 'hostnames' seems safe since it resolves to internal addresses from within the cluster (and

How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread Alexandra Alecu
Hi, I want to test how Hadoop and HBase are performing. I have a cluster with 1 namenode and 4 datanodes. I use Hadoop 0.19.1 and HBase 0.19.2. I first ran a few tests when the 4 datanodes use local storage specified in dfs.data.dir. Now, I want to see what is the tradeoff if I switch from

Re: Setting up another machine as secondary node

2009-05-14 Thread jason hadoop
any machine put in the conf/masters file becomes a secondary namenode. At some point there was confusion on the safety of more than one machine, which I believe was settled, as many are safe. The secondary namenode takes a snapshot at 5 minute (configurable) intervals, rebuilds the fsimage and
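The interval Jason mentions is controlled by the fs.checkpoint.period property (in seconds). A hadoop-site.xml sketch, assuming the 0.19-era property name; 3600 (one hour) is the usual default:

```xml
<property>
  <name>fs.checkpoint.period</name>
  <value>3600</value>
</property>
```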

Re: Indexing pdfs and docs

2009-05-14 Thread Piotr Praczyk
Hi First of all, you should probably know what you want to do exactly. Without this, it is hard to estimate any hardware requirements. I assume you want to use Hadoop for some kind of offline calculation used for web-based search later? In your place I would start with reading about how such

Map-side join: Sort order preserved?

2009-05-14 Thread Stuart White
I'm implementing a map-side join as described in chapter 8 of Pro Hadoop. I have two files that have been partitioned using the TotalOrderPartitioner on the same key into the same number of partitions. I've set mapred.min.split.size to Long.MAX_VALUE so that one Mapper will handle an entire

Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread Alexandra Alecu
Another possibility I am thinking about now, which is suitable for me as I do not actually have much data stored in the cluster when I want to perform this switch, is to set the replication level really high and then simply remove the local storage locations and restart the cluster. With a bit of

Re: Regarding Capacity Scheduler

2009-05-14 Thread Hemanth Yamijala
Manish, The pre-emption code in the capacity scheduler was found to require a good relook, and due to the inherent complexity of the problem it is likely to have issues of the type you have noticed. We have decided to relook at the pre-emption code from scratch, and to this effect removed it from the

Re: Map-side join: Sort order preserved?

2009-05-14 Thread jason hadoop
Sort order is preserved if your Mapper doesn't change the key ordering in output. Partition name is not preserved. What I have done is to manually work out what the partition number of the output file should be for each map task, by calling the partitioner on an input key, and then renaming the
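For the default HashPartitioner, working out the partition number for a key can be done outside the job as well. A rough Python sketch of the equivalent logic; a TotalOrderPartitioner would instead be driven by its sampled split points, so this is only illustrative:

```python
# Illustrative reimplementation of Hadoop's default HashPartitioner:
# partition = (key.hashCode() & Integer.MAX_VALUE) % numPartitions.
# java_string_hash mimics Java's String.hashCode with 32-bit wraparound.

def java_string_hash(s):
    """Java String.hashCode semantics (signed 32-bit result)."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def partition_for(key, num_partitions):
    """Which part-NNNNN output a string key would land in."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_partitions
```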

Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread jason hadoop
You can decommission the datanode, and then un-decommission it. On Thu, May 14, 2009 at 7:44 AM, Alexandra Alecu alexandra.al...@gmail.com wrote: Hi, I want to test how Hadoop and HBase are performing. I have a cluster with 1 namenode and 4 datanodes. I use Hadoop 0.19.1 and HBase 0.19.2.

Re: Infinite Loop Resending status from task tracker

2009-05-14 Thread Lance Riedel
Just had another cluster crash with the same issue. This is still a huge issue for us - still crashing our cluster every other night (actually almost every night now). Should we move to .20? Is there more information I can provide? Is this related to my other email Constantly getting

Re: Infinite Loop Resending status from task tracker

2009-05-14 Thread Lance Riedel
Here is the point in the logs where the infinite loop begins - see time stamp 2009-05-14 04:03:56,348: (JobTracker)
2009-05-14 04:03:56,324 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'attempt_200905122015_1168_m_29_0' from

hadoop streaming binary input / image processing

2009-05-14 Thread openresearch
All, I have read some recommendations regarding image (binary input) processing using Hadoop Streaming, which only accepts text out of the box for now. http://hadoop.apache.org/core/docs/current/streaming.html https://issues.apache.org/jira/browse/HADOOP-1722

Re: hadoop streaming binary input / image processing

2009-05-14 Thread Zak Stone
Hi Qiming, You might consider using Dumbo, which is a Python wrapper for Hadoop Streaming. The associated typedbytes module makes it easy for streaming programs to work with binary data: http://wiki.github.com/klbostee/dumbo http://wiki.github.com/klbostee/typedbytes

Re: hadoop streaming binary input / image processing

2009-05-14 Thread Piotr Praczyk
Hi If you want to read the files from HDFS and cannot pass the binary data, you can do some encoding of it (base 64 for example, but you can think about something more efficient, since the range of characters acceptable in the input string is wider than that used by Base64). It should solve the problem
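A sketch of that encoding idea in Python: wrap each binary payload as one Base64 text line so it survives streaming's line-oriented transport. The "name\tpayload" record framing below is invented for illustration:

```python
# Wrap binary payloads as Base64 text lines for Hadoop Streaming, whose
# stdin/stdout transport is line-oriented and would mangle raw bytes.
import base64

def encode_record(name, payload):
    """Pack (name, bytes) into one tab-separated, newline-safe text line."""
    return name + "\t" + base64.b64encode(payload).decode("ascii")

def decode_record(line):
    """Inverse of encode_record: recover (name, bytes) from one line."""
    name, _, b64 = line.rstrip("\n").partition("\t")
    return name, base64.b64decode(b64)
```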

Re: Infinite Loop Resending status from task tracker

2009-05-14 Thread Lance Riedel
Sorry, had missed that Todd had created a Jira - HADOOP-5761 (https://issues.apache.org/jira/browse/HADOOP-5761). Any progress there? Thanks, Lance On Thu, May 14, 2009 at 8:52 AM, Lance Riedel la...@dotspots.com wrote: Here is the point in the logs where the infinite loop begins - see time stamp

Re: hadoop streaming binary input / image processing

2009-05-14 Thread Piotr Praczyk
Just in addition to my previous post... You don't have to store the encoded files in a file system of course, since you can write your own InputFormat which will do this on the fly... the overhead should not be that big. Piotr 2009/5/14 Piotr Praczyk piotr.prac...@gmail.com Hi If you want to

RE: Map-side join: Sort order preserved?

2009-05-14 Thread Jingkei Ly
You can also get the input file name with conf.get("map.input.file") and reuse the last part of the filename (i.e. part-0) with the OutputCommitter. -Original Message- From: jason hadoop [mailto:jason.had...@gmail.com] Sent: 14 May 2009 16:25 To: core-user@hadoop.apache.org Subject:
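Extracting the trailing part-* component from such a path is a one-liner. A small Python sketch; the HDFS path below is a made-up example:

```python
# Derive the trailing part-NNNNN component from the input path a map task
# is reading (what conf.get("map.input.file") returns), so output files
# can reuse the same partition suffix.
import posixpath

def partition_suffix(input_file):
    """Last path component of an HDFS input path, e.g. 'part-00003'."""
    return posixpath.basename(input_file)
```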

RE: Setting up another machine as secondary node

2009-05-14 Thread Koji Noguchi
The secondary namenode takes a snapshot at 5 minute (configurable) intervals - this is a bit too aggressive. Checkpointing is still an expensive operation. I'd say every hour or even every day. Isn't the default 3600 seconds? Koji -Original Message- From: jason hadoop

Re: Setting up another machine as secondary node

2009-05-14 Thread Brian Bockelman
Hey Koji, It's an expensive operation - for the secondary namenode, not the namenode itself, right? I don't particularly care if I stress out a dedicated node that doesn't have to respond to queries ;) Locally we checkpoint+backup fairly frequently (not 5 minutes ... maybe less than the

RE: Setting up another machine as secondary node

2009-05-14 Thread Koji Noguchi
Before 0.19, fsimage/edits were in the same directory. So whenever the secondary finishes checkpointing, it copies back the fsimage while the namenode keeps writing to the edits file. Usually we observed some latency on the namenode side during that time. HADOOP-3948 would probably help after

Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread Raghu Angadi
Along these lines, an even simpler approach, I would think, is:
1) Set data.dir to the local directory and create the data.
2) Stop the datanode.
3) rsync local_dir network_dir
4) Start the datanode with data.dir pointing at network_dir.
There is no need to format or rebalance. This way you can switch between local and

Re: public IP for datanode on EC2

2009-05-14 Thread Raghu Angadi
Philip Zeyliger wrote: You could use ssh to set up a SOCKS proxy between your machine and ec2, and setup org.apache.hadoop.net.SocksSocketFactory to be the socket factory. http://www.cloudera.com/blog/2008/12/03/securing-a-hadoop-cluster-through-a-gateway/ has more information. very useful
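For reference, the client-side configuration described in that post amounts to two properties in hadoop-site.xml. A sketch; the proxy address is whatever your ssh -D tunnel listens on, and localhost:6666 is just an example:

```xml
<property>
  <name>hadoop.rpc.socket.factory.class.default</name>
  <value>org.apache.hadoop.net.SocksSocketFactory</value>
</property>
<property>
  <name>hadoop.socks.server</name>
  <value>localhost:6666</value>
</property>
```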

Re: Map-side join: Sort order preserved?

2009-05-14 Thread Stuart White
On Thu, May 14, 2009 at 10:25 AM, jason hadoop jason.had...@gmail.com wrote: If you put up a discussion question on www.prohadoopbook.com, I will fill in the example on how to do this. Done. Thanks! http://www.prohadoopbook.com/forum/topics/preserving-partition-file

Fast upload of input data to S3?

2009-05-14 Thread Peter Skomoroch
Does anyone have upload performance numbers to share or suggested utilities for uploading Hadoop input data to S3 for an EC2 cluster? I'm finding EBS volume transfer to HDFS via put to be extremely slow... -- Peter N. Skomoroch 617.285.8348 http://www.datawrangling.com

Task process exit with nonzero status of 1

2009-05-14 Thread g00dn3ss
Hey All, I am running Hadoop 0.19.1. One of my Mapper tasks was failing and the problem that was reported was: Task process exit with nonzero status of 1... Looking through the mailing list archives, I got the impression that this was only caused by a JVM crash. After much hair pulling, I

Re: Fast upload of input data to S3?

2009-05-14 Thread Jeff Hammerbacher
http://www.freedomoss.com/clouddataingestion? On Thu, May 14, 2009 at 1:23 PM, Peter Skomoroch peter.skomor...@gmail.com wrote: Does anyone have upload performance numbers to share or suggested utilities for uploading Hadoop input data to S3 for an EC2 cluster? I'm finding EBS volume transfer

Re: Large number of map output keys and performance issues.

2009-05-14 Thread Chuck Lam
just thinking out loud here to see if anything hits a chord. since you're talking about an access log, i imagine the data is pretty skewed. i.e., a good percentage of the access is for one resource. if you use resource id as key, that means a good percentage of the intermediate data is shuffled
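The usual mitigation for that kind of skew is local aggregation (a combiner, or combining inside the mapper) so a hot key contributes one partial count per mapper rather than one record per log line. A minimal Python sketch of the idea; the log-line format is a made-up example:

```python
# In-mapper combining for skewed keys: pre-sum counts locally so a hot
# resource id sends one (key, partial_count) record per mapper instead
# of one record per input line.
from collections import Counter

def mapper_local_counts(log_lines):
    """Count resource ids (first whitespace field) locally, emit once each."""
    counts = Counter(line.split()[0] for line in log_lines if line.strip())
    return sorted(counts.items())
```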

Datanodes fail to start

2009-05-14 Thread Pankil Doshi
Hello Everyone, Actually I had a cluster which was up, but I stopped the cluster as I wanted to format it. But I can't start it back up. 1) When I run start-dfs.sh I get the following on screen: starting namenode, logging to /Hadoop/hadoop-0.18.3/bin/../logs/hadoop-hadoop-namenode-hadoopmaster.out

Re: Map-side join: Sort order preserved?

2009-05-14 Thread jason hadoop
In the map-side join, the input file name is not visible, as the input is actually a composite of a large number of files. I have started answering on www.prohadoopbook.com On Thu, May 14, 2009 at 1:19 PM, Stuart White stuart.whi...@gmail.com wrote: On Thu, May 14, 2009 at 10:25 AM, jason hadoop

Re: How to replace the storage on a datanode without formatting the namenode?

2009-05-14 Thread jason hadoop
You can have separate configuration files for the different datanodes. If you are willing to deal with the complexity you can manually start them with altered properties from the command line. rsync or other means of sharing identical configs is simple and common. Raghu, your technique will

Re: hadoop streaming binary input / image processing

2009-05-14 Thread jason hadoop
A downside of this approach is that you will not likely have data locality for the data on shared file systems, compared with data coming from an input split. That being said, from your script, *hadoop dfs -get FILE -* will write the file to standard out. On Thu, May 14, 2009 at 10:01 AM, Piotr

Re: Datanodes fail to start

2009-05-14 Thread jason hadoop
You have to examine the datanode log files. The namenode does not start the datanodes; the start script does. The namenode passively waits for the datanodes to connect to it. On Thu, May 14, 2009 at 6:43 PM, Pankil Doshi forpan...@gmail.com wrote: Hello Everyone, Actually I had a cluster

Re: Datanodes fail to start

2009-05-14 Thread Pankil Doshi
Can you tell me where I can find the datanode log files? I cannot find them in $hadoop/logs. I can only find the following files in the logs folder:
hadoop-hadoop-namenode-hadoopmaster.log
hadoop-hadoop-namenode-hadoopmaster.out
hadoop-hadoop-namenode-hadoopmaster.out.1

Re: Datanodes fail to start

2009-05-14 Thread jason hadoop
The data node logs are on the datanode machines in the log directory. You may wish to buy my book and read chapter 4 on hdfs management. On Thu, May 14, 2009 at 9:39 PM, Pankil Doshi forpan...@gmail.com wrote: Can u guide me where can I find datanode log files? As I cannot find it in

Re: Datanodes fail to start

2009-05-14 Thread Pankil Doshi
This is a log from the datanode.
2009-05-14 00:36:14,559 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 12 msecs
2009-05-14 01:36:15,768 INFO org.apache.hadoop.dfs.DataNode: BlockReport of 82 blocks got processed in 8 msecs
2009-05-14 02:36:13,975 INFO