Compare data on HDFS side
Hello,

Does anyone know whether it is possible to compare data on HDFS without copying it to a local box? I mean, to find the differences between local text files I can use the diff command, but if the files are on HDFS I have to fetch them to the local box first and only then run diff. Copying files to the local FS is a bit annoying and can be problematic when the files are huge, say 2-5 GB.

Thanks in advance.

-- Andrey Pankov
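One way the comparison can stay streamed, sketched under the assumption of a bash shell on a box with a Hadoop client configured for the cluster (the HDFS paths below are placeholders):

    # Quick equality check: stream each file through md5sum, so
    # nothing is written to the local disk.
    bin/hadoop dfs -cat /user/andrey/file_a.txt | md5sum
    bin/hadoop dfs -cat /user/andrey/file_b.txt | md5sum

    # Line-by-line diff of the two streams via bash process substitution.
    # Note that diff may still buffer a lot in memory on multi-GB files.
    diff <(bin/hadoop dfs -cat /user/andrey/file_a.txt) \
         <(bin/hadoop dfs -cat /user/andrey/file_b.txt)

If the checksums differ, the second command shows where, without ever materializing local copies of the files.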
Re: datanodes in virtual networks.
Hi, Dmitry!

Please take a look at the WebDAV server for HDFS. It supports read/write already; more details at http://www.hadoop.iponweb.net/

On Mon, Sep 1, 2008 at 7:28 AM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:

Dear hadoop users,

Our lab is slowly switching from SGE to hadoop, but not everything seems to be easy and obvious. We are in no way computer scientists; we're just physicists, biologists and a couple of statisticians trying to solve our computational problems, so please take this into consideration if the questions look obvious to you.

Our setup:

1. Data cluster: 4 RAIDed and Hadooped servers with 2 TB of storage each. They all have real IP addresses, and one of them is reserved for the NameNode.

2. Computational cluster: 100 dual-core servers running Sun Grid Engine. They live on a virtual network (10.0.0.X) and can connect to the outside world, but are not accessible from outside the cluster. On these we don't have root access, and they are shared via SGE with other people, who get reasonably nervous when they see idle reserved servers.

The basic idea is to create an on-demand computational cluster which, when needed, will reserve servers from the second cluster, run jobs and release them. Currently this is done via a script that reserves one server for the namenode and 25 servers for datanodes, copies data from the first cluster, runs the job, sends the results back and releases the servers. I still want to make them work together using one namenode.

After a week of playing with hadoop I couldn't answer some of my questions via a thorough RTFM, so I'd really appreciate it if you can answer at least some of them in our context:

1. Is it possible to connect servers from the second cluster to the first namenode? What worries me is the implementation of the data-transfer protocol, because some of the nodes cannot be reached, although they can easily reach any other node. Will hadoop try to establish connections both ways to transfer data between nodes?

2. Is it possible to specify the reliability of a node, that is, to make a replica on a node with RAID installed count as two replicas, since its probability of failure is much lower?

3. I also bumped into problems with decommissioning: after I add the hosts I want to free to the dfs.hosts.exclude file and run refreshNodes, they are marked as "Decommission in progress" for days, even though the data is removed from them within the first several minutes. What I currently do is shut them down after some delay, but I really hope to see "Decommissioned" one day. What am I probably doing wrong?

4. The same question about dead hosts. I do a simple exercise: I create 20 datanodes on an empty cluster, then I kill 15 of them and try to store a file on HDFS. Hadoop fails because some nodes that it thinks are in service aren't accessible. Is it possible to tell hadoop to remove these nodes from the list and not try to store data on them? My current solution is a hadoop stop/start via cron every hour.

5. We also have some external secure storage that can be accessed via NFS from the first data cluster, and it'd be great if I could somehow mount this storage to an HDFS folder and tell hadoop that data written to that folder shouldn't be replicated, but should instead go directly to NFS.

6. Ironically, none of us who use the cluster knows Java, and most tasks are launched via streaming with C++ programs/perl scripts. The problem is how to write/read files from HDFS in this context; we currently use things like -moveFromLocal, but that doesn't seem to be the right answer because it slows things down a lot.

7. On one of the data-cluster machines we run a pretty large MySQL database, and I'm just wondering whether it is possible to spread the database across the cluster. Has anyone tried that?

8. Fuse-hdfs works great, but we really hope to be able to write to HDFS someday. How do we enable it?

9. And maybe someone can point out where to look for ways to specify how to partition data for the map jobs. In some of our tasks, processing one line of the input file takes several minutes; currently we split these files into many one-line files and process them independently, but a simple streaming-compatible way to tell hadoop that, for example, we want each task to take 10 lines, or to split a small input file into several map tasks, would help us a lot (a possible approach is sketched after this message).

Thanks in advance.

-- Andrey Pankov
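One possible answer to question 9, sketched on the assumption that the installed Hadoop version ships org.apache.hadoop.mapred.lib.NLineInputFormat (it may not be present in older releases; the streaming jar path, mapper, and HDFS paths are placeholders):

    bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
      -input /user/dmitry/input \
      -output /user/dmitry/output \
      -mapper ./process_line \
      -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
      -jobconf mapred.line.input.format.linespermap=10

Each map task would then receive 10 lines of input instead of a whole block-sized split. With this input format the mapper may see the byte offset as a key in front of each line, so the script should be prepared to strip it.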
Re: Streaming and subprocess error code
Hi Rick,

I double-checked my test. The syslog output contains a message about the non-zero exit code (in this case the mapper finished with a segfault):

2008-05-14 18:12:04,473 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed.waitOutputThreads(): subprocess exited with code 134 in org.apache.hadoop.streaming.PipeMapRed

stderr contains a message with a dump or something about the segfault. The reducer also finished with an error:

2008-05-14 20:28:34,128 INFO org.apache.hadoop.streaming.PipeMapRed: PipeMapRed.waitOutputThreads(): subprocess exited with code 55 in org.apache.hadoop.streaming.PipeMapRed

Nevertheless, the entire job is reported as successful:

08/05/14 18:12:03 INFO streaming.StreamJob: map 0% reduce 0%
08/05/14 18:12:05 INFO streaming.StreamJob: map 100% reduce 0%
08/05/14 18:12:06 INFO streaming.StreamJob: map 100% reduce 100%
08/05/14 18:12:06 INFO streaming.StreamJob: Job complete: job_200805131958_0020
08/05/14 18:12:06 INFO streaming.StreamJob: Output: /user/hadoop/data1_result

Rick Cox wrote:

Hi,

Thanks: that message indicates the stream.non.zero.exit.is.failure feature isn't enabled for this task; the log is just reporting the exit status, not raising the RuntimeException that it would if the feature were turned on. I've had problems getting this parameter through from the command line before. If you've got access, you could try setting it in hadoop-site.xml instead (I think it is the tasktrackers that read that parameter). (Sorry about the confusion here; we've been using that patch for so long I had forgotten it wasn't yet released, and I'm not exactly sure where we stand with these other bugs.)

rick

On Wed, May 14, 2008 at 11:05 PM, Andrey Pankov [EMAIL PROTECTED] wrote: [...]

Rick Cox wrote:

Does the syslog output from a should-have-failed task contain something like this?

java.lang.RuntimeException: PipeMapRed.waitOutputThreads(): subprocess failed with code 1

(In particular, I'm curious whether it mentions the RuntimeException.) Tasks that consume all their input and then exit non-zero are definitely supposed to be counted as failed, so there's either a problem with the setup or a bug somewhere.

rick

On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov [EMAIL PROTECTED] wrote: [...]
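For reference, the cluster-wide setting Rick mentions would look roughly like this in hadoop-site.xml (a sketch; the tasktrackers would need to be restarted to pick it up):

    <property>
      <name>stream.non.zero.exit.is.failure</name>
      <value>true</value>
    </property>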
Re: Streaming and subprocess error code
Hi Zheng,

Your help was spot on; it was my mistake, I mixed up the option names. Now it works as desired for me. Thanks a lot!

Zheng Shao wrote:

See https://issues.apache.org/jira/secure/attachment/12369344/exit-status-2057-0.16.patch

The option is called stream.non.zero.exit.is.failure, not stream.non.zero.exit.status.is.failure. Some users (including me) are pushing to make this option default to true, but there is no response yet. Dhruba, maybe you can help push that?

Zheng

-Original Message-
From: Joydeep Sen Sarma
Sent: Wednesday, May 14, 2008 3:02 PM
To: Zheng Shao
Subject: FW: Streaming and subprocess error code

Looks like the bug is not fixed correctly in trunk ..

-Original Message-
From: Andrey Pankov [mailto:[EMAIL PROTECTED]]
Sent: Wednesday, May 14, 2008 8:19 AM
To: core-user@hadoop.apache.org
Subject: Re: Streaming and subprocess error code

[...]

-- Andrey Pankov
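Putting the corrected option name together with Rick's earlier retry advice, a complete invocation might look like this (a sketch; the streaming jar path, mapper/reducer names, and HDFS paths are placeholders):

    bin/hadoop jar contrib/streaming/hadoop-streaming.jar \
      -input /user/andrey/input \
      -output /user/andrey/output \
      -mapper ./my_mapper \
      -reducer ./my_reducer \
      -jobconf stream.non.zero.exit.is.failure=true \
      -jobconf mapred.map.max.attempts=1 \
      -jobconf mapred.reduce.max.attempts=1

With retries capped at one attempt, the first non-zero exit from a subprocess should fail its task and, with it, the whole job.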
Re: Streaming and subprocess error code
Hi,

I've tested this new option, -jobconf stream.non.zero.exit.status.is.failure=true. It seems to work, but it is still not good enough for me: when the mapper/reducer program has read all of its input data successfully and fails after that, streaming still finishes successfully, so there is no way to learn about data post-processing errors in the subprocesses :(

Andrey Pankov wrote: [...]

-- Andrey Pankov
Streaming and subprocess error code
Hi all,

I'm looking for a way to force Streaming to shut down the whole job when one of its subprocesses exits with a non-zero error code. We have the following situation: sometimes either the mapper or the reducer crashes, and as a rule it returns some exit code. In this case the entire streaming job finishes successfully, but that's wrong. It's almost the same when a subprocess finishes with a segmentation fault.

It's possible to check automatically whether a subprocess crashed only via the logs, but that means you need to parse tons of outputs/logs/dirs/etc. In order to find the logs of your job you have to know its jobid, e.g. job_200805130853_0016. I don't know an easy way to determine it other than scanning stdout for the pattern. Then you have to find the logs of each mapper and each reducer, find a way to parse them, etc., etc...

So, is there an easier way to get the correct status of the whole streaming job, or do I still have to build rather fragile parsing systems for such purposes?

Thanks in advance.

-- Andrey Pankov
Re: Streaming and subprocess error code
Hi Rick,

Thank you for the quick response! I see this feature is in trunk and not available in the last stable release. Anyway, I will try it from trunk, and will also check whether it catches segmentation faults.

Rick Cox wrote:

Try -jobconf stream.non.zero.exit.status.is.failure=true. That will tell streaming that a non-zero exit is a task failure. To turn that into an immediate whole-job failure, I think configuring 0 task retries (mapred.map.max.attempts=1 and mapred.reduce.max.attempts=1) will be sufficient.

rick

On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov [EMAIL PROTECTED] wrote: [...]

-- Andrey Pankov
Run job not from namenode
Hi all,

Currently I'm able to run map-reduce jobs from the box where the NameNode and JobTracker are running, but I'd like to run my jobs from a separate box from which I have access to HDFS. I have updated the params fs.default.name and mapred.job.tracker in the local hadoop config to point to the cluster's master. Now Hadoop returns the following error:

[EMAIL PROTECTED]:/usr/local/hadoop-0.16.0$ bin/hadoop jar hadoop-0.16.0-examples.jar wordcount /user/username/gutenberg /user/username/gutenberg-output
08/03/31 10:21:46 INFO mapred.FileInputFormat: Total input paths to process : 3
org.apache.hadoop.ipc.RemoteException: java.io.IOException: /mnt/hadoop/mapred/system/job_200803210640_0852/job.xml: No such file or directory
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:159)
        at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:133)
        at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1083)
        ...

Here the account 'username' has passwordless access to the master box. The cluster runs on EC2. As a workaround I can run tasks via ssh, i.e.

ssh master /usr/local/hadoop-0.16.0/bin/hadoop jar /home/username/jobs/hadoop-0.16.0-examples.jar wordcount /user/username/gutenberg /user/username/gutenberg-output

but then you have to copy your jar file to the NameNode box before running it.

Thanks in advance.

-- Andrey Pankov
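For comparison, the client-side settings the message describes would look roughly like this in conf/hadoop-site.xml on the submitting box (a sketch; hostnames and ports are placeholders, and depending on the Hadoop version fs.default.name may be given as host:port or as an hdfs:// URI):

    <property>
      <name>fs.default.name</name>
      <value>master.example.com:9000</value>
    </property>
    <property>
      <name>mapred.job.tracker</name>
      <value>master.example.com:9001</value>
    </property>

Both values must match what the cluster itself uses; one common cause of errors like the one above is a mismatched default filesystem, so that the job client writes its job files somewhere the JobTracker cannot see.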
Re: Hadoop on EC2 for large cluster
Hi,

Did you see the hadoop-0.16.0/src/contrib/ec2/bin/start-hadoop script? It already contains such a part:

    echo Copying private key to slaves
    for slave in `cat slaves`; do
      scp $SSH_OPTS $PRIVATE_KEY_PATH [EMAIL PROTECTED]:/root/.ssh/id_rsa
      ssh $SSH_OPTS [EMAIL PROTECTED] chmod 600 /root/.ssh/id_rsa
      sleep 1
    done

Anyway, did you try the hadoop-ec2 script? It works well for the task you described.

Prasan Ary wrote:

Hi All,

I have been trying to configure Hadoop on EC2 for a large cluster (100-plus instances). It seems that I have to copy the EC2 private key to all the machines in the cluster so that they can establish SSH connections. For now it seems I have to run a script to copy the key file to each of the EC2 instances. I wanted to know if there is a better way to accomplish this.

Thanks,
PA

--- Andrey Pankov
Two input paths for reduce
Hi,

A quick question: is it possible to specify two or more input paths for just the reduce() step (in a Java MapReduce job)? In my case it could avoid some of the overhead of the IdentityReducer / IdentityMapper workaround.

Thanks!

--- Andrey Pankov
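For the map side, at least, multiple input paths are supported directly, which is usually how this pattern is expressed. A minimal sketch with the old mapred API, relying on the default identity mapper and reducer (the class name and paths are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class TwoInputsJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(TwoInputsJob.class);
        // Any number of input paths may be added; all of them feed the
        // map phase. The reduce phase has no separate notion of input
        // paths: it always consumes the shuffled map output.
        conf.addInputPath(new Path("/user/andrey/input1"));
        conf.addInputPath(new Path("/user/andrey/input2"));
        conf.setOutputPath(new Path("/user/andrey/output"));
        JobClient.runJob(conf);
      }
    }

Since input paths belong to the map phase, feeding a reducer from two sources still means mapping both inputs (identity or otherwise) into one shuffle.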
Re: Separate data-nodes from worker-nodes
Thanks, Ted! I also thought it was not a good idea to separate them; I was just wondering whether it is possible at all. Thanks!

Ted Dunning wrote:

It is quite possible to do this. It is also a bad idea.

One of the great things about map-reduce architectures is that data is near the computation, so that you don't have to wait for the network. If you separate data and computation, you impose additional load on the cluster. What this will do to your throughput is an open question, and it depends a lot on your programs.

On 3/13/08 1:42 AM, Andrey Pankov [EMAIL PROTECTED] wrote:

Hi,

Is it possible to configure a hadoop cluster in such a manner that there are separate data nodes and separate worker nodes? I.e., nodes 1,2,3 store data in HDFS while nodes 3,4 and 5 run the map-reduce jobs and take data from HDFS? If it's possible, what would the impact on performance be? Any suggestions?

Thanks in advance,

--- Andrey Pankov

--- Andrey Pankov
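For anyone who does want to try it, the split can be made at the daemon level; a rough sketch, assuming stock 0.16-era scripts with every node's conf/hadoop-site.xml pointing at the same NameNode and JobTracker:

    # On storage-only nodes: run just the HDFS datanode.
    bin/hadoop-daemon.sh start datanode

    # On compute-only nodes: run just the MapReduce tasktracker.
    bin/hadoop-daemon.sh start tasktracker

As Ted notes, tasktrackers with no local datanode lose data locality entirely: every input block has to cross the network.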