Compare data on HDFS side

2008-09-04 Thread Andrey Pankov
Hello,

Does anyone know if it is possible to compare data on the HDFS side and avoid
copying it to the local box? I mean, if I'd like to find the difference
between local text files I can use the diff command. If the files are on HDFS,
I have to get them from HDFS to the local box and only then run diff.
Copying files to the local fs is a bit annoying and can be problematic
when the files are huge, say 2-5 GB.
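
(One possible workaround, sketched here only as an illustration and not part of
the original thread: stream the files through "hadoop fs -cat" so nothing is
materialized on the local disk. The paths are hypothetical, and process
substitution assumes bash.)

  # Compare two HDFS files without copying them to the local fs first.
  diff <(hadoop fs -cat /user/andrey/file_a.txt) \
       <(hadoop fs -cat /user/andrey/file_b.txt)

  # If only an equal/not-equal verdict is needed, comparing checksums is cheaper:
  hadoop fs -cat /user/andrey/file_a.txt | md5sum
  hadoop fs -cat /user/andrey/file_b.txt | md5sum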

Thanks in advance.

-- 
Andrey Pankov


Re: datanodes in virtual networks.

2008-09-01 Thread Andrey Pankov
Hi, Dmitry!

Please take a look at the WebDAV server for HDFS. It already supports
read/write; more details at http://www.hadoop.iponweb.net/

On Mon, Sep 1, 2008 at 7:28 AM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:
 Dear hadoop users,



 Our lab is slowly switching from SGE to Hadoop, but not everything seems
 to be easy and obvious. We are in no way computer scientists; we're just
 physicists, biologists, and a couple of statisticians trying to solve our
 computational problems, so please take this into consideration if the
 questions look obvious to you.

 Our setup:

 1.   Data cluster - 4 RAIDed servers running Hadoop, with 2 TB of storage
 each. They all have real IP addresses, and one of them is reserved for the NameNode.

 2.   Computational cluster: 100 dual-core servers running Sun Grid
 Engine. They live on a virtual network (10.0.0.X) and can connect to the outside
 world, but are not accessible from outside the cluster. On these we don't have
 root access, and they are shared via SGE with other people, who get
 reasonably nervous when they see idle reserved servers.



 The basic idea is to create an on-demand computational cluster which, when needed,
 will reserve servers from the second cluster, run jobs, and release them.



 Currently this is done via a script that reserves one server for the namenode and 25
 servers for datanodes, copies data from the first cluster, runs the job, sends the
 result back, and releases the servers. I would still like to make the two clusters
 work together using one namenode.



 After a week of playing with Hadoop I couldn't answer some of my questions via
 thorough RTFM, so I'd really appreciate it if you could answer at least some of
 them in our context:



 1.   Is it possible to connect servers from the second cluster to the first
 namenode? What worries me is the implementation of the data-transfer protocol,
 because some of the nodes cannot be reached, although they can easily reach any
 other node. Will Hadoop try to establish connections in both directions to
 transfer data between nodes?



 2.   Is it possible to specify the reliability of a node, that is, to
 make a replica on a node with RAID installed count as two replicas, since its
 probability of failure is much lower?



 3.   I also bumped into problems with decommissioning. After I add the hosts
 I want to free to the dfs.hosts.exclude file and run refreshNodes, they are marked
 as Decommission in progress for days, even though the data is removed from them
 within the first several minutes. What I currently do is shut them down after
 some delay, but I really hope to see Decommissioned one day. What am I
 probably doing wrong?
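
 (For context, the workflow we follow, roughly; the config path and host names
 below are only examples:)

   # hadoop-site.xml on the namenode points dfs.hosts.exclude at a plain file:
   #   <property>
   #     <name>dfs.hosts.exclude</name>
   #     <value>/usr/local/hadoop/conf/exclude</value>
   #   </property>
   # The exclude file lists one hostname per line, e.g.:
   #   node23.cluster.local
   #   node24.cluster.local
   # Then the namenode is told to re-read the include/exclude lists:
   bin/hadoop dfsadmin -refreshNodes
   # Nodes should then show "Decommission in progress" and, once their blocks
   # are replicated elsewhere, "Decommissioned".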



 4.   The same question about dead hosts. I did a simple exercise: I
 created 20 datanodes on an empty cluster, then killed 15 of them and tried to
 store a file on HDFS. Hadoop failed because some nodes that it thinks are in
 service aren't accessible. Is it possible to tell Hadoop to remove these
 nodes from the list and not try to store data on them? My current
 solution is a hadoop stop/start via cron every hour.



 5.   We also have some external secure storage that can be accessed via
 NFS from the first DATA cluster, and it'd be great if I could somehow mount
 this storage to an HDFS folder and tell Hadoop that data written to that
 folder shouldn't be replicated but should instead go directly to NFS.



 6.   Ironically, none of us who use the cluster knows Java, and most tasks
 are launched via streaming with C++ programs/Perl scripts. The problem is
 how to write/read files from HDFS in this context; we currently use things
 like -moveFromLocal, but it doesn't seem to be the right answer, because it
 slows things down a lot.
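
 (For illustration, the kind of commands in question - the paths and script
 names here are hypothetical. Note that a streaming job itself reads its -input
 from HDFS and writes its -output to HDFS, so no extra copy step is needed
 around the job:)

   # Ad-hoc transfers between the local fs and HDFS from a non-Java client:
   bin/hadoop dfs -put local_results.txt /user/dmitry/results.txt
   bin/hadoop dfs -cat /user/dmitry/results.txt | ./postprocess.pl

   # The streaming job reads and writes HDFS paths directly:
   bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
     -input /user/dmitry/in -output /user/dmitry/out \
     -mapper ./mapper.pl -reducer ./reducer.pl \
     -file mapper.pl -file reducer.pl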



 7.   On one of the DataCluster machines we run a pretty large MySQL
 database, and we are wondering whether it is possible to spread the database
 across the cluster. Has anyone tried that?



 8.   Fuse-hdfs works great, but we really hope to be able to write to
 HDFS through it someday. How can that be enabled?



 9.   And maybe someone can point out where to look for ways to specify
 how to partition data for the map jobs. In some of our tasks, processing one
 line of the input file takes several minutes; currently we split these files
 into many one-line files and process them independently, but a simple
 streaming-compatible way to tell Hadoop that, for example, we want each task
 to take 10 lines, or to split a 10 KB input file into many map tasks, would
 help us a lot!
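
 (One possible approach, sketched only as an illustration; it assumes the Hadoop
 version in use ships org.apache.hadoop.mapred.lib.NLineInputFormat, and the
 paths and script names are hypothetical:)

   bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
     -inputformat org.apache.hadoop.mapred.lib.NLineInputFormat \
     -jobconf mapred.line.input.format.linespermap=10 \
     -input /user/dmitry/slow_input.txt \
     -output /user/dmitry/slow_output \
     -mapper ./process_line.pl \
     -file process_line.pl
   # Each map task then receives 10 input lines instead of a whole HDFS block.
   # Depending on the streaming version, the mapper may see the byte offset as
   # a key before each line.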







 Thanks in advance.









-- 
Andrey Pankov


Re: Streaming and subprocess error code

2008-05-15 Thread Andrey Pankov

Hi Rick,

Double checked my test. The syslog output contains a message about a non-zero 
exit code (in this case the mapper finished with a segfault):


2008-05-14 18:12:04,473 INFO org.apache.hadoop.streaming.PipeMapRed: 
PipeMapRed.waitOutputThreads(): subprocess exited with code 134 in 
org.apache.hadoop.streaming.PipeMapRed


stderr contains a message with a dump or something about the segfault.

The reduce task also finished with an error:

2008-05-14 20:28:34,128 INFO org.apache.hadoop.streaming.PipeMapRed: 
PipeMapRed.waitOutputThreads(): subprocess exited with code 55 in 
org.apache.hadoop.streaming.PipeMapRed


Nevertheless, the entire job is reported as successful:

08/05/14 18:12:03 INFO streaming.StreamJob:  map 0%  reduce 0%
08/05/14 18:12:05 INFO streaming.StreamJob:  map 100%  reduce 0%
08/05/14 18:12:06 INFO streaming.StreamJob:  map 100%  reduce 100%
08/05/14 18:12:06 INFO streaming.StreamJob: Job complete: 
job_200805131958_0020
08/05/14 18:12:06 INFO streaming.StreamJob: Output: 
/user/hadoop/data1_result





Rick Cox wrote:

Hi,

Thanks: that message indicates the stream.non.zero.exit.is.failure
feature isn't enabled for this task; the log is just reporting the
exit status, but not raising the RuntimeException that it would if the
feature were turned on.

I've had problems getting this parameter through from the command line
before. If you've got access, you could try setting it in the
hadoop-site.xml instead (I think it should be the tasktrackers that
read that parameter).
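
(For illustration, the hadoop-site.xml entry in question would look roughly like
this; treat it as a sketch rather than a quote from the thread:)

  <property>
    <name>stream.non.zero.exit.is.failure</name>
    <value>true</value>
  </property>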

(Sorry about the confusion here, we've been using that patch for so
long I had forgotten it wasn't yet released, and I'm not exactly sure
where we stand with these other bugs.)

rick

On Wed, May 14, 2008 at 11:05 PM, Andrey Pankov [EMAIL PROTECTED] wrote:

Hi Rick,

 Double checked my test. The syslog output contains a message about a non-zero exit
code (in this case the mapper finished with a segfault):

 2008-05-14 18:12:04,473 INFO org.apache.hadoop.streaming.PipeMapRed:
PipeMapRed.waitOutputThreads(): subprocess exited with code 134 in
org.apache.hadoop.streaming.PipeMapRed

 stderr contains a message with a dump or something about the segfault.

 The reduce task also finished with an error:

 2008-05-14 20:28:34,128 INFO org.apache.hadoop.streaming.PipeMapRed:
PipeMapRed.waitOutputThreads(): subprocess exited with code 55 in
org.apache.hadoop.streaming.PipeMapRed

 Nevertheless, the entire job is reported as successful:

 08/05/14 18:12:03 INFO streaming.StreamJob:  map 0%  reduce 0%
 08/05/14 18:12:05 INFO streaming.StreamJob:  map 100%  reduce 0%
 08/05/14 18:12:06 INFO streaming.StreamJob:  map 100%  reduce 100%
 08/05/14 18:12:06 INFO streaming.StreamJob: Job complete:
job_200805131958_0020
 08/05/14 18:12:06 INFO streaming.StreamJob: Output:
/user/hadoop/data1_result






 Rick Cox wrote:


Does the syslog output from a should-have-failed task contain
something like this?

   java.lang.RuntimeException: PipeMapRed.waitOutputThreads():
subprocess failed with code 1

(In particular, I'm curious if it mentions the RuntimeException.)

Tasks that consume all their input and then exit non-zero are
definitely supposed to be counted as failed, so there's either a
problem with the setup or a bug somewhere.

rick

On Wed, May 14, 2008 at 8:49 PM, Andrey Pankov [EMAIL PROTECTED] wrote:

Hi,

 I've tested this new option -jobconf
stream.non.zero.exit.status.is.failure=true. It seems to be working but is still
not good enough for me. When the mapper/reducer program has read all input data
successfully and fails after that, streaming still finishes successfully, so
there is no chance to learn about data post-processing errors in the
subprocesses :(



 Andrey Pankov wrote:



Hi Rick,

Thank you for the quick response! I see this feature is in trunk and not
available in the last stable release. Anyway, I will try whether it works for me
from the trunk, and will also check whether it catches segmentation faults.


Rick Cox wrote:



Try -jobconf stream.non.zero.exit.status.is.failure=true.

That will tell streaming that a non-zero exit is a task failure. To
turn that into an immediate whole job failure, I think configuring 0
task retries (mapred.map.max.attempts=1 and
mapred.reduce.max.attempts=1) will be sufficient.

rick

On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov [EMAIL PROTECTED] wrote:

Hi all,

 I'm looking for a way to force Streaming to shut down the whole job in the
case when some of its subprocesses exits with a non-zero error code.

 We have the following situation. Sometimes either the mapper or the reducer
crashes; as a rule it returns some exit code. In this case the entire streaming
job finishes successfully, but that's wrong. Almost the same happens when any
subprocess finishes with a segmentation fault.

 It's possible to check automatically whether a subprocess crashed only via
logs, but that means you need to parse tons of outputs/logs/dirs/etc.
 In order to find the logs of your job you have to know its jobid ~
job_200805130853_0016. I don't know an easy way to determine it - just scan
stdout for the pattern. Then find the logs of each mapper, each reducer, find ...

Re: Streaming and subprocess error code

2008-05-15 Thread Andrey Pankov

Hi Zheng,

Your help was spot on - it was my mistake, I mixed up the option names. 
Now it works as desired for me. Thanks a lot!
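
(For the record, a sketch of the kind of invocation involved, using the option
name Zheng points out below; whether it can be passed with -jobconf or must go
into hadoop-site.xml may depend on the Hadoop version, and the jar path,
input/output paths, and script names are only examples:)

  bin/hadoop jar contrib/streaming/hadoop-*-streaming.jar \
    -jobconf stream.non.zero.exit.is.failure=true \
    -jobconf mapred.map.max.attempts=1 \
    -jobconf mapred.reduce.max.attempts=1 \
    -input /user/andrey/data1 \
    -output /user/andrey/data1_result \
    -mapper ./mapper \
    -reducer ./reducer \
    -file mapper -file reducer
  # Any non-zero subprocess exit (a segfault shows up as code 134) should now
  # fail the task, and with max.attempts=1 the whole job fails immediately.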


Zheng Shao wrote:

See
https://issues.apache.org/jira/secure/attachment/12369344/exit-status-20
57-0.16.patch

The option is called stream.non.zero.exit.is.failure, not
stream.non.zero.exit.status.is.failure.


Some users (including me) are pushing to make this option default to
true, but there is no response yet.
Dhruba, maybe you can help push that? 



Zheng
-Original Message-
From: Joydeep Sen Sarma 
Sent: Wednesday, May 14, 2008 3:02 PM

To: Zheng Shao
Subject: FW: Streaming and subprocess error code

Looks like the bug is not fixed correctly in trunk ..

-Original Message-
From: Andrey Pankov [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 14, 2008 8:19 AM

To: core-user@hadoop.apache.org
Subject: Re: Streaming and subprocess error code

Hi,

I've tested this new option -jobconf 
stream.non.zero.exit.status.is.failure=true. It seems to be working but is still 
not good enough for me. When the mapper/reducer program has read all input data 
successfully and fails after that, streaming still finishes successfully,
so there is no chance to learn about data post-processing errors 
in the subprocesses :(


Andrey Pankov wrote:

Hi Rick,

Thank you for the quick response! I see this feature is in trunk and not
available in the last stable release. Anyway, I will try whether it works for me 
from the trunk, and will also check whether it catches segmentation faults.



Rick Cox wrote:

Try -jobconf stream.non.zero.exit.status.is.failure=true.

That will tell streaming that a non-zero exit is a task failure. To
turn that into an immediate whole job failure, I think configuring 0
task retries (mapred.map.max.attempts=1 and
mapred.reduce.max.attempts=1) will be sufficient.

rick

On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov [EMAIL PROTECTED] wrote:

Hi all,

 I'm looking for a way to force Streaming to shut down the whole job in the
case when some of its subprocesses exits with a non-zero error code.

 We have the following situation. Sometimes either the mapper or the reducer
crashes; as a rule it returns some exit code. In this case the entire streaming
job finishes successfully, but that's wrong. Almost the same happens when any
subprocess finishes with a segmentation fault.

 It's possible to check automatically whether a subprocess crashed only via
logs, but that means you need to parse tons of outputs/logs/dirs/etc.
 In order to find the logs of your job you have to know its jobid ~
job_200805130853_0016. I don't know an easy way to determine it - just scan
stdout for the pattern. Then find the logs of each mapper, each reducer, find a
way to parse them, etc., etc...

 So, is there an easy way to get the correct status of the whole streaming job,
or do I still have to build rather fragile parsing systems for such purposes?


 Thanks in advance.

 --
 Andrey Pankov










--
Andrey Pankov



Re: Streaming and subprocess error code

2008-05-14 Thread Andrey Pankov

Hi,

I've tested this new option -jobconf 
stream.non.zero.exit.status.is.failure=true. It seems to be working but is still 
not good enough for me. When the mapper/reducer program has read all input data 
successfully and fails after that, streaming still finishes successfully, 
so there is no chance to learn about data post-processing errors 
in the subprocesses :(


Andrey Pankov wrote:

Hi Rick,

Thank you for the quick response! I see this feature is in trunk and not 
available in the last stable release. Anyway, I will try whether it works for me 
from the trunk, and will also check whether it catches segmentation faults.



Rick Cox wrote:

Try -jobconf stream.non.zero.exit.status.is.failure=true.

That will tell streaming that a non-zero exit is a task failure. To
turn that into an immediate whole job failure, I think configuring 0
task retries (mapred.map.max.attempts=1 and
mapred.reduce.max.attempts=1) will be sufficient.

rick

On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov [EMAIL PROTECTED] wrote:

Hi all,

 I'm looking for a way to force Streaming to shut down the whole job in the
case when some of its subprocesses exits with a non-zero error code.

 We have the following situation. Sometimes either the mapper or the reducer
crashes; as a rule it returns some exit code. In this case the entire streaming
job finishes successfully, but that's wrong. Almost the same happens when any
subprocess finishes with a segmentation fault.

 It's possible to check automatically whether a subprocess crashed only via
logs, but that means you need to parse tons of outputs/logs/dirs/etc.
 In order to find the logs of your job you have to know its jobid ~
job_200805130853_0016. I don't know an easy way to determine it - just scan
stdout for the pattern. Then find the logs of each mapper, each reducer, find a
way to parse them, etc., etc...

 So, is there an easy way to get the correct status of the whole streaming job,
or do I still have to build rather fragile parsing systems for such purposes?


 Thanks in advance.

 --
 Andrey Pankov










--
Andrey Pankov



Streaming and subprocess error code

2008-05-13 Thread Andrey Pankov

Hi all,

I'm looking for a way to force Streaming to shut down the whole job in the case 
when some of its subprocesses exits with a non-zero error code.

We have the following situation. Sometimes either the mapper or the reducer could 
crash; as a rule it returns some exit code. In this case the entire streaming job 
finishes successfully, but that's wrong. Almost the same happens when any 
subprocess finishes with a segmentation fault.

It's possible to check automatically whether a subprocess crashed only via 
logs, but that means you need to parse tons of outputs/logs/dirs/etc.
In order to find the logs of your job you have to know its jobid ~ 
job_200805130853_0016. I don't know an easy way to determine it - just scan 
stdout for the pattern. Then find the logs of each mapper, each reducer, 
find a way to parse them, etc., etc...

So, is there an easy way to get the correct status of the whole streaming 
job, or do I still have to build rather fragile parsing systems for such 
purposes?


Thanks in advance.

--
Andrey Pankov



Re: Streaming and subprocess error code

2008-05-13 Thread Andrey Pankov

Hi Rick,

Thank you for the quick response! I see this feature is in trunk and not 
available in the last stable release. Anyway, I will try whether it works for me 
from the trunk, and will also check whether it catches segmentation faults.



Rick Cox wrote:

Try -jobconf stream.non.zero.exit.status.is.failure=true.

That will tell streaming that a non-zero exit is a task failure. To
turn that into an immediate whole job failure, I think configuring 0
task retries (mapred.map.max.attempts=1 and
mapred.reduce.max.attempts=1) will be sufficient.

rick

On Tue, May 13, 2008 at 8:15 PM, Andrey Pankov [EMAIL PROTECTED] wrote:

Hi all,

 I'm looking for a way to force Streaming to shut down the whole job in the case when
some of its subprocesses exits with a non-zero error code.

 We have the following situation. Sometimes either the mapper or the reducer crashes; as
a rule it returns some exit code. In this case the entire streaming job finishes
successfully, but that's wrong. Almost the same happens when any subprocess finishes
with a segmentation fault.

 It's possible to check automatically whether a subprocess crashed only via logs,
but that means you need to parse tons of outputs/logs/dirs/etc.
 In order to find the logs of your job you have to know its jobid ~
job_200805130853_0016. I don't know an easy way to determine it - just scan
stdout for the pattern. Then find the logs of each mapper, each reducer, find a
way to parse them, etc., etc...

 So, is there an easy way to get the correct status of the whole streaming job,
or do I still have to build rather fragile parsing systems for such purposes?

 Thanks in advance.

 --
 Andrey Pankov







--
Andrey Pankov



Run job not from namenode

2008-03-31 Thread Andrey Pankov

Hi all,

Currently I'm able to run map-reduce jobs from the box where the NameNode and 
JobTracker are running, but I'd like to run my jobs from a separate box 
from which I have access to HDFS. I have updated the params fs.default.name 
and mapred.job.tracker in the local hadoop dir to point to the cluster's 
master. Now Hadoop returns the following error:


[EMAIL PROTECTED]:/usr/local/hadoop-0.16.0$ bin/hadoop jar 
hadoop-0.16.0-examples.jar wordcount /user/username/gutenberg 
/user/username/gutenberg-output
08/03/31 10:21:46 INFO mapred.FileInputFormat: Total input paths to 
process : 3
org.apache.hadoop.ipc.RemoteException: java.io.IOException: 
/mnt/hadoop/mapred/system/job_200803210640_0852/job.xml: No such file or 
directory

at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:159)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:133)
at 
org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1083)

 ...

Here the account 'username' has passwordless access to the master box. The cluster 
runs on EC2.
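
(For reference, the local hadoop-site.xml entries I mean look roughly like this;
the host name and ports are placeholders for our actual master:)

  <property>
    <name>fs.default.name</name>
    <value>hdfs://master.example.com:9000</value>
  </property>
  <property>
    <name>mapred.job.tracker</name>
    <value>master.example.com:9001</value>
  </property>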


As an alternative I can run tasks via ssh, i.e.

ssh master /usr/local/hadoop-0.16.0/bin/hadoop jar 
/home/username/jobs/hadoop-0.16.0-examples.jar wordcount 
/user/username/gutenberg /user/username/gutenberg-output


But then you need to put your jar file on the NameNode box before you run it.

Thanks in advance.

--
Andrey Pankov



Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Andrey Pankov

Hi,

Did you see the hadoop-0.16.0/src/contrib/ec2/bin/start-hadoop script? It 
already contains this part:


echo Copying private key to slaves
for slave in `cat slaves`; do
  scp $SSH_OPTS $PRIVATE_KEY_PATH [EMAIL PROTECTED]:/root/.ssh/id_rsa
  ssh $SSH_OPTS [EMAIL PROTECTED] chmod 600 /root/.ssh/id_rsa
  sleep 1
done

Anyway, have you tried the hadoop-ec2 script? It works well for the task you 
described.



Prasan Ary wrote:

Hi All,
  I have been trying to configure Hadoop on EC2 for a large cluster (100-plus 
nodes). It seems that I have to copy the EC2 private key to all the machines in 
the cluster so that they can establish SSH connections to each other.
  For now it seems I have to run a script to copy the key file to each of the 
EC2 instances. I wanted to know if there is a better way to accomplish this.
   
  Thanks,

  PA

   


---
Andrey Pankov


Two input paths for reduce

2008-03-14 Thread Andrey Pankov

Hi,

A quick question: is it possible to specify two or more input paths just for 
reduce() (Java MapReduce)? It would avoid some of the overhead of using 
IdentityReducer / IdentityMapper in my case. Thanks!


---
Andrey Pankov


Re: Separate data-nodes from worker-nodes

2008-03-13 Thread Andrey Pankov

Thanks, Ted!

I also thought it was not a good idea to separate them; I was just 
wondering whether it is possible at all. Thanks!



Ted Dunning wrote:

It is quite possible to do this.

It is also a bad idea.

One of the great things about map-reduce architectures is that data is near
the computation so that you don't have to wait for the network.  If you
separate data and computation, you impose additional load on the cluster.

What this will do to your throughput is an open question and it depends a
lot on your programs.


On 3/13/08 1:42 AM, Andrey Pankov [EMAIL PROTECTED] wrote:


Hi,

Is it possible to configure a Hadoop cluster in such a manner that there
are separate data nodes and separate worker nodes? I.e., nodes
1, 2, 3 store data in HDFS, and nodes 3, 4, and 5 run the map-reduce jobs and
take their data from HDFS?

If it's possible, what would the impact on performance be? Any suggestions?

Thanks in advance,

--- Andrey Pankov





---
Andrey Pankov