S3/EC2 setup problem: port 9001 unreachable

2008-03-10 Thread Andreas Kostyrka
Hi! I'm trying to set up a Hadoop 0.16.0 cluster on EC2/S3 (manually, not using the Hadoop AMIs). I've got the S3-based HDFS working, but I'm stumped when I try to get a test job running: [EMAIL PROTECTED]:~/hadoop-0.16.0$ time bin/hadoop jar contrib/streaming/hadoop-0.16.0-streaming.jar …

Re: S3/EC2 setup problem: port 9001 unreachable

2008-03-10 Thread Andreas Kostyrka
Found it; it was a security group setup problem ;( Andreas On Monday, 10.03.2008, 16:49 +0100, Andreas Kostyrka wrote: Hi! I'm trying to set up a Hadoop 0.16.0 cluster on EC2/S3 (manually, not using the Hadoop AMIs). I've got the S3-based HDFS working, but I'm stumped when I try to get …

Re: streaming problem

2008-03-19 Thread Andreas Kostyrka
…testlogs-output -file path-on-local-fs Thanks, Amareshwari Andreas Kostyrka wrote: Some additional details, in case it helps: the HDFS is hosted on AWS S3, and the input file set consists of 152 gzipped Apache log files. Thanks, Andreas On Tuesday, 18.03.2008, 22:17 +0100 …

Re: streaming problem

2008-03-19 Thread Andreas Kostyrka
Ok, tracked it down. It seems like Hadoop Streaming corrupts the input files. Is there any way to force it to pass whole files to a one-to-one mapper? TIA, Andreas On Wednesday, 19.03.2008, 09:18 +0100, Andreas Kostyrka wrote: The /home/hadoop/dist/workloadmf script is available on all nodes …

Re: Hadoop on EC2 for large cluster

2008-03-20 Thread Andreas Kostyrka
Actually, I personally use the following two-part copy technique to copy files to a cluster of boxes: tar cf - myfile | dsh -f host-list-file -i -c -M tar xCfv /tmp - The first tar packages myfile into a tar stream; dsh then runs a tar on each box that unpacks it (in the above case on all boxes listed in …
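A minimal sketch of wiring up the same two-part pipeline from Python via subprocess rather than an interactive shell; myfile and host-list-file are placeholders from the command above, and it assumes dsh is installed on the driving box:

#!/usr/bin/env python
# Sketch of the two-part copy described above. First part: tar packages
# myfile into a tar stream on stdout. Second part: dsh forwards the
# stream (-i) concurrently (-c) to every host in the list, where a
# remote tar unpacks it under /tmp.
import subprocess

tar = subprocess.Popen(["tar", "cf", "-", "myfile"], stdout=subprocess.PIPE)
dsh = subprocess.Popen(
    ["dsh", "-f", "host-list-file", "-i", "-c", "-M",
     "tar", "xCfv", "/tmp", "-"],
    stdin=tar.stdout,
)
tar.stdout.close()  # so dsh sees EOF once tar finishes
dsh.wait()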

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to provide the input files gzipped. Not a huge difference (e.g. 50% slower when not gzipped, plus it took more than twice as long to upload the uncompressed data to HDFS-on-S3 in the first place), but still probably relevant. Andreas On Monday, …

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
…On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to provide the input files gzipped. Not a huge difference (e.g. 50% slower when not gzipped, plus it took more than twice as long to upload the data …

Re: Hadoop streaming performance problem

2008-03-31 Thread Andreas Kostyrka
…and that compressed files actually increase the speed of jobs? -Colin On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to provide the input files gzipped. Not a huge difference (e.g. 50% slower when not gzipped …

HDFS access/Jython examples

2008-04-07 Thread Andreas Kostyrka
Hi! I just wondered if there is some Jython example that shows how to access HDFS from Jython, without running a MapReduce job? Andreas
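No example surfaces in the thread, but a minimal Jython sketch would look like the following, assuming the Hadoop core jar and its dependencies are on the Jython classpath; the file path is a placeholder:

# Read a file from HDFS without a MapReduce job.
from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path
from java.io import BufferedReader, InputStreamReader

conf = Configuration()      # picks up hadoop-site.xml from the classpath
fs = FileSystem.get(conf)   # the filesystem named in fs.default.name

stream = fs.open(Path("/user/hadoop/example.txt"))  # placeholder path
reader = BufferedReader(InputStreamReader(stream))
line = reader.readLine()
while line is not None:
    print(line)
    line = reader.readLine()
reader.close()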

Re: Newbie asking: ordinary filesystem above Hadoop

2008-04-08 Thread Andreas Kostyrka
HDFS has slightly different design goals. It's not meant as a general-purpose filesystem; it's meant as fast sequential input/output storage for Hadoop's map/reduce. Andreas On Tuesday, 08.04.2008, 16:24 +0300, Mika Joukainen wrote: Hi! Yes, I'm aware that it's not good …

hadoop 0.16.2 hangs

2008-04-13 Thread Andreas Kostyrka
Hi! I'm getting the following hang when trying to run a streaming command: [EMAIL PROTECTED]:~/hadoop-0.16.2$ time bin/hadoop jar contrib/streaming/hadoop-0.16.2-streaming.jar -mapper '/home/hadoop/bin/llfp -f [EMAIL PROTECTED] -t [EMAIL PROTECTED] -s heaven.kostyrka.org -d gen_dailysites -d …

Re: hadoop 0.16.2 hangs

2008-04-14 Thread Andreas Kostyrka
Ok, a short grep in the sources suggests that the exceptions happen just in the closeAll method of FileSystem. So there is no indication of what Hadoop is working on :( On Monday, 14.04.2008, 07:26 +0200, Andreas Kostyrka wrote: Hi! I'm getting the following hang when trying to run a streaming …

Re: hadoop 0.16.2 hangs

2008-04-14 Thread Andreas Kostyrka
As another data point, the submitting Java process hangs in a futex call: [EMAIL PROTECTED]:~# strace -p 3810 Process 3810 attached - interrupt to quit futex(0xb7d6ebd8, FUTEX_WAIT, 3832, NULL … and it hangs, hangs, hangs. Andreas On Monday, 14.04.2008, 11:46 +0200, Andreas Kostyrka wrote: Ok …

hadoop 0.16.3 problems to submit job

2008-04-21 Thread Andreas Kostyrka
…stopped the submission after half a day). Any ideas? TIA, Andreas Kostyrka

RE: Re: splitting of big files?

2008-05-27 Thread Andreas Kostyrka
For streaming, which is just another Map/Reduce app, the input is text lines. How it's interpreted by your app is up to your input class. Andreas On Tuesday, 27.05.2008, 16:46 +, [EMAIL PROTECTED] wrote: - From: Doug Cutting …
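Concretely, a streaming mapper sees one complete text line per record on stdin, however the file was split; a minimal sketch (the tab-separated field layout is hypothetical):

#!/usr/bin/env python
# Hypothetical streaming mapper: each stdin line is one whole input
# record, because the input format aligns splits to line boundaries;
# the script never sees a partial line.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    # emit key<TAB>value; here the first field is treated as the key
    print("%s\t%s" % (fields[0], "1"))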

hadoop on EC2

2008-05-28 Thread Andreas Kostyrka
Hi! I just wondered what other people use to access the Hadoop web servers when running on EC2? Ideas that I had: 1.) Opening ports 50030 and so on = not good; the data goes unprotected over the Internet. Even if I could enable some form of authentication, it would still be plain HTTP. 2.) Some kind of …

Re: hadoop on EC2

2008-05-28 Thread Andreas Kostyrka
…is wrong with opening up the ports only to the hosts that you want to have access to them. This is what I am currently doing; -s 0.0.0.0/0 is everyone everywhere, so change it to -s my.ip.add.ress/32. On Wed, May 28, 2008 at 4:22 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: Hi! I …

Re: hadoop on EC2

2008-05-30 Thread Andreas Kostyrka
…script for this kind of tunneling. Andreas On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: What I wonder is: what ports do I need to access? 50060 on all nodes. 50030 on the jobtracker. Any other ports? Andreas On Wednesday, 28.05.2008, 13:37 -0700 …

Re: Stackoverflow

2008-06-03 Thread Andreas Kostyrka
On Tuesday 03 June 2008 08:35:10 Chris Douglas wrote: I have no Java implementation of my job, sorry. Since it's all on the map side, IdentityMapper/IdentityReducer is fine, as long as both the splits and the number of reduce tasks are the same. The data is a representation of log lines, …

Re: Stackoverflow

2008-06-03 Thread Andreas Kostyrka
Ok, a new dead job ;( This time after 2.4GB / 11.3M lines ;( Any idea what I could do to debug this? (I have no idea how to go about debugging a Java process that is distributed and handles GBs of data. How does one stabilize that kind of setup to produce a reproducible situation?) Andreas

Re: Stackoverflow

2008-06-03 Thread Andreas Kostyrka
On Tuesday 03 June 2008 20:35:03 Chris Douglas wrote: By not exactly small, do you mean each line is long or that there are many records? Well, not small in the sense that even if I could get my boss to allow me to give you the data, transferring it might be painful. (E.g. the job that …

Re: hadoop on EC2

2008-06-03 Thread Andreas Kostyrka
Well, the basic trouble with EC2 is that clusters usually are not networks in the TCP/IP sense. This makes it painful to decide which URLs should be resolved where. Plus, to make it even more painful, you cannot easily run it with one simple SOCKS server, because you need to defer DNS …

key assignment to reducers

2008-06-06 Thread Andreas Kostyrka
Hi! I just wondered what semantics I can rely on concerning reducing: -) All key/value pairs with a given key end up in the same reducer. -) What I now wonder: do all key/value pairs for a given key end up in one contiguous sequence? So basically, do reducers get something like file-a or file-b?
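For streaming the answer to both is yes: reducer input arrives sorted by key, so all values for one key are contiguous and never interleaved with another key's. A minimal reducer sketch that relies on exactly that guarantee (summing counts; the tab-separated layout is the streaming default):

#!/usr/bin/env python
# Streaming reducer depending on contiguity: once the key changes,
# no more values for the previous key will ever arrive.
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))  # flush previous key
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))  # flush the last key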

Re: Patch

2008-06-13 Thread Andreas Kostyrka
Sorry for replying to a private email on the mailing list, but I strongly believe in leaving the next guy something to google ;) Anyway, as you seem to be knowledgeable about sorting, one question: does Hadoop provide all key/value tuples for a given key in one batch to the reducer, or not?

Re: What did I do wrong? (Too many fetch-failures)

2008-06-13 Thread Andreas Kostyrka
In my case, I had to upgrade to 0.17.0, which made this problem go away magically. No idea if that will solve your problem. Andreas On Thursday 12 June 2008 23:04:17 Rob Collins wrote: In a previous life, I had no problems setting up a small cluster. Now I have managed to mess it up. I see reports …

reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
Hi! I'm running streaming tasks on Hadoop 0.17.0, and wondered if anyone has an approach to debugging the following situation: -) the maps have all finished (100% in the HTTP display), -) some reducers are hanging, with the messages below. Note that the job had 100 map tasks in all, so 58 seems …

Re: reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
Another observation: the TaskTracker$Child was alive, and the reduce script had hung on read(0, …) :( Andreas

Re: reducers hanging problem

2008-06-30 Thread Andreas Kostyrka
On Monday 30 June 2008 18:38:28 Runping Qi wrote: Looks like the reducer is stuck in the shuffle phase. What progress percentage do you see for the reducer in the web GUI? It is known that 0.17 does not handle shuffling well. I think it has been 87% (meaning that 19 of 22 reducer tasks …

Re: reducers hanging problem

2008-07-01 Thread Andreas Kostyrka
On Tuesday 01 July 2008 02:00:00 Andreas Kostyrka wrote: On Monday 30 June 2008 18:38:28 Runping Qi wrote: Looks like the reducer is stuck in the shuffle phase. What progress percentage do you see for the reducer in the web GUI? It is known that 0.17 does not handle shuffling well …

Re: XEN guest OS

2008-07-03 Thread Andreas Kostyrka
On Tuesday 01 July 2008 09:36:18 Ashok Varma wrote: Hi, I'm trying to install Fedora 8 as a guest OS in Xen on CentOS 5.2 64-bit. I always get a failed-to-mount-directory error. I configured an NFS share, but the installation still fails in the middle. Slightly off-topic on a Hadoop mailing …

streaming problem

2008-07-08 Thread Andreas Kostyrka
Hi! I've noticed that streaming has big problems handling long lines. In my particular case the output of a reducer process takes a very long time to produce and sometimes crashes with a number of random effects, a Java OutOfMemoryError being the nicest one. (Which is a fact. A reducer …

Re: Finished or not?

2008-07-09 Thread Andreas Kostyrka
On Wednesday 09 July 2008 05:56:28 Amar Kamat wrote: Andreas Kostyrka wrote: See attached screenshot; I wonder how that could happen? What Hadoop version are you using? Is this reproducible? Is it possible to get the JT logs? Hadoop 0.17.0. Reproducible: as such, no. I did notice …

Re: running hadoop with gij

2008-07-17 Thread Andreas Kostyrka
On Thursday 17 July 2008 13:45:15 Gert Pfeifer wrote: Did anyone try to get Hadoop running on the GNU Java environment? Does that work? Considering how stable it runs on the plain standard Sun JVM, I'd reserve the gij task for the next monthly meeting of masochists anonymous. Andreas Cheers, …

Re: hadoop 0.17.1 reducer not fetching map output problem

2008-07-24 Thread Andreas Kostyrka
…series? /rant-mode Sorry, this has been driving me up the wall and into an asylum until I compared notes with a colleague and decided that I'm not crazy ;) Andreas Thanks, Devaraj On 7/24/08 1:42 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: Hi! I'm experiencing hung reducers …

Re: hadoop 0.17.1 reducer not fetching map output problem

2008-07-24 Thread Andreas Kostyrka
On Thursday 24 July 2008 21:40:22 Devaraj Das wrote: On 7/25/08 12:09 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote: On Thursday 24 July 2008 15:19:22 Devaraj Das wrote: Could you try to kill the tasktracker hosting the task the next time it happens? I just want to isolate the problem …

Re: Bean Scripting Framework?

2008-07-24 Thread Andreas Kostyrka
On Thursday 24 July 2008 21:40:20 Lincoln Ritter wrote: Hello all. Has anybody ever tried/considered using the Bean Scripting Framework within Hadoop? BSF seems nice since it allows two-way communication between Ruby and Java. I'd love to hear your thoughts, as I've been trying to make this …

Re: Bean Scripting Framework?

2008-07-24 Thread Andreas Kostyrka
On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote: Why not use JRuby? Indeed! I'm basically working from the JRuby wiki page on Java integration (http://wiki.jruby.org/wiki/Java_Integration). I'm taking this one step at a time and, while I would love tighter integration, the …

Re: Bean Scripting Framework?

2008-07-25 Thread Andreas Kostyrka
On Friday 25 July 2008 15:18:24 James Moore wrote: On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth [EMAIL PROTECTED] wrote: Why don't you use Hadoop streaming? I think that's a broader question: why doesn't everyone use streaming? There's no real difference between doing Hadoop in …

Re: Bean Scripting Framework?

2008-07-27 Thread Andreas Kostyrka
On Saturday 26 July 2008 00:53:48 Joydeep Sen Sarma wrote: Just as an aside, there is probably a general perception that streaming is really slow (at least I had it). The last time I did some profiling (in 0.15), the primary overheads from streaming came from the scripting language (Python is …

Re: About HDFS` class Path

2008-07-28 Thread Andreas Kostyrka
On Monday 28 July 2008 13:31:42 wangxiaowei wrote: Dear all, I need to use Hadoop to read all files in a given directory. How do I know whether a path is a directory rather than a file, and if it is, how can I get all the files in the directory? Thanks very much. getFileStatus and listPaths should …
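A minimal Jython sketch of the two calls named in the reply, under the same classpath assumption as the Jython example earlier; /data is a placeholder path:

# Test whether a path is a directory and, if so, list its entries,
# using getFileStatus and listPaths as suggested above.
from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path

fs = FileSystem.get(Configuration())
p = Path("/data")                   # placeholder path

if fs.getFileStatus(p).isDir():
    for child in fs.listPaths(p):   # one Path object per entry
        print(child)
else:
    print("%s is a regular file" % p)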

Re: Question about fault tolerance and fail over for name nodes

2008-07-29 Thread Andreas Kostyrka
On Tuesday 29 July 2008 18:22:07 Paco NATHAN wrote: Jason, FWIW -- based on a daily batch process requiring 9 Hadoop jobs in sequence -- 100+2 EC2 nodes, 2 TB of data, 6 hrs run time. We tend to see a namenode failing early, e.g. the "problem advancing" exception in the values iterator, …

Re: How can I control Number of Mappers of a job?

2008-07-31 Thread Andreas Kostyrka
Well, the only way I've found to reliably fix the number of map tasks is to use compressed input files; that forces Hadoop to assign one and only one file to each map task ;) Andreas On Thursday 31 July 2008 21:30:33 Gopal Gandhi wrote: Thank you, finally someone has interest in my …

Re: access jobconf in streaming job

2008-08-08 Thread Andreas Kostyrka
On Friday 08 August 2008 11:43:50 Rong-en Fan wrote: After looking into the streaming source, the answer is: via environment variables. For example, mapred.task.timeout is in the mapred_task_timeout environment variable. Well, another typical way to deal with that is to pass the parameters via …
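A minimal sketch of reading a jobconf value from a streaming script this way; the identity pass-through just keeps the example a complete mapper:

#!/usr/bin/env python
# Streaming exports each jobconf property to the child's environment
# with dots replaced by underscores, e.g. mapred.task.timeout becomes
# mapred_task_timeout, as described above.
import os
import sys

timeout_ms = os.environ.get("mapred_task_timeout")
sys.stderr.write("task timeout: %s ms\n" % timeout_ms)  # goes to task logs

for line in sys.stdin:      # identity mapper; the config was read above
    sys.stdout.write(line)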

Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-08 Thread Andreas Kostyrka
On Friday 08 August 2008 15:43:46 Lucas Nazário dos Santos wrote: You are completely right. It's not safe at all. But this is what I have for now: two computers distributed across the Internet. I would really appreciate it if anyone could give me a hint on how to configure the namenode's IP in a …

Re: Setting up a Hadoop cluster where nodes are spread over the Internet

2008-08-10 Thread Andreas Kostyrka
…it. On Fri, Aug 8, 2008 at 5:47 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote: On Friday 08 August 2008 15:43:46 Lucas Nazário dos Santos wrote: You are completely right. It's not safe at all. But this is what I have for now: two computers distributed across the Internet. I …

critical name node problem

2008-09-05 Thread Andreas Kostyrka
Hi! My namenode has run out of space, and now I'm getting the following: 08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz because it does not exist 08/09/05 09:23:22 INFO ipc.Server:

Re: critical name node problem

2008-09-05 Thread Andreas Kostyrka
…will be able to figure out how to get rid of the last incomplete record. Another idea would be a tool or a namenode startup mode that ignores EOFExceptions, to recover as much of the edits as possible. Andreas On Friday 05 September 2008 13:30:34 Andreas Kostyrka wrote: Hi! My namenode has …