Hi!
I'm trying to set up a Hadoop 0.16.0 cluster on EC2/S3 (manually, not
using the Hadoop AMIs).
I've got the S3-based HDFS working, but I'm stumped when I try to get a
test job running:
[EMAIL PROTECTED]:~/hadoop-0.16.0$ time bin/hadoop jar
contrib/streaming/hadoop-0.16.0-streaming.jar
Found it, it was a security group setup problem ;(
Andreas
On Monday, 10.03.2008 at 16:49 +0100, Andreas Kostyrka wrote:
Hi!
I'm trying to set up a Hadoop 0.16.0 cluster on EC2/S3 (manually, not
using the Hadoop AMIs).
I've got the S3-based HDFS working, but I'm stumped when I try to get
testlogs-output -file
path-on-local-fs
Thanks,
Amareshwari
Andreas Kostyrka wrote:
Some additional details, in case it helps: the HDFS is hosted on AWS S3,
and the input file set consists of 152 gzipped Apache log files.
Thanks,
Andreas
On Tuesday, 18.03.2008, 22:17 +0100
Ok, tracked it down. Seems like Hadoop Streaming corrupts the input
files. Any way to force it to pass whole files to a one-to-one mapper?
TIA,
Andreas
On Wednesday, 19.03.2008 at 09:18 +0100, Andreas Kostyrka wrote:
The /home/hadoop/dist/workloadmf script is available on all nodes
Actually, I personally use the following two-part copy technique to copy
files to a cluster of boxes:
tar cf - myfile | dsh -f host-list-file -i -c -M tar xCfv /tmp -
The first tar packages myfile into a tar stream on stdout.
dsh then runs a tar that unpacks the stream into /tmp (in the above case on
all boxes listed in host-list-file).
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
provide the input files gzipped. Not a great difference (e.g. 50% slower
when not gzipped, plus it took more than twice as long to upload the
data to HDFS-on-S3 in the first place), but still probably relevant.
Andreas
On Monday,
On Mon, Mar 31, 2008 at 1:51 PM, Andreas Kostyrka [EMAIL PROTECTED]
wrote:
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
provide the input files gzipped. Not a great difference (e.g. 50% slower
when not gzipped, plus it took more than twice as long to upload the
data
, and that compressed files actually
increase the speed of jobs?
-Colin
On Mon, Mar 31, 2008 at 4:51 PM, Andreas Kostyrka [EMAIL PROTECTED]
wrote:
Well, on our EC2/HDFS-on-S3 cluster I've noticed that it helps to
provide the input files gzipped. Not a great difference (e.g. 50% slower
when not gzipped
Hi!
I just wondered if there is some Jython example that shows how to access
the HDFS from Jython, without running a mapreduce?
Andreas
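Since Jython can call the Hadoop Java API directly, a minimal sketch looks
roughly like the following. (Untested; it assumes the 0.17-era FileSystem API
-- Configuration, FileSystem.get, listStatus, open -- that hadoop's jars and
the conf directory with hadoop-site.xml are on Jython's CLASSPATH, and the
paths below are made up.)

from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path

conf = Configuration()            # picks up hadoop-site.xml from the CLASSPATH
fs = FileSystem.get(conf)         # whatever fs.default.name points at

for status in fs.listStatus(Path("/user/hadoop")):
    print("%s %d" % (status.getPath(), status.getLen()))

stream = fs.open(Path("/user/hadoop/some-file.txt"))   # hypothetical path
try:
    line = stream.readLine()      # FSDataInputStream inherits readLine()
    while line is not None:
        print(line)
        line = stream.readLine()
finally:
    stream.close()

Run it with the plain jython interpreter; no map/reduce job is involved, it
just talks to the configured filesystem.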
HDFS has slightly different design goals. It's not meant as a general-purpose
filesystem; it's meant as fast sequential input/output storage for hadoop's
map/reduce.
Andreas
On Tuesday, 08.04.2008 at 16:24 +0300, Mika Joukainen wrote:
Hi!
Yes, I'm aware that it's not good
Hi!
I'm getting the following hang when trying to run a streaming command:
[EMAIL PROTECTED]:~/hadoop-0.16.2$ time bin/hadoop jar
contrib/streaming/hadoop-0.16.2-streaming.jar -mapper '/home/hadoop/bin/llfp -f
[EMAIL PROTECTED] -t [EMAIL PROTECTED] -s heaven.kostyrka.org -d gen_dailysites
-d
Ok, a short grep in the sources suggests that the exceptions happen just
in the closeAll method of FileSystem, so there's no indication of what hadoop
is working on :(
On Monday, 14.04.2008 at 07:26 +0200, Andreas Kostyrka wrote:
Hi!
I'm getting the following hang, when trying to run a streaming
As another data point, the submitting Java process hangs in a futex call:
[EMAIL PROTECTED]:~# strace -p 3810
Process 3810 attached - interrupt to quit
futex(0xb7d6ebd8, FUTEX_WAIT, 3832, NULL
and hangs, hangs, hangs, ...
Andreas
On Monday, 14.04.2008 at 11:46 +0200, Andreas Kostyrka wrote:
Ok
stopped the submission after half a day).
Any ideas?
TIA,
Andreas Kostyrka
For streaming, which is just another Map/Reduce app, the input is text lines.
How they are interpreted by your app is up to your input class.
Andreas
On Tuesday, 27.05.2008 at 16:46, [EMAIL PROTECTED] wrote:
From: Doug Cutting
Hi!
I just wondered what other people use to access the hadoop webservers
when running on EC2?
Ideas that I had:
1.) opening ports 50030 and so on = not good, the data goes unprotected
over the internet. Even if I could enable some form of authentication, it
would still be plain http.
2.) Some kind of
is wrong with opening up the ports only to the hosts that you want to
have access to them. This is what I am currently doing; -s 0.0.0.0/0 means
everyone everywhere, so change it to -s my.ip.add.ress/32
On Wed, May 28, 2008 at 4:22 PM, Andreas Kostyrka [EMAIL PROTECTED]
wrote:
Hi!
I
script for this kind of tunneling.
Andreas
On Wed, May 28, 2008 at 1:51 PM, Andreas Kostyrka [EMAIL PROTECTED]
wrote:
What I wonder is what ports do I need to access?
50060 on all nodes.
50030 on the jobtracker.
Any other ports?
Andreas
On Wednesday, 28.05.2008, 13:37 -0700
On Tuesday 03 June 2008 08:35:10 Chris Douglas wrote:
I have no Java implementation of my job, sorry.
Since it's all in the map side, IdentityMapper/IdentityReducer is
fine, as long as both the splits and the number of reduce tasks are
the same.
The data is a representation of log lines,
Ok, a new dead job ;(
This time after 2.4GB/11.3M lines ;(
Any idea what I could do to debug this?
(No idea how to go about debugging a Java process that is distributed and
churns through GBs of data. How does one stabilize that kind of stuff to
generate a reproducible situation?)
Andreas
On Tuesday 03 June 2008 20:35:03 Chris Douglas wrote:
By not exactly small, do you mean each line is long or that there
are many records?
Well, not small in the sense that even if I could get my boss to allow me to
give you the data, transferring it might be painful. (E.g. the job that
Well, the basic trouble with EC2 is that clusters usually are not networks
in the TCP/IP sense.
This makes it painful to decide which URLs should be resolved where.
Plus, to make it even more painful, you cannot easily run it with one simple
SOCKS server, because you need to defer DNS resolution to the cluster side.
Hi!
I just wondered what semantics I can rely on concerning reducing:
-) All key/value pairs with a given key end up in the same reducer.
-) What I now wonder is: do all key/value pairs for a given key end up in one
sequence?
So basically, do reducers get something like file-a or file-b?
Sorry for replying to a private email on the mailing list, but I strongly
believe in leaving the next guy something to google ;)
Anyway, as you seem to be knowledgeable about sorting, one question:
Does hadoop provide all key/value tuples for a given key in one batch to the
reducer, or not?
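As far as the contract goes: the framework sorts and groups by key, so
everything for one key does arrive in one contiguous batch -- in the Java API
as a single reduce() call, in streaming as one uninterrupted run of lines on
stdin. A minimal streaming reducer sketch that relies only on that ordering
(tab-separated key/value is the streaming default; the counting is just a
stand-in for whatever aggregation you actually need):

#!/usr/bin/env python
import sys

current_key = None
count = 0

for line in sys.stdin:
    # input arrives sorted by key, so a key change means the previous key is done
    key = line.rstrip("\n").split("\t", 1)[0]
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, count))   # flush the finished key
        current_key = key
        count = 0
    count += 1

if current_key is not None:
    print("%s\t%d" % (current_key, count))           # flush the last key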
For me, I had to upgrade to 0.17.0, which made this problem go away magically.
No idea if that will solve your problem.
Andreas
On Thursday 12 June 2008 23:04:17 Rob Collins wrote:
In a previous life, I had no problems setting up a small cluster. Now I
have managed to mess it up. I see reports
Hi!
I'm running streaming tasks on hadoop 0.17.0, and wondered if anyone has an
approach to debugging the following situation:
-) the maps have all finished (100% in the http display),
-) some reducers are hanging, with the messages below.
Note that the job had 100 map tasks in all, so 58 seems
Another observation: the TaskTracker$Child was alive, and the reduce script
had hung on read(0, ) :(
Andreas
On Monday 30 June 2008 18:38:28 Runping Qi wrote:
Looks like the reducer stuck at shuffling phase.
What progress percentage do you see for the reducer in the web
GUI?
It is known that 0.17 does not handle shuffling well.
I think it has been 87% (meaning that 19 of 22 reducer tasks
On Tuesday 01 July 2008 02:00:00 Andreas Kostyrka wrote:
On Monday 30 June 2008 18:38:28 Runping Qi wrote:
Looks like the reducer stuck at shuffling phase.
What progress percentage do you see for the reducer in the web
GUI?
It is known that 0.17 does not handle shuffling well
On Tuesday 01 July 2008 09:36:18 Ashok Varma wrote:
Hi ,
I'm trying to install Fedora 8 as a guest OS in Xen on CentOS 5.2 64-bit.
I always get a 'failed to mount directory' error. I configured an NFS share,
but the installation still fails in the middle.
Slightly offtopic on a hadoop mailing
Hi!
I've noticed that streaming has big problems handling long lines.
In my special case the output of a reducer process takes a very long time to
run and sometimes crashes with a number of random effects, a Java OutOfMemory
being the nicest one.
(which is a fact. A reducer
On Wednesday 09 July 2008 05:56:28 Amar Kamat wrote:
Andreas Kostyrka wrote:
See attached screenshot, wonder how that could happen?
What Hadoop version are you using? Is this reproducible? Is it possible
to get the JT logs?
Hadoop 0.17.0
Reproducible: As such no. I did notice
On Thursday 17 July 2008 13:45:15 Gert Pfeifer wrote:
Did anyone try to get hadoop running on the Gnu java environment? Does
that work?
Considering how stable it runs on plain standard Sun JVM, I'd reserve the gij
task for the next monthly meeting of masochists anonymous.
Andreas
Cheers,
series?
/rant-mode
Sorry, this has been driving me up the walls and into an asylum till I compared
notes with a colleague, and decided that I'm not crazy ;)
Andreas
Thanks,
Devaraj
On 7/24/08 1:42 PM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
Hi!
I'm experiencing hung reducers
On Thursday 24 July 2008 21:40:22 Devaraj Das wrote:
On 7/25/08 12:09 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote:
On Thursday 24 July 2008 15:19:22 Devaraj Das wrote:
Could you try to kill the tasktracker hosting the task the next time
it happens? I just want to isolate the problem
On Thursday 24 July 2008 21:40:20 Lincoln Ritter wrote:
Hello all.
Has anybody ever tried/considered using the Bean Scripting Framework
within Hadoop? BSF seems nice since it allows two-way communication
between ruby and java. I'd love to hear your thoughts as I've been
trying to make this
On Thursday 24 July 2008 23:24:19 Lincoln Ritter wrote:
Why not use jruby?
Indeed! I'm basically working from the JRuby wiki page on Java
integration (http://wiki.jruby.org/wiki/Java_Integration). I'm taking
this one step at a time and, while I would love tighter integration,
the
On Friday 25 July 2008 15:18:24 James Moore wrote:
On Thu, Jul 24, 2008 at 10:48 PM, Venkat Seeth [EMAIL PROTECTED] wrote:
Why don't you use hadoop streaming?
I think that's a broader question - why doesn't everyone use
streaming?
There's no real difference between doing Hadoop in
On Saturday 26 July 2008 00:53:48 Joydeep Sen Sarma wrote:
Just as an aside - there is probably a general perception that streaming
is really slow (at least I had it).
The last time I did some profiling (in 0.15), the primary overheads from
streaming came from the scripting language (python is
On Monday 28 July 2008 13:31:42 wangxiaowei wrote:
Dear All,
I need to use Hadoop to read all the files in a given directory. I wonder
how to know whether a path is a directory or a file, and if it is a directory,
how can I get all the files in it?
Thanks very much.
getFileStatus and listPaths should get you there.
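Roughly, using those two calls -- shown here from Jython for consistency with
the earlier example; the same calls work from Java. This is an untested sketch
against the 0.1x API, and /user/hadoop/input is a made-up path:

from org.apache.hadoop.conf import Configuration
from org.apache.hadoop.fs import FileSystem, Path

def collect_files(fs, path, out):
    # getFileStatus(...).isDir() tells a directory from a plain file;
    # listPaths(...) returns the entries of a directory as Path objects.
    if fs.getFileStatus(path).isDir():
        for child in fs.listPaths(path):
            collect_files(fs, child, out)
    else:
        out.append(path)

fs = FileSystem.get(Configuration())
files = []
collect_files(fs, Path("/user/hadoop/input"), files)
for f in files:
    print(f)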
On Tuesday 29 July 2008 18:22:07 Paco NATHAN wrote:
Jason,
FWIW -- based on a daily batch process, requiring 9 Hadoop jobs in
sequence -- 100+2 EC2 nodes, 2 TB of data, 6 hrs run time.
We tend to see a namenode failing early, e.g., the problem advancing
exception in the values iterator,
Well, the only way to reliably fix the number of map tasks that I've found is
by using compressed input files; that forces hadoop to assign one and only
one file to each map task ;)
Andreas
On Thursday 31 July 2008 21:30:33 Gopal Gandhi wrote:
Thank you, finally someone has interest in my
On Friday 08 August 2008 11:43:50 Rong-en Fan wrote:
After looking into the streaming source, the answer is via environment
variables. For example, mapred.task.timeout is in
the mapred_task_timeout environment variable.
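To illustrate the point just quoted: inside a streaming task the jobconf
entries are ordinary environment variables with the dots turned into
underscores. A tiny mapper sketch (map_input_file, i.e. the path of the split
being read, is an assumption on my part -- run `env` inside a task to see what
your version actually exports):

#!/usr/bin/env python
import os
import sys

# mapred.task.timeout  -> os.environ["mapred_task_timeout"]
# map.input.file       -> os.environ["map_input_file"] (assumed name, see above)
infile = os.environ.get("map_input_file", "unknown")

for line in sys.stdin:
    sys.stdout.write("%s\t%s" % (infile, line))   # tag each line with its source file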
Well, another typical way to deal with that is to pass the parameters via
On Friday 08 August 2008 15:43:46 Lucas Nazário dos Santos wrote:
You are completely right. It's not safe at all. But this is what I have for
now: two computers distributed across the Internet. I would really appreciate
it if anyone could give me a hint on how to configure the namenode's IP in a
it.
On Fri, Aug 8, 2008 at 5:47 PM, Andreas Kostyrka
[EMAIL PROTECTED]wrote:
On Friday 08 August 2008 15:43:46 Lucas Nazário dos Santos wrote:
You are completely right. It's not safe at all. But this is what I have for
now: two computers distributed across the Internet. I
Hi!
My namenode has run out of space, and now I'm getting the following:
08/09/05 09:23:22 WARN dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /data_v1/2008/06/26/12/pub1-access-2008-06-26-11_52_07.log.gz because it does not exist
08/09/05 09:23:22 INFO ipc.Server:
will be able to figure out how to get rid of the last
incomplete record.
Another idea would be a tool or namenode startup mode that would make it
ignore EOFExceptions to recover as much of the edits as possible.
Andreas
On Friday 05 September 2008 13:30:34 Andreas Kostyrka wrote:
Hi!
My namenode has