Re: question on Hadoop configuration for non-CPU-intensive jobs - 0.15.1

2007-12-28 Thread Eric Baldeschwieler
I created HADOOP-2497 to describe this bug. Was your sequence file stored on HDFS? Because HDFS does provide checksums. On Dec 28, 2007, at 7:20 AM, Jason Venner wrote: Our OOM was being caused by a damaged sequence data file. We had assumed that the sequence files had checksums, which
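
For context, a hedged sketch of what "HDFS does provide checksums" means in practice; the path and the LongWritable/Text key-value types below are illustrative assumptions, not details from the thread. When a SequenceFile lives on HDFS, block checksums are verified as the reader pulls bytes, so a damaged file like the one described surfaces as a ChecksumException rather than silently corrupt records:

```java
// Sketch only: reads a SequenceFile off HDFS with the classic API.
// The path and key/value classes are hypothetical.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class ReadSeqFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("/data/example.seq"); // hypothetical path
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      LongWritable key = new LongWritable();
      Text value = new Text();
      // A ChecksumException propagates from next() if a block is corrupt.
      while (reader.next(key, value)) {
        System.out.println(key + "\t" + value);
      }
    } finally {
      reader.close();
    }
  }
}
```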

Re: lists

2007-12-05 Thread Eric Baldeschwieler
Hi Folks, Please ignore the last email I sent with this subject. I was just planning to pass some information to the guy in the next cube. Instead I shared it with the world. Whoops, E14

Re: Multiple output files, and controlling output file name...

2007-09-24 Thread Eric Baldeschwieler
yes please! On Sep 24, 2007, at 11:36 AM, Owen O'Malley wrote: On Sep 24, 2007, at 12:00 AM, Enis Soztutar wrote: Arun, could you please port the discussion to wiki. That would be very helpful. Thanks. Actually, I think it is better to put this kind of documentation either in the

Re: Hadoop in an OSGi environment

2007-09-06 Thread Eric Baldeschwieler
Sounds interesting. Let us know of any success you have! Cross posting to -dev so more folks will notice. On Sep 5, 2007, at 2:47 PM, David Savage wrote: Thx very much, sorry for the spam in previous mail in that case. Yep agreed, most of the changes were minor changes really - I'll do as

Re: Overhead of Java?

2007-09-06 Thread Eric Baldeschwieler
Hadoop has a lot of inefficiencies in it still. Most of them are not related to the language choice. If you look at what the per-node tasks are doing (as opposed to the name node and job tracker) you will see that very little real work is being done by Hadoop Java code. Plumbing bytes / I/O

Re: Use HDFS as a long term storage solution?

2007-09-05 Thread Eric Baldeschwieler
We are very interested in ideas and patches to improve the system's stability. This is very young software, but we are using it at very large scale and intend to keep enhancing it. We currently have a 2000 node file system with 3TB raw storage per node and are supporting millions of

Re: Hadoop success stories...

2007-09-04 Thread Eric Baldeschwieler
Responses to the list welcome. I know of several companies not on that list that are using it. It would be great to hear from you guys. E14 On Sep 4, 2007, at 6:59 AM, C G wrote: All: I am interested in hearing any success stories around deploying Hadoop in a commercial/non-academic

Re: Max number of files in HDFS?

2007-08-29 Thread Eric Baldeschwieler
Keeping all the data structures simple and in RAM lets us keep the transaction rate pretty high. Going to a DB while keeping the transaction rate up would require a lot of engineering, and would add complexity to administering the system. I'm not a fan of this approach, at least not

Re: Poly-reduce?

2007-08-24 Thread Eric Baldeschwieler
especially at scale! And we are testing on 1000-node clusters with long jobs. We see lots of failures per job. On Aug 24, 2007, at 4:20 PM, Ted Dunning wrote: On 8/24/07 12:11 PM, Doug Cutting [EMAIL PROTECTED] wrote: Using the same logic, streaming reduce outputs to the next map

Re: Reduce Performance

2007-08-22 Thread Eric Baldeschwieler
+1 On Aug 22, 2007, at 11:23 AM, Doug Cutting wrote: Thorsten Schuett wrote: In my case, it looks as if the loopback device is the bottleneck. So increasing the number of tasks won't help. Hmm. I have trouble believing that the loopback device is actually the bottleneck. What makes you

Re: Reduce Performance

2007-08-20 Thread Eric Baldeschwieler
Actually... I think it is greatly in the project's interest to have a really elegant one-node solution. It should certainly support multithreading, the web UI, etc. If it is trivial to write and use single-node jobs, then we can write an application once in map-reduce and use it either
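
As a hedged illustration of the single-node mode being discussed (property names as they appeared in the 0.x line; a sketch, not a recommended setup): pointing mapred.job.tracker at "local" runs the whole job in one JVM via the LocalJobRunner, so the same code runs unchanged on a real cluster later:

```java
import org.apache.hadoop.mapred.JobConf;

public class LocalModeSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Run the whole map-reduce pipeline in-process: no daemons needed.
    conf.set("mapred.job.tracker", "local");
    // Read and write the local filesystem instead of HDFS.
    conf.set("fs.default.name", "file:///");
    // From here, set the mapper/reducer and input/output paths as
    // usual, then submit with JobClient.runJob(conf).
  }
}
```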

Re: Loading data into HDFS

2007-08-07 Thread Eric Baldeschwieler
I'll have our operations folks comment on our current techniques. We use map-reduce jobs so that all nodes in the cluster copy from the source in parallel, generally using either HTTP(S) or HDFS protocol. We've seen write rates as high as 8.3 GBytes/sec on 900 nodes. This is network-limited. We
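
A minimal sketch of the pattern described above (the bundled distcp tool implements the same idea): a map-only job where the input is a hypothetical text file of "src dst" pairs and each map task copies one file, so the copy spreads across the cluster. Written against the classic org.apache.hadoop.mapred API as it looked once generics arrived; details vary by release:

```java
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CopyMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {
  private JobConf conf;

  public void configure(JobConf conf) { this.conf = conf; }

  public void map(LongWritable key, Text line,
                  OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    String[] pair = line.toString().split("\\s+"); // "src dst"
    Path src = new Path(pair[0]);
    Path dst = new Path(pair[1]);
    // FileUtil.copy just streams the bytes; the network, not the CPU,
    // is the limiting factor, matching the rates quoted above.
    FileUtil.copy(src.getFileSystem(conf), src,
                  dst.getFileSystem(conf), dst,
                  false /* don't delete source */, conf);
    out.collect(new Text(pair[0]), new Text("copied"));
  }
}
```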

Are you using Hadoop in your company??

2007-07-20 Thread Eric Baldeschwieler
Hi Folks, I'd love to hear more about how Hadoop is being used in the wild. If you are using Hadoop, please add your project to our PoweredBy page, and/or respond to this email. http://wiki.apache.org/lucene-hadoop/PoweredBy Thanks! E14

Re: HadoopStreaming

2006-10-23 Thread Eric Baldeschwieler
There may need to be some streaming specific follow on work. But I believe 489 will capture stderr. Anyone? On Oct 20, 2006, at 9:33 PM, Andrew McNabb wrote: On Fri, Oct 20, 2006 at 03:45:17PM -0700, Eric Baldeschwieler wrote: I filed: http://issues.apache.org/jira/browse/HADOOP-619

Re: HadoopStreaming

2006-10-20 Thread Eric Baldeschwieler
Hi Andrew, I filed: http://issues.apache.org/jira/browse/HADOOP-619 to address the -input issues. There is work in progress to address getting job debugging info. I think this will be coming out in the next release (8?). http://issues.apache.org/jira/browse/HADOOP-489 I'll let others

Re: HDFS assumptions

2006-09-26 Thread Eric Baldeschwieler
The limit is RAM in the namenode. Every file uses some of this non-scalable resource currently. So what is critical is that your total number of files remains small, where small is probably safely defined as 100s of thousands today. Bigger files let you use much more storage with the same
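
To make the "RAM is the limit" point concrete, a back-of-envelope sketch; the ~150 bytes-per-object figure and the counts are assumptions for illustration, not numbers from the thread:

```java
public class NamenodeHeapEstimate {
  public static void main(String[] args) {
    long bytesPerObject = 150L;    // assumed rough in-RAM cost per namespace object
    long files = 100000000L;       // hypothetical: 100 million files
    long blocksPerFile = 2L;       // assumed average blocks per file
    // Each file costs one file entry plus its block entries.
    long heapBytes = files * (1 + blocksPerFile) * bytesPerObject;
    System.out.printf("~%.1f GB of namenode heap%n", heapBytes / 1e9);
  }
}
```

Under these assumptions that is roughly 45 GB of heap, which is why fewer, bigger files stretch the same namenode much further: the cost scales with object count, not bytes stored.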

Re: MapReduce: specify a *DFS* path for mapred.jar property

2006-08-31 Thread Eric Baldeschwieler
Interesting thread. This relates to HADOOP-288, and also to the thread I started last week on using URLs in general for input arguments. Seems like we should just take a URL for the jar, which could be file: or hdfs:. Thoughts? On Aug 31, 2006, at 10:54 AM, Doug Cutting wrote: Frédéric Bertin
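
For reference, a hedged sketch of the current form versus the URL form under discussion. JobConf.setJar simply records the mapred.jar property; the hdfs: variant shown commented out is the proposal in this thread, not existing behavior, and the paths are hypothetical:

```java
import org.apache.hadoop.mapred.JobConf;

public class JarUrlSketch {
  public static void main(String[] args) {
    JobConf conf = new JobConf();
    // Current behavior: a local path, shipped to the cluster at submit time.
    conf.setJar("/home/user/wordcount.jar"); // hypothetical path
    // Proposed in this thread: accept a generic URL instead, e.g.
    //   conf.setJar("hdfs://namenode:9000/jars/wordcount.jar");
    // so the jar could already live in DFS (see HADOOP-288).
  }
}
```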

Re: Enhancement to TextInputFormat?

2006-07-06 Thread Eric Baldeschwieler
I think it is interesting. I think you'd want a way to specify that the target file is itself a list of additional URIs as well. That would support scenarios such as a .jsp on a master server that simply listed its slaves and then the slaves could list their local content. Might also