Re: Question on running simultaneous jobs

2008-01-09 Thread Michael Bieniosek
Hadoop-0.14 introduced job priorities (https://issues.apache.org/jira/browse/HADOOP-1433); you might be able to get somewhere with this. Another possibility is to create two mapreduce clusters on top of the same dfs cluster. The mapred.tasktracker.tasks.maximum doesn't do what you think --
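
A rough sketch of what HADOOP-1433 enables, assuming the mapred.job.priority property it introduced (verify the exact name and accepted values against your 0.14 build):

    import org.apache.hadoop.mapred.JobConf;

    public class PriorityDemo {
      public static void main(String[] args) {
        JobConf job = new JobConf(PriorityDemo.class);
        job.setJobName("important-job");
        // Accepted values include VERY_HIGH, HIGH, NORMAL, LOW, VERY_LOW.
        job.set("mapred.job.priority", "HIGH");
        // ... set input/output paths, mapper, and reducer, then submit:
        // JobClient.runJob(job);
      }
    }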

Re: DFS Datanodes are suddenly not formatted

2007-12-06 Thread Michael Bieniosek
In your hadoop-site.xml, you can set <property> <name>hadoop.tmp.dir</name> <value>/hadoop</value> </property> This will put all the hadoop stuff in /hadoop. By default, this directory is /tmp/hadoop-$USER, which is probably worth a bug report. -Michael On 12/6/07 10:31 AM, Michael Harris

Re: Controlling the number of simultaneous jobs per machine - 0.15.0

2007-12-03 Thread Michael Bieniosek
You also might want to look at HADOOP-2300. On 12/2/07 7:33 PM, Jason Venner [EMAIL PROTECTED] wrote: We have jobs that require different resources and as such saturate our machines at different levels of parallelization. What we want to do in the driver is set the number of simultaneous jobs

Re: using hadoop within the Tomcat

2007-11-22 Thread Michael Bieniosek
Hi Eugeny, I do something like this in a jetty server, which I start with java -jar server.jar. To monitor hadoop jobs, I simply use the JobClient class and manually set the fs.default.name/mapred.job.tracker properties on the JobConf object used in the JobClient constructor. Since I don't have
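
A minimal sketch of that setup; the hostnames are placeholders, not real endpoints:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.JobStatus;

    public class JobMonitor {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.set("fs.default.name", "namenode.example.com:9000");      // placeholder
        conf.set("mapred.job.tracker", "jobtracker.example.com:9001"); // placeholder
        JobClient client = new JobClient(conf);
        JobStatus[] running = client.jobsToComplete(); // jobs not yet finished
        System.out.println(running.length + " job(s) still running");
      }
    }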

Re: Accessing EC2 / local client

2007-11-12 Thread Michael Bieniosek
In order to use hadoop dfs, your client must be able to talk to all your datanodes and the namenode. So you should: 1. Make sure you can talk to datanodes 2. Make sure your datanode reports its public ip/dns name to the namenode, not its internal amazon ip/dns name. You can check this on the
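
A hedged client-side smoke test for that connectivity requirement -- opening a file exercises both the namenode and a datanode. The hostname is a placeholder, and the era-appropriate host:port form of fs.default.name is assumed:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class DfsSmokeTest {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder: your namenode's public EC2 hostname.
        conf.set("fs.default.name", "ec2-public-hostname.amazonaws.com:9000");
        FileSystem fs = FileSystem.get(conf);               // talks to the namenode
        FSDataInputStream in = fs.open(new Path(args[0]));  // talks to a datanode
        System.out.println("first byte: " + in.read());
        in.close();
      }
    }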

Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base

2007-10-22 Thread Michael Bieniosek
You can tune the number of map tasks per node with the config variable mapred.tasktracker.tasks.maximum on the jobtracker (there is a patch to make it configurable on the tasktracker: see https://issues.apache.org/jira/browse/HADOOP-1245). -Michael On 10/22/07 5:53 PM, Lance Amundsen [EMAIL
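
A small sketch for checking the resulting capacity from a client, assuming the pre-0.18 ClusterStatus API (getMaxTasks() was later split into separate map and reduce variants):

    import org.apache.hadoop.mapred.ClusterStatus;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class CapacityCheck {
      public static void main(String[] args) throws Exception {
        // Assumes hadoop-site.xml on the classpath points at the jobtracker.
        ClusterStatus status = new JobClient(new JobConf()).getClusterStatus();
        System.out.println("tasktrackers: " + status.getTaskTrackers());
        // Roughly tasktrackers * mapred.tasktracker.tasks.maximum.
        System.out.println("max concurrent tasks: " + status.getMaxTasks());
      }
    }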

jdk6 on darwin (was: 14.1 to 14.2)

2007-10-12 Thread Michael Bieniosek
Does anybody know if there is a jdk6 available for Mac? I checked the Apple developer site, and there doesn't seem to be one available, despite blogs from last year claiming Apple was distributing it. Since I do my development work on a Mac, switching to jdk6 would be very difficult for me if

Re: HBase performance

2007-10-11 Thread Michael Bieniosek
MySQL and HBase are optimized for different operations. What are you trying to do? -Michael On 10/11/07 3:35 PM, Rafael Turk [EMAIL PROTECTED] wrote: Hi All, Does any one have comments about how HBase will perform in a 4 node cluster compared to an equivalent MySQL configuration? Thanks,

Re: new to hadoop and first question

2007-10-08 Thread Michael Bieniosek
It looks like you are treating a jobtracker as a namenode. Make sure fs.default.name is set to a namenode's address. By default, namenodes run on port 1, while jobtrackers run on port 10001. -Michael On 10/8/07 5:47 PM, Jim the Standing Bear [EMAIL PROTECTED] wrote: Hi Khalil, Yes, SSH

Re: Hadoop behind a Firewall

2007-09-11 Thread Michael Bieniosek
While you can proxy puts/gets to HDFS, this can dramatically decrease your bandwidth. The hadoop dfs client is pretty good about writing to/reading from multiple HDFS nodes simultaneously; a proxy makes this impossible. Of course, depending on your cluster size, network connection, and data

Re: Anybody using HDFS as a long term storage solution?

2007-09-06 Thread Michael Bieniosek
Well, there is an (undocumented?) way to get rack-awareness in the Datanode, so you could co-opt this to represent datacenter-awareness. I don't think there is such a rack-awareness ability for the DFSClient or TaskTracker though. -Michael On 9/6/07 3:10 PM, Torsten Curdt [EMAIL PROTECTED]

Re: Hadoop in an OSGi environment

2007-09-05 Thread Michael Bieniosek
The hadoop way of submitting patches is to create a JIRA issue for each patch so they can be tested and discussed separately. It looks like you have several unrelated changes in there. You'll also need to regenerate your patches against HEAD. It's always nice to have more contributors. I'm

Re: Query about number of task trackers specific to a site

2007-08-17 Thread Michael Bieniosek
, Neeraj -Original Message- From: Michael Bieniosek [mailto:[EMAIL PROTECTED] Sent: Friday, August 17, 2007 11:55 AM To: hadoop-user@lucene.apache.org; Mahajan, Neeraj Subject: Re: Query about number of task trackers specific to a site https://issues.apache.org/jira/browse/HADOOP

Re: Query about number of task trackers specific to a site

2007-08-17 Thread Michael Bieniosek
than 500 tasks. Each task tracker executed many tasks, but at all times I could see that 4 child processes were running on each machine. ~ Neeraj -Original Message- From: Michael Bieniosek [mailto:[EMAIL PROTECTED] Sent: Friday, August 17, 2007 1:01 PM To: Mahajan, Neeraj; hadoop

Is mapred-default.xml read for dfs config?

2007-08-16 Thread Michael Bieniosek
The wiki page http://wiki.apache.org/lucene-hadoop/HowToConfigure implies that mapred-default.xml is read for the dfs configuration, as well as for mapreduce jobs. But this doesn't appear to be true based on the code, as the string mapred-default.xml only appears in the mapred package. So in

Re: Error reporting from map function

2007-08-02 Thread Michael Bieniosek
On 8/2/07 5:20 AM, [EMAIL PROTECTED] [EMAIL PROTECTED] wrote: I've found the getMapTaskReports method in the JobClient class, but can't work out how to access it other than by creating a new instance of JobClient - but then that JobClient would be a different one to the one that was
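
A minimal sketch of calling getMapTaskReports from a second JobClient, which works fine since it only needs the jobtracker address (a placeholder here); the String job-id signature is assumed from the API of this era:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TaskReport;

    public class TaskReportDump {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf();
        conf.set("mapred.job.tracker", "jobtracker.example.com:9001"); // placeholder
        TaskReport[] maps = new JobClient(conf).getMapTaskReports(args[0]); // e.g. job_0001
        for (TaskReport r : maps) {
          System.out.println(r.getTaskId() + " " + r.getState());
          for (String diag : r.getDiagnostics()) {
            System.out.println("  " + diag); // per-task error messages
          }
        }
      }
    }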

Re: Killing a running job

2007-06-27 Thread Michael Bieniosek
bin/hadoop job -kill job_0001 On 6/27/07 11:11 AM, patrik [EMAIL PROTECTED] wrote: Is there a way to kill a job that's currently running? pb
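
The programmatic equivalent, sketched against the old JobClient API (assumes hadoop-site.xml on the classpath points at your jobtracker):

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class KillJob {
      public static void main(String[] args) throws Exception {
        RunningJob job = new JobClient(new JobConf()).getJob(args[0]); // e.g. job_0001
        if (job != null) {
          job.killJob();
          System.out.println("killed " + args[0]);
        }
      }
    }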

disappearing task logs in hadoop-0.13

2007-06-21 Thread Michael Bieniosek
Hi, I just upgraded my cluster to hadoop-0.13. I now notice that the task logs in userlogs/, which are viewable through the GUI, often get cut off in the middle of a task. I checked the file system, and it appears there's only a part-0 on the system. The tasktracker log doesn't seem to indicate

Re: Formatting the namenode

2007-06-15 Thread Michael Bieniosek
In hadoop-default.xml you should find: <property> <name>hadoop.tmp.dir</name> <value>/tmp/hadoop-${user.name}</value> <description>A base for other temporary directories.</description> </property> <property> <name>dfs.name.dir</name> <value>${hadoop.tmp.dir}/dfs/name</value> <description>Determines where on the

Re: Formatting the namenode

2007-06-15 Thread Michael Bieniosek
the /tmp/hadoop-user-name directory. Thanks A -Original Message- From: Michael Bieniosek [mailto:[EMAIL PROTECTED] Sent: Friday, June 15, 2007 11:31 AM To: hadoop-user@lucene.apache.org; Phantom Subject: Re: Formatting the namenode In hadoop-default.xml you should find

how do I get SYSLOG and STDOUT to go to the same place?

2007-06-14 Thread Michael Bieniosek
Hi, I noticed that HADOOP-975 and HADOOP-1000 made the log4j output from child VMs go to a different place than the stdout for the task. My tasks send some of their debugging information to stdout, and some of it to log4j. I'd like all this information to go to the same place, so that I can see the

Re: Setting up large clusters

2007-06-08 Thread Michael Bieniosek
The slaves connect to the master, not the other way around. I don't use a slaves file at all; I just point new tasktrackers at the jobtracker and everything just works (without restarting). My understanding is that the slaves file, if present, merely functions as an allow list of slaves that can

Re: Hadoop on Mac OSX

2007-05-31 Thread Michael Bieniosek
On 5/30/07 7:31 PM, Peter W. [EMAIL PROTECTED] wrote: Unsetting JAVA_PLATFORM gives an error message: % bin/hadoop jar hadoop-0.12.3-examples.jar pi 10 20 Exception in thread main java.lang.NoClassDefFoundError: OS This was fixed in https://issues.apache.org/jira/browse/HADOOP-1081

Re: Namenode cannot accept connection from datanode

2007-05-14 Thread Michael Bieniosek
a serverSocket on 9000, started with the same user on the same machine. And I am able to connect to it from all other machines. So is there some settings that will cause the namenode to only bind to the 9000 port on the local interface ? Cedric On 5/12/07, Michael Bieniosek [EMAIL PROTECTED

Re: Serializing code to nodes: no can do?

2007-04-18 Thread Michael Bieniosek
I'm not sure exactly what you're trying to do, but you can specify command line parameters to hadoop jar which you can interpret in your code. Your code can then write arbitrary config parameters before starting the mapreduce. Based on these configs, you can load specific jars in your mapreduce
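
A sketch of that pattern; my.extra.jars is a made-up property key used purely for illustration:

    import org.apache.hadoop.mapred.JobConf;

    public class ParamDriver {
      public static void main(String[] args) {
        JobConf job = new JobConf(ParamDriver.class);
        // Stash a command-line argument in the job config. A mapper can read
        // it back in configure(JobConf) via job.get("my.extra.jars") and load
        // classes from those jars before map() is first called.
        job.set("my.extra.jars", args[0]);
        // ... set input/output paths, mapper, and reducer, then:
        // JobClient.runJob(job);
      }
    }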

Re: bandwidth (Was: Re: Running on multiple CPU's)

2007-04-16 Thread Michael Bieniosek
What are you trying to do? Hadoop dfs has different goals than a network file system such as Samba. -Michael On 4/16/07 10:32 AM, jafarim [EMAIL PROTECTED] wrote: On linux and jvm6 with normal IDE disks and a gigabit Ethernet switch with corresponding NIC and with hadoop 0.9.11's HDFS. We wrote

Scaling hadoop up

2007-03-29 Thread Michael Bieniosek
Hi, When I try to scale Hadoop up to about 100 nodes on EC2 (single-cpu Xen), I notice things start to fall apart. For example, the jobtracker starts dropping requests with the message Call queue overflow discarding oldest call. I've also seen problems with the namenode where dfs requests fail
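
One common mitigation for dropped calls at this scale was raising the RPC handler counts on the master daemons. A sketch, assuming the property names from hadoop-default.xml of this era (defaults were 10); these are daemon-side settings that belong in hadoop-site.xml on the master and take effect on restart -- the Configuration API is used here only to record the names and types:

    import org.apache.hadoop.conf.Configuration;

    public class HandlerTuning {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // More handler threads also means a deeper call queue on the daemon side.
        conf.setInt("dfs.namenode.handler.count", 40);
        conf.setInt("mapred.job.tracker.handler.count", 40);
        System.out.println("namenode handlers: "
            + conf.getInt("dfs.namenode.handler.count", 10));
      }
    }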

Re: Scaling hadoop up

2007-03-29 Thread Michael Bieniosek
? -Michael On 3/29/07 1:37 PM, Doug Cutting [EMAIL PROTECTED] wrote: Michael Bieniosek wrote: When I try to scale Hadoop up to about 100 nodes on EC2 (single-cpu Xen), I notice things start to fall apart. For example, the jobtracker starts dropping requests with the message Call queue overflow