Re: MapReduce: specify a *DFS* path for mapred.jar property

2006-08-31 Thread Frédéric Bertin

Doug Cutting wrote:

Frédéric Bertin wrote:

// Set the user's name and working directory
String user = System.getProperty("user.name");
job.setUser(user != null ? user : "Dr Who");
if (job.getWorkingDirectory() == null) {
  job.setWorkingDirectory(fs.getWorkingDirectory());
}


This should run clientside, since it depends on the username, which is 
different on the server.
Then what about passing the username as a parameter to 
JobSubmissionProtocol.submitJob(...)? This avoids loading the whole 
JobConf clientside just to set the username.
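
A rough sketch of what that could look like (the two-argument submitJob 
signature and the stub interface below are hypothetical, not the actual 
Hadoop API):

import java.io.IOException;

// Hypothetical sketch only: submitJob(...) carrying the client's username,
// so the server never needs the whole JobConf just to fill in "user.name".
public class SubmitJobSketch {

  // Stand-in for JobSubmissionProtocol with an extra userName parameter.
  interface JobSubmissionProtocol {
    void submitJob(String jobFile, String userName) throws IOException;
  }

  static void submit(JobSubmissionProtocol jobTracker, String jobFile)
      throws IOException {
    // Resolve the user on the client, where System.getProperty("user.name")
    // reflects the submitting user rather than the server account.
    String user = System.getProperty("user.name");
    jobTracker.submitJob(jobFile, user != null ? user : "Dr Who");
  }
}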



FileSystem userFileSys = FileSystem.get(job);
Path[] inputDirs = job.getInputPaths();
boolean[] validDirs =
  job.getInputFormat().areValidInputDirectories(userFileSys, inputDirs);
for(int i=0; i < validDirs.length; ++i) {
  if (!validDirs[i]) {
    String msg = "Input directory " + inputDirs[i] +
                 " in " + userFileSys.getName() + " is invalid.";
    LOG.error(msg);
    throw new IOException(msg);
  }
}

// Check the output specification
job.getOutputFormat().checkOutputSpecs(fs, job);


Why not move it into the JobSubmissionProtocol (the JobTracker's 
submitJob method)?


These could probably run on the server.  They're currently run on the 
client in an attempt to return errors as quickly as possible when jobs 
are misconfigured.
Is it really quicker to run all those checks remotely from the client 
than to ask the JobTracker to run them locally? (Just a question, I 
really have no idea of the answer.)


Fred


Re: MapReduce: specify a *DFS* path for mapred.jar property

2006-08-31 Thread Doug Cutting

Frédéric Bertin wrote:
This should run clientside, since it depends on the username, which is 
different on the server.
Then what about passing the username as a parameter to 
JobSubmissionProtocol.submitJob(...)? This avoids loading the whole 
JobConf clientside just to set the username.


That sounds like a reasonable change to me.

Why not move it into the JobSubmissionProtocol (the JobTracker's 
submitJob method)?


These could probably run on the server.  They're currently run on the 
client in an attempt to return errors as quickly as possible when jobs 
are misconfigured.
Is it really quicker to run all those checks remotely from the client 
than to ask the JobTracker to run them locally? (Just a question, I 
really have no idea of the answer.)


We'd need to be careful that this is not a synchronized method on the 
server, so it doesn't interfere with other server activities.  Also, 
checking the input and output has to be much faster than the RPC 
timeout, which it should be, since this just checks for the existence of 
directories, not of individual files.
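
For illustration, a minimal sketch of that idea (the class and helper 
names are hypothetical, not actual JobTracker code): the filesystem 
checks stay outside the synchronized section, so only the brief 
bookkeeping on shared state holds the lock.

import java.io.IOException;

// Hypothetical sketch: server-side submitJob with validation kept outside
// any synchronized block, so slow filesystem checks cannot stall other RPCs.
public class ServerSideSubmitSketch {

  private final Object lock = new Object();

  public void submitJob(String jobFile, String userName) throws IOException {
    // Un-synchronized: directory-existence checks may touch the filesystem
    // and must still finish well within the RPC timeout.
    validateInputAndOutput(jobFile);

    // Only the short update of shared JobTracker state takes the lock.
    synchronized (lock) {
      registerJob(jobFile, userName);
    }
  }

  private void validateInputAndOutput(String jobFile) throws IOException {
    // Placeholder: would verify input directories exist and the output
    // specification is valid, as the client-side code does today.
  }

  private void registerJob(String jobFile, String userName) {
    // Placeholder: would enqueue the job for scheduling.
  }
}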


Doug


Re: MapReduce: specify a *DFS* path for mapred.jar property

2006-08-31 Thread Owen O'Malley


On Aug 31, 2006, at 9:48 AM, Frédéric Bertin wrote:


Doug Cutting wrote:

Frédéric Bertin wrote:

// Set the user's name and working directory
String user = System.getProperty("user.name");
job.setUser(user != null ? user : "Dr Who");
if (job.getWorkingDirectory() == null) {
  job.setWorkingDirectory(fs.getWorkingDirectory());
}


This should run clientside, since it depends on the username,  
which is different on the server.
Then what about passing the username as a parameter to  
JobSubmissionProtocol.submitJob(...)? This avoids loading the  
whole JobConf clientside just to set the username.


I don't understand what the problem is. The user sets up their job by  
creating a JobConf(). Do you already have the job.xml in dfs and just  
want to resubmit it? I don't think that will ever be the typical  
case.  I thought the original topic of this thread was the jar file.


-- Owen

Re: MapReduce: specify a *DFS* path for mapred.jar property

2006-08-31 Thread Eric Baldeschwieler

Interesting thread.

This relates to HADOOP-288.

Also the thread I started last week on using URLs in general for  
input arguments.  Seems like we should just take a URL for the jar,  
which could be file: or hdfs:


Thoughts?

On Aug 31, 2006, at 10:54 AM, Doug Cutting wrote:


Frédéric Bertin wrote:
This should run clientside, since it depends on the username,  
which is different on the server.
Then what about passing the username as a parameter to  
JobSubmissionProtocol.submitJob(...)? This avoids loading the  
whole JobConf clientside just to set the username.


That sounds like a reasonable change to me.

Why not move it into the JobSubmissionProtocol (the JobTracker's  
submitJob method)?


These could probably run on the server.  They're currently run on  
the client in an attempt to return errors as quickly as possible  
when jobs are misconfigured.
Is it really quicker to run all those checks remotely from the client  
than to ask the JobTracker to run them locally? (Just a  
question, I really have no idea of the answer.)


We'd need to be careful that this is not a synchronized method on  
the server, so it doesn't interfere with other server activities.   
Also, checking the input and output has to be much faster than the  
RPC timeout, which it should be, since this just checks for the  
existence of directories, not of individual files.


Doug




Re: MapReduce: specify a *DFS* path for mapred.jar property

2006-08-31 Thread Doug Cutting

Eric Baldeschwieler wrote:
Also the thread I started last week on using URLs in general for input 
arguments.  Seems like we should just take a URL for the jar, which 
could be file: or hdfs:


That would work.  The jobclient could automatically copy file: urls to 
the jobtracker's native fs.  Other kinds of uris could be passed as-is.
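
For illustration, a sketch of that copy-if-local logic (the class and 
helper names are made up, not JobClient code): file: jars get uploaded, 
everything else is passed through untouched.

import java.net.URI;

// Hypothetical sketch of the idea above: copy a local (file:) jar to the
// jobtracker's native filesystem, pass any other URI (e.g. hdfs:) as-is.
public class JarUriSketch {

  // Stand-in for whatever would actually perform the upload.
  interface FsUploader {
    URI copyToJobTrackerFs(URI localJar) throws Exception;
  }

  static URI resolveJarUri(URI jarUri, FsUploader uploader) throws Exception {
    String scheme = jarUri.getScheme();
    if (scheme == null || "file".equals(scheme)) {
      // Local jar: upload it and use the resulting remote path.
      return uploader.copyToJobTrackerFs(jarUri);
    }
    // Already remote: hand the URI to the jobtracker unchanged.
    return jarUri;
  }
}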


Doug


RE: URIs and hadoop

2006-08-31 Thread Michel Tourn
On URIs:

I had to learn more about URIs while looking at WebDAV code...
I am starting to like them.

Below, the file: scheme is really for local files.
Hadoop Path objects would use the hdfs: scheme.

Some developers like named pipes.
You can write to an existing named pipe from Java.
But this is not well supported in Java or on Windows
(Cygwin named pipes only work between Cygwin applications).

So I also added support for a socket endpoint. 
To connect them:
use nc -l -p 123 and -mapsideoutput socket://localhost:123

All these variations unify well using standard URI syntax.
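
As a purely illustrative sketch (not hadoopStreaming code; only the 
socket://host:port form follows the mail above), opening such an 
endpoint from Java is just a URI parse plus a Socket:

import java.io.OutputStream;
import java.net.Socket;
import java.net.URI;

// Hypothetical sketch: turn a socket://host:port side-output URI into an
// OutputStream; the listener on the other end could be "nc -l -p 123".
public class SideOutputSketch {

  static OutputStream openSocketOutput(String uriString) throws Exception {
    URI uri = new URI(uriString);               // e.g. socket://localhost:123
    if (!"socket".equals(uri.getScheme())) {
      throw new IllegalArgumentException("not a socket: URI: " + uriString);
    }
    Socket socket = new Socket(uri.getHost(), uri.getPort());
    return socket.getOutputStream();
  }
}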

The reason you may want to use a socket or named pipe as 
your map output:
to do a huge streaming computation: 
get all your part-k files out of HDFS and 
process them on-the-fly, in global order,
from the comfort of your home.

-Michel

---
With hadoopStreaming syntax:

  -input +/in/part-0 | /in/part-1 | .. 

To specify a single side-effect output file:
  
  -mapsideoutput [file:/C:/win|file:/unix|socket://host:port]

  If the jobtracker is local, it makes sense to use a local file.
  This currently requires -reducer NONE and num.map.tasks=1.


 -Original Message-
 From: Eric Baldeschwieler [mailto:[EMAIL PROTECTED]
 Sent: Thursday, August 31, 2006 11:19 AM
 To: hadoop-user@lucene.apache.org
 Cc: Owen O'Malley
 Subject: Re: MapReduce: specify a *DFS* path for mapred.jar property
 
 Interesting thread.
 
 This relates to HADOOP-288.
 
 Also the thread I started last week on using URLs in general for
 input arguments.  Seems like we should just take a URL for the jar,
 which could be file: or hdfs:
 
 Thoughts?
 
 On Aug 31, 2006, at 10:54 AM, Doug Cutting wrote:
 
  Frédéric Bertin wrote:
  This should run clientside, since it depends on the username,
  which is different on the server.
  Then what about passing the username as a parameter to
  JobSubmissionProtocol.submitJob(...)? This avoids loading the
  whole JobConf clientside just to set the username.
 
  That sounds like a reasonable change to me.
 
  Why not move it into the JobSubmissionProtocol (the JobTracker's
  submitJob method)?
 
  These could probably run on the server.  They're currently run on
  the client in an attempt to return errors as quickly as possible
  when jobs are misconfigured.
  Is it really quicker to run all those checks remotely from the client
  than to ask the JobTracker to run them locally? (Just a
  question, I really have no idea of the answer.)
 
  We'd need to be careful that this is not a synchronized method on
  the server, so it doesn't interfere with other server activities.
  Also, checking the input and output has to be much faster than the
  RPC timeout, which it should be, since this just checks for the
  existence of directories, not of individual files.
 
  Doug