Problems running TestDFSIO to a non-default directory

2008-11-25 Thread Joel Welling
Hi Konstantin (et al.);
  A while ago you gave me the following trick to run TestDFSIO to an
output directory other than the default- just use
-Dtest.build.data=/output/dir to pass the new directory to the
executable.  I recall this working, but it is failing now under 0.18.1,
and looking at it I can't see how it ever worked.  The -D option will
set the property on the Java virtual machine which runs as a direct
child of /bin/hadoop, but I see no way the property would get set on the
mapper virtual machines.  Should this still work?  

Thanks,
-Joel
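
[Editor's note] One way to check whether a -D setting actually reaches the
task JVMs is to read it back from the JobConf inside a mapper.  Only
properties that make it into the serialized job.xml (for example, -D options
handled by GenericOptionsParser/ToolRunner) show up there; a -D that merely
sets a system property on the submitting JVM does not.  A minimal old-API
sketch, with a made-up class name:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

// Hypothetical probe mapper: reports the value of test.build.data as seen
// by the task JVM, so you can tell whether the setting was serialized into
// job.xml or only applied to the client-side JVM.
public class ConfProbeMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String testBuildData = "<unset>";

  public void configure(JobConf job) {
    testBuildData = job.get("test.build.data", "<unset>");
    System.err.println("test.build.data seen by task = " + testBuildData);
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    output.collect(new Text("test.build.data"), new Text(testBuildData));
  }
}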

On Thu, 2008-09-04 at 13:05 -0700, Konstantin Shvachko wrote:
> Sure.
> 
> bin/hadoop -Dtest.build.data=/bessemer/welling/hadoop_test/benchmarks/TestDFSIO/ org.apache.hadoop.fs.TestDFSIO -write -nrFiles 2*N -fileSize 360
> 
> --Konst
> 
> Joel Welling wrote:
> > With my setup, I need to change the file directory
> > from /benchmarks/TestDFSIO/io_control to something
> > like /bessemer/welling/hadoop_test/benchmarks/TestDFSIO/io_control .  Is
> > there a command line argument or parameter that will do this?
> > Basically, I have to point it explicitly into my Lustre filesystem.
> > 
> > -Joel
> > 



How to run sort900?

2008-11-19 Thread Joel Welling
Hi folks;
  Is there a standard procedure for running the sort900 test?  In
particular, the timings show time for a verifier, but I can't find where
that's implemented.

Thanks,
-Joel
 [EMAIL PROTECTED]



Re: reading input for a map function from 2 different files?

2008-11-12 Thread Joel Welling
Amar, isn't there a problem with your method in that it gets a small
result by subtracting very large numbers?  Given a million inputs, won't
A and B be so much larger than the standard deviation that there aren't
enough bits left in the floating point number to represent it?

I just thought I should mention that, before this thread goes in an
archive somewhere and some student looks it up.

-Joel
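
[Editor's note] Joel's concern is the classic cancellation problem with the
"sum of squares minus N times mean squared" formula.  For the archive, the
usual workaround is to accumulate squared deviations from a running mean
(Welford's update) rather than subtracting two large sums.  A rough sketch,
not tied to any Hadoop API:

// Welford's online update: numerically safer than sum(Xi^2) - N*avg^2
// when the mean is large compared to the spread of the data.
public class RunningStats {
  private long n = 0;
  private double mean = 0.0;
  private double m2 = 0.0;   // running sum of squared deviations from the mean

  public void add(double x) {
    n++;
    double delta = x - mean;
    mean += delta / n;
    m2 += delta * (x - mean);
  }

  public double getMean()     { return mean; }
  public double getVariance() { return (n > 0) ? m2 / n : 0.0; }  // population variance
  public double getStdDev()   { return Math.sqrt(getVariance()); }
}

Partial (n, mean, m2) triples from separate map tasks can also be merged
into a single result, so this still fits a one-pass MapReduce job.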

On Wed, 2008-11-12 at 12:32 +0530, Amar Kamat wrote:
> some speed wrote:
> > Thanks for the response. What I am trying is to do is finding the average
> > and then the standard deviation for a very large set (say a million) of
> > numbers. The result would be used in further calculations.
> > I have got the average from the first map-reduce chain. now i need to read
> > this average as well as the set of numbers to calculate the standard
> > deviation.  so one file would have the input set and the other "resultant"
> > file would have just the average.
> > Please do tell me in case there is a better way of doing things than what i
> > am doing. Any input/suggestion is appreciated.:)
> >
> >   
> std_dev^2 = sum_i((Xi - Xa) ^ 2) / N; where Xa is the avg.
> Why don't you use the formula to compute it in one MR job?
> std_dev^2 = (sum_i(Xi ^ 2)  - N * (Xa ^ 2) ) / N;
>  = (A - N*(avg^2))/N
> 
> For this your map would look like
>map (key, val) : output.collect(key^2, key); // imagine your input as 
> (k,v) = (Xi, null)
> Reduce should simply sum over the keys to find out 'sum_i(Xi ^ 2)' and 
> sum over the values to find out 'Xa'. You could use the close() api to 
> finally dump these 2 values to a file.
> 
> For example :
> input : 1,2,3,4
> Say input is split in 2 groups [1,2] and [3,4]
> Now there will be 2 maps with output as follows
> map1 output : (1,1) (4,2)
> map2 output : (9,3) (16,4)
> 
> Reducer will maintain the sum over all keys and all values
> A = sum(key i.e  input squared) = 1+ 4 + 9 + 16 = 30
> B = sum(values i.e input) = 1 + 2 + 3 + 4 = 10
> 
> With A and B you can compute the standard deviation offline.
> So avg = B / N = 10/4 = 2.5
> Hence the std deviation would be
> sqrt( (A - N * avg^2) / N) = sqrt ((30 - 4*6.25)/4) = 1.11803399
> 
> Using the main formula the answer is also 1.11803399
> Amar
> >
> > On Mon, Nov 10, 2008 at 4:22 AM, Amar Kamat <[EMAIL PROTECTED]> wrote:
> >
> >   
> >> Amar Kamat wrote:
> >>
> >> 
> >>> some speed wrote:
> >>>
> >>>   
>  I was wondering if it was possible to read the input for a map function
>  from
>  2 different files:
>   1st file ---> user-input file from a particular location(path)
> 
>  
> >>> Is the input/user file sorted? If yes then you can use "map-side join" for
> >>>   
> >> performance reasons. See org.apache.hadoop.mapred.join for more details.
> >>
> >> 
> >>> 2nd file ---> A resultant file (has just one <key, value> pair) from a
> >>>   
>  previous MapReduce job. (I am implementing a chain MapReduce function)
> 
>  
> >>> Can you explain in more detail the contents of 2nd file?
> >>>   
>  Now, for every <key, value> pair in the user-input file, I would like to
>  use
>  the same <key, value> pair from the 2nd file for some calculations.
> 
>  
> >>> Can you explain this in more detail? Can you give some abstracted example
> >>>   
> >> of how file1 and file2 look like and what operation/processing you want to
> >> do?
> >>
> >>
> >> 
> >>> I guess you might need to do some kind of join on the 2 files. Look at
> >>> contrib/data_join for more details.
> >>> Amar
> >>>
> >>>   
>  Is it possible for me to do so? Can someone guide me in the right
>  direction
>  please?
> 
> 
>  Thanks!
> 
> 
> 
>  
> >>>   
> >
> >   



Is there a unique ID associated with each task?

2008-10-30 Thread Joel Welling
Hi folks;
  I'm writing a Hadoop Pipes application, and I need to generate a bunch
of integers that are unique across all map tasks.  If each map task has
a unique integer ID, I can make sure my integers are unique by including
that integer ID.  I have this theory that each map task has a unique
identifier associated with some configuration parameter, but I don't
know the name of that parameter.
  Is there an integer associated with each task?  If so, how do I get
it?  While we're at it, is there a way to get the total number of map
tasks?

Thanks,
-Joel
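
[Editor's note] As far as I know, the framework sets several per-task
configuration properties that a task can read back from its JobConf, and a
Pipes task should see the same values through the JobConf it is handed.  The
property names below are taken from the 0.18-era mapred documentation, so
double-check them against your release.  A small old-API sketch:

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

// Sketch of reading per-task identifiers inside a task.
public class TaskIdentity extends MapReduceBase {

  private int partition;   // index of this map task within the job
  private int numMaps;     // total number of map tasks

  public void configure(JobConf job) {
    String attemptId = job.get("mapred.task.id");          // e.g. attempt_..._m_000003_0
    partition = job.getInt("mapred.task.partition", -1);   // 0 .. numMaps-1 for map tasks
    numMaps   = job.getInt("mapred.map.tasks", -1);
    System.err.println("attempt=" + attemptId
        + " partition=" + partition + " of " + numMaps + " maps");
  }

  // A globally unique id per emitted record could then be built as, say,
  //   unique = (long) partition * MAX_RECORDS_PER_TASK + localCounter++;
  // for some MAX_RECORDS_PER_TASK of your choosing (hypothetical scheme).
}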



Any examples using Hadoop Pipes with binary SequenceFiles?

2008-10-29 Thread Joel Welling
Hi folks;
  I'm interested in reading binary data, running it through some C++
code, and writing the result as binary data.  It looks like
SequenceFiles and Pipes are the way to do it, but I can't find any
examples or docs beyond the API specification.  Can someone point me to
an example where this is done?

Thanks,
-Joel
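
[Editor's note] Not a Pipes example as such, but for reference here is a
plain Java sketch of writing and then reading binary records in a
SequenceFile of BytesWritable, which is the file format a Pipes job would
consume or produce.  The path and record contents are made up.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;

public class BinarySeqFileDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path("seqfile-demo/part-00000");   // illustrative path

    // Write a few binary records as BytesWritable values.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, path, NullWritable.class, BytesWritable.class);
    try {
      for (int i = 0; i < 3; i++) {
        byte[] record = new byte[] { (byte) i, 0x02, 0x03, 0x04 };
        writer.append(NullWritable.get(), new BytesWritable(record));
      }
    } finally {
      writer.close();
    }

    // Read the records back and count them.
    SequenceFile.Reader reader = new SequenceFile.Reader(fs, path, conf);
    try {
      NullWritable key = NullWritable.get();
      BytesWritable value = new BytesWritable();
      int n = 0;
      while (reader.next(key, value)) {
        n++;
      }
      System.out.println("read " + n + " binary records");
    } finally {
      reader.close();
    }
  }
}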



Re: Problems increasing number of tasks per node- really a task management problem!

2008-09-23 Thread Joel Welling
I think I've found my problem.  At some point about a week ago, I must
have tried to start new tasktracker processes on my worker nodes without
killing the ones that were already there.  The new processes died
immediately because their sockets were already in use.  The old
processes then took over their roles, running happily with new
JobTrackers and doing tasks as requested.  The pid files that are
supposed to point to the tasktrackers did not contain their pids,
however, and 'bin/stop-mapred.sh' chooses its targets from the pid
files.  So I could do 'bin/stop-mapred.sh' all day long without killing
them.  I ended up killing them explicitly one node at a time.

These tasktrackers knew the *old* config values that were in force when
they were started, so pushing the new values out to the worker nodes had
no effect.  

So.  Is there any mechanism for killing 'rogue' tasktrackers?  I'm a
little surprised that they are killed via their pids rather than by
sending them a kill signal via the same mechanism whereby they learn of
new work.

-Joel
 [EMAIL PROTECTED]

On Tue, 2008-09-23 at 14:29 -0700, Arun C Murthy wrote:
> On Sep 23, 2008, at 2:21 PM, Joel Welling wrote:
> 
> > Stopping and restarting the mapred service should push the new .xml file
> > out, should it not?  I've done 'bin/stop-mapred.sh',
> 
> No, you need to run 'bin/stop-mapred.sh', push it out to all the
> machines and then do 'bin/start-mapred.sh'.
> 
> You do see it in your job's config - but that config isn't used by the  
> TaskTrackers. They use the config in their HADOOP_CONF_DIR; which is  
> why you'd need to push it to all machines.
> 
> Arun



Re: Problems increasing number of tasks per node

2008-09-23 Thread Joel Welling
Stopping and restarting the mapred service should push the new .xml file
out, should it not?  I've done 'bin/stop-mapred.sh',
'bin/start-mapred.sh', and I can see my new values in the
file:.../mapred/system/job_SomeNumber_SomeNumber/job.xml files
associated with the jobs.  The mapred.tasktracker.map.tasks.maximum
values shown in those files are 8, but each worker node tasktracker
still uses the value 2.  What file should contain the xml for the
tasktracker itself?  Does the maximum map task number get set when the
task tracker is spawned, or can a new job reset the number?

Thanks,
-Joel

On Tue, 2008-09-23 at 11:46 -0700, Arun C Murthy wrote:
> On Sep 23, 2008, at 11:41 AM, Joel Welling wrote:
> 
> > Hi folks;
> >  I have a small cluster, but each node is big- 8 cores each, with lots
> > of IO bandwidth.  I'd like to increase the number of simultaneous map
> > and reduce tasks scheduled per node from the default of 2 to something
> > like 8.
> >  My understanding is that I should be able to do this by increasing
> > mapred.tasktracker.reduce.tasks.maximum and
> > mapred.tasktracker.map.tasks.maximum , but doing so does not increase
> > the number of tasks.  I've been running gridmix with these parameters
> > set to 4, but the average number of tasks per node stays at 4, with 2
> > reduce and 2 map.
> >  Am I missing something?  Do I need to adjust something else as well?
> >
> 
> Please ensure that _all_ machines (tasktrackers) have this updated  
> configuration file... the above config knobs are used by the  
> TaskTrackers and hence they need to have the updated configs.
> 
> Arun
> 
> > Thanks,
> > -Joel
> > [EMAIL PROTECTED]
> >



Problems increasing number of tasks per node

2008-09-23 Thread Joel Welling
Hi folks;
  I have a small cluster, but each node is big- 8 cores each, with lots
of IO bandwidth.  I'd like to increase the number of simultaneous map
and reduce tasks scheduled per node from the default of 2 to something
like 8.
  My understanding is that I should be able to do this by increasing
mapred.tasktracker.reduce.tasks.maximum and
mapred.tasktracker.map.tasks.maximum , but doing so does not increase
the number of tasks.  I've been running gridmix with these parameters
set to 4, but the average number of tasks per node stays at 4, with 2
reduce and 2 map.  
  Am I missing something?  Do I need to adjust something else as well?

Thanks,
-Joel
 [EMAIL PROTECTED]



gridmix on a small cluster?

2008-09-17 Thread Joel Welling
Hi folks;
  I'd like to try the gridmix benchmark on my small cluster (3 nodes at
8 cores each, Lustre with IB interconnect).  The documentation for
gridmix suggests that it will take 4 hours on a 500 node cluster, which
suggests it would take me something like a week to run.  Is there a way
to scale the problem size back?  I don't mind the file size too much,
but the running time would be excessive if things scale linearly with
the number of nodes.

Thanks,
-Joel



Ordering of records in output files?

2008-09-10 Thread Joel Welling
Hi folks;
  I have a simple Streaming job where the mapper produces output records
beginning with a 16 character ascii string and passes them to
IdentityReducer.  When I run it, I get the same number of output files
as I have mapred.reduce.tasks .  Each one contains some of the strings,
and within each file the strings are in sorted order.
  But there is no obvious ordering *across* the files.  For example, I
can see where the first few strings in the output went to files 0,1,3,4,
and then back to 0, but none of them ended up in file 2.
  What's the algorithm that determines which strings end up in which
files?  Is there a way I can change it so that sequentially ordered
strings end up in the same file rather than spraying off across all the
files?

Thanks,
-Joel
 [EMAIL PROTECTED]
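
[Editor's note] The default is HashPartitioner, which sends each record to
reduce number hash(key) mod numReduceTasks; that is why sorted keys spray
across the output files even though each file is internally sorted.  To get
contiguous key ranges per output file you need a partitioner that maps key
ranges to reduce indices (with streaming, supplied via the -partitioner
option).  A minimal, made-up sketch against the old API:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Partitioner;

// Illustrative range partitioner: reduce i receives one contiguous range of
// keys, so output files are ordered across files as well as within them.
// The split points are invented; in practice you would sample your keys to
// choose them (or use a total-order partitioner if your release ships one).
public class RangePartitioner implements Partitioner<Text, Text> {

  private static final String[] SPLIT_POINTS = { "4", "8", "c" };  // hypothetical

  public void configure(JobConf job) { }

  public int getPartition(Text key, Text value, int numPartitions) {
    String k = key.toString();
    int p = 0;
    while (p < SPLIT_POINTS.length
        && p < numPartitions - 1
        && k.compareTo(SPLIT_POINTS[p]) >= 0) {
      p++;
    }
    return p;
  }
}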



Trouble doing two-step sort

2008-09-03 Thread Joel Welling
Hi folks;
  I'm trying to do a Hadoop streaming job which involves a two-step
sort, of the type described at
http://hadoop.apache.org/core/docs/r0.18.0/streaming.html#A+Useful+Partitioner+Class+%28secondary+sort%2C+the+-partitioner+org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner+option%29 .
I've got a python script that emits records like:

020700414.0140. 140 12 132475

Over a whole bunch of such records I expect to find about a dozen with
each first number (020700414 in this case), with consecutive values of
the second number (0140 in this case).  I'm using IdentityReducer and
KeyFieldBasedPartitioner.  I'd like to partition them based only on the
first number, because I definitely want all records with the same first
number to end up in the same output file.  Then I'd like those to be
sorted on the second number, so each output file contains a set of
ordered records for each first number, and all records for a given first
number end up in the same file.

I've tried setting these values:

stream.map.output.field.separator=.
map.output.key.field.separator=.
stream.num.map.output.key.fields=2
num.key.fields.for.partition=1

but I end up with each first number evenly distributed over all the
output files!  In each, the records for a given first number appear
together and the second numbers are in the right order.  This isn't the
result I wanted.

Am I misunderstanding this example somehow?  What settings should give
me the expected output order?  

Thanks, I hope;
-Joel Welling
 [EMAIL PROTECTED]
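
[Editor's note] One thing worth double-checking with this setup is that the
partitioner is actually applied to the job, i.e. that
-partitioner org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner appears on
the streaming command line; without it the default HashPartitioner hashes the
full two-field key, which produces exactly this spraying of each first number
across all the output files.  For reference, a sketch of the equivalent
Java-side configuration (property names as given in the 0.18 streaming docs;
the class name is illustrative):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.KeyFieldBasedPartitioner;

public class SecondarySortConf {
  // Configure a job so the map output key is the first two dot-separated
  // fields, partitioning uses only the first field, and the framework's
  // sort then orders records by both fields within each partition.
  public static JobConf configure(JobConf conf) {
    conf.set("stream.map.output.field.separator", ".");
    conf.setInt("stream.num.map.output.key.fields", 2);
    conf.set("map.output.key.field.separator", ".");
    conf.setInt("num.key.fields.for.partition", 1);
    conf.setPartitionerClass(KeyFieldBasedPartitioner.class);
    conf.setReducerClass(IdentityReducer.class);
    return conf;
  }
}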



Re: Hadoop over Lustre?

2008-08-29 Thread Joel Welling
That seems to have done the trick!  I am now running Hadoop 0.18
straight out of Lustre, without an intervening HDFS.  The unusual things
about my hadoop-site.xml are:

<property>
  <name>fs.default.name</name>
  <value>file:///bessemer/welling</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>${fs.default.name}/hadoop_tmp/mapred/system</value>
  <description>The shared directory where MapReduce stores control
files.
  </description>
</property>

where /bessemer/welling is a directory on a mounted Lustre filesystem.
I then do 'bin/start-mapred.sh' (without starting dfs), and I can run
Hadoop programs normally.  I do have to specify full input and output
file paths- they don't seem to be relative to fs.default.name .  That's
not too troublesome, though.

Thanks very much!  
-Joel
 [EMAIL PROTECTED]

On Fri, 2008-08-29 at 10:52 -0700, Owen O'Malley wrote:
> Check the setting for mapred.system.dir. This needs to be a path that is on
> a distributed file system. In old versions of Hadoop, it had to be on the
> default file system, but that is no longer true. In recent versions, the
> system dir only needs to be configured on the JobTracker and it is passed to
> the TaskTrackers and clients.



Re: Hadoop over Lustre?

2008-08-29 Thread Joel Welling
Sorry; I'm picking this thread up after a couple of days' delay.  Setting
fs.default.name to the equivalent of file:///path/to/lustre and changing
mapred.job.tracker to just a hostname and port does allow mapreduce to
start up.  However, test jobs fail with the exceptions below.  It looks
like TaskTracker.localizeJob is looking for job.xml in the local
filesystem; I would have expected it to look in Lustre.

I can't find that particular job.xml anywhere on the system after the
run aborts, I'm afraid.  I guess it's getting cleaned up.

Thanks,
-Joel

08/08/28 18:46:07 INFO mapred.FileInputFormat: Total input paths to process : 15
08/08/28 18:46:07 INFO mapred.FileInputFormat: Total input paths to process : 15
08/08/28 18:46:08 INFO mapred.JobClient: Running job: job_200808281828_0002
08/08/28 18:46:09 INFO mapred.JobClient:  map 0% reduce 0%
08/08/28 18:46:12 INFO mapred.JobClient: Task Id :
attempt_200808281828_0002_m_00_0, Status : FAILED
Error initializing attempt_200808281828_0002_m_00_0:
java.io.IOException: file:/tmp/hadoop-welling/mapred/system/job_200808281828_0002/job.xml: No such file or directory
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:216)
at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:150)
at org.apache.hadoop.fs.LocalFileSystem.copyToLocalFile(LocalFileSystem.java:55)
at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1193)
at org.apache.hadoop.mapred.TaskTracker.localizeJob(TaskTracker.java:668)
at org.apache.hadoop.mapred.TaskTracker.startNewTask(TaskTracker.java:1306)
at org.apache.hadoop.mapred.TaskTracker.offerService(TaskTracker.java:946)
at org.apache.hadoop.mapred.TaskTracker.run(TaskTracker.java:1343)
at org.apache.hadoop.mapred.TaskTracker.main(TaskTracker.java:2354)

08/08/28 18:46:12 WARN mapred.JobClient: Error reading task
outputhttp://foo.psc.edu:50060/tasklog?plaintext=true&taskid=attempt_200808281828_0002_m_00_0&filter=stdout
08/08/28 18:46:12 WARN mapred.JobClient: Error reading task
outputhttp://foo.psc.edu:50060/tasklog?plaintext=true&taskid=attempt_200808281828_0002_m_00_0&filter=stderr



On Mon, 2008-08-25 at 14:24 -0700, Konstantin Shvachko wrote:
> mapred.job.tracker is the address and port of the JobTracker - the main 
> server that controls map-reduce jobs.
> Every task tracker needs to know the address in order to connect.
> Do you follow the docs, e.g. that one
> http://wiki.apache.org/hadoop/GettingStartedWithHadoop
> 
> Can you start one node cluster?
> 
>  > Are there standard tests of hadoop performance?
> 
> There is the sort benchmark. We also run DFSIO benchmark for read and write 
> throughputs.
> 
> --Konstantin
> 
> Joel Welling wrote:
> > So far no success, Konstantin- the hadoop job seems to start up, but
> > fails immediately leaving no logs.  What is the appropriate setting for
> > mapred.job.tracker ?  The generic value references hdfs, but it also has
> > a port number- I'm not sure what that means.
> > 
> > My cluster is small, but if I get this working I'd be very happy to run
> > some benchmarks.  Are there standard tests of hadoop performance?
> > 
> > -Joel
> >  [EMAIL PROTECTED]
> > 
> > On Fri, 2008-08-22 at 15:59 -0700, Konstantin Shvachko wrote:
> >> I think the solution should be easier than Arun and Steve advise.
> >> Lustre is already mounted as a local directory on each cluster machine, 
> >> right?
> >> Say, it is mounted on /mnt/lustre.
> >> Then you configure hadoop-site.xml and set
> >> <property>
> >>   <name>fs.default.name</name>
> >>   <value>file:///mnt/lustre</value>
> >> </property>
> >> And then you start map-reduce only without hdfs using start-mapred.sh
> >>
> >> By this you basically redirect all FileSystem requests to Lustre and you 
> >> don't need
> >> data-nodes or the name-node.
> >>
> >> Please let me know if that works.
> >>
> >> Also it would be very interesting to have your experience shared on this list.
> >> Problems, performance - everything is quite interesting.
> >>
> >> Cheers,
> >> --Konstantin
> >>
> >> Joel Welling wrote:
> >>>> 2. Could you set up symlinks from the local filesystem, so point every 
> >>>> node at a local dir
> >>>>   /tmp/hadoop
> >>>> with each node pointing to a different subdir in the big filesystem?
> >>> Yes, I could do that!  Do I need to do it for the log directories as
> >>> well, or can they be shared?
> >>>
> >>> -Joel
> >>>
> >>> On Fri, 2008-08-22 at 15:48 +0100, Steve Loughran wrote:

Re: Hadoop over Lustre?

2008-08-23 Thread Joel Welling
So far no success, Konstantin- the hadoop job seems to start up, but
fails immediately leaving no logs.  What is the appropriate setting for
mapred.job.tracker ?  The generic value references hdfs, but it also has
a port number- I'm not sure what that means.

My cluster is small, but if I get this working I'd be very happy to run
some benchmarks.  Are there standard tests of hadoop performance?

-Joel
 [EMAIL PROTECTED]

On Fri, 2008-08-22 at 15:59 -0700, Konstantin Shvachko wrote:
> I think the solution should be easier than Arun and Steve advise.
> Lustre is already mounted as a local directory on each cluster machine, 
> right?
> Say, it is mounted on /mnt/lustre.
> Then you configure hadoop-site.xml and set
> <property>
>   <name>fs.default.name</name>
>   <value>file:///mnt/lustre</value>
> </property>
> And then you start map-reduce only without hdfs using start-mapred.sh
> 
> By this you basically redirect all FileSystem requests to Lustre and you 
> don't need
> data-nodes or the name-node.
> 
> Please let me know if that works.
> 
> Also it would be very interesting to have your experience shared on this list.
> Problems, performance - everything is quite interesting.
> 
> Cheers,
> --Konstantin
> 
> Joel Welling wrote:
> >> 2. Could you set up symlinks from the local filesystem, so point every 
> >> node at a local dir
> >>   /tmp/hadoop
> >> with each node pointing to a different subdir in the big filesystem?
> > 
> > Yes, I could do that!  Do I need to do it for the log directories as
> > well, or can they be shared?
> > 
> > -Joel
> > 
> > On Fri, 2008-08-22 at 15:48 +0100, Steve Loughran wrote:
> >> Joel Welling wrote:
> >>> Thanks, Steve and Arun.  I'll definitely try to write something based on
> >>> the KFS interface.  I think that for our applications putting the mapper
> >>> on the right rack is not going to be that useful.  A lot of our
> >>> calculations are going to be disordered stuff based on 3D spatial
> >>> relationships like nearest-neighbor finding, so things will be in a
> >>> random access pattern most of the time.
> >>>
> >>> Is there a way to set up the configuration for HDFS so that different
> >>> datanodes keep their data in different directories?  That would be a big
> >>> help in the short term.
> >> yes, but you'd have to push out a different config to each datanode.
> >>
> >> 1. I have some stuff that could help there, but it's not ready for 
> >> production use yet [1].
> >>
> >> 2. Could you set up symlinks from the local filesystem, so point every 
> >> node at a local dir
> >>   /tmp/hadoop
> >> with each node pointing to a different subdir in the big filesystem?
> >>
> >>
> >> [1] 
> >> http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf
> > 
> > 



Re: Hadoop over Lustre?

2008-08-22 Thread Joel Welling
> 2. Could you set up symlinks from the local filesystem, so point every 
> node at a local dir
>   /tmp/hadoop
> with each node pointing to a different subdir in the big filesystem?

Yes, I could do that!  Do I need to do it for the log directories as
well, or can they be shared?

-Joel

On Fri, 2008-08-22 at 15:48 +0100, Steve Loughran wrote:
> Joel Welling wrote:
> > Thanks, Steve and Arun.  I'll definitely try to write something based on
> > the KFS interface.  I think that for our applications putting the mapper
> > on the right rack is not going to be that useful.  A lot of our
> > calculations are going to be disordered stuff based on 3D spatial
> > relationships like nearest-neighbor finding, so things will be in a
> > random access pattern most of the time.
> > 
> > Is there a way to set up the configuration for HDFS so that different
> > datanodes keep their data in different directories?  That would be a big
> > help in the short term.
> 
> yes, but you'd have to push out a different config to each datanode.
> 
> 1. I have some stuff that could help there, but it's not ready for 
> production use yet [1].
> 
> 2. Could you set up symlinks from the local filesystem, so point every 
> node at a local dir
>   /tmp/hadoop
> with each node pointing to a different subdir in the big filesystem?
> 
> 
> [1] 
> http://people.apache.org/~stevel/slides/deploying_hadoop_with_smartfrog.pdf



Re: Hadoop over Lustre?

2008-08-22 Thread Joel Welling
Thanks, Steve and Arun.  I'll definitely try to write something based on
the KFS interface.  I think that for our applications putting the mapper
on the right rack is not going to be that useful.  A lot of our
calculations are going to be disordered stuff based on 3D spatial
relationships like nearest-neighbor finding, so things will be in a
random access pattern most of the time.

Is there a way to set up the configuration for HDFS so that different
datanodes keep their data in different directories?  That would be a big
help in the short term.

-Joel
 [EMAIL PROTECTED]

On Fri, 2008-08-22 at 10:48 +0100, Steve Loughran wrote:
> Joel Welling wrote:
> > Hi folks;
> >   I'm new to Hadoop, and I'm trying to set it up on a cluster for which
> > almost all the disk is mounted via the Lustre filesystem.  That
> > filesystem is visible to all the nodes, so I don't actually need HDFS to
> > implement a shared filesystem.  (I know the philosophical reasons why
> > people say local disks are better for Hadoop, but that's not the
> > situation I've got).  My system is failing, and I think it's because the
> > different nodes are tripping over each other when they try to run HDFS
> > out of the same directory tree.
> >   Is there a way to turn off HDFS and just let Lustre do the distributed
> > filesystem?  I've seen discussion threads about Hadoop with NFS which
> > said something like 'just specify a local filesystem and everything will
> > be fine', but I don't know how to do that.  I'm using Hadoop 0.17.2.
> > 
> 
> I don't know enough about Lustre to be very useful
> 
> * You shouldn't have nodes trying to use the same directories. At the 
> very least, point each datanode at a different bit of the filesystem.
> 
> * If there is a specific API call to find out which rack has the data, 
> that could be used to place work near the data. Someone (==you) would 
> have to write a new filesystem back-end for Hadoop for this.
> 
> -steve



Hadoop over Lustre?

2008-08-21 Thread Joel Welling
Hi folks;
  I'm new to Hadoop, and I'm trying to set it up on a cluster for which
almost all the disk is mounted via the Lustre filesystem.  That
filesystem is visible to all the nodes, so I don't actually need HDFS to
implement a shared filesystem.  (I know the philosophical reasons why
people say local disks are better for Hadoop, but that's not the
situation I've got).  My system is failing, and I think it's because the
different nodes are tripping over each other when they try to run HDFS
out of the same directory tree.
  Is there a way to turn off HDFS and just let Lustre do the distributed
filesystem?  I've seen discussion threads about Hadoop with NFS which
said something like 'just specify a local filesystem and everything will
be fine', but I don't know how to do that.  I'm using Hadoop 0.17.2.

Thanks, I hope;
-Joel Welling
 [EMAIL PROTECTED]



Problem with installation of 0.17.0: things start but tests fail

2008-08-13 Thread Joel Welling
I'm new to Hadoop, and am doing an installation of 0.17.1 on a small
system.  The only thing that I know is unusual about the system is that
the underlying filesystem is in fact shared- a single Lustre filesystem
is visible to all the nodes, and individual nodes all seem to be storing
their blocks in the same directory, under /current .

I follow the installation instructions on
http://hadoop.apache.org/core/docs/current/cluster_setup.html , and
everything seems to start up just fine.  I then try the test problem
from http://hadoop.apache.org/core/docs/current/quickstart.html with:

bin/hadoop dfs -put conf input
bin/hadoop jar hadoop-*-examples.jar grep input output 'dfs[a-z.]+'

This results in:

08/08/13 19:18:47 INFO mapred.FileInputFormat: Total input paths to process : 13
08/08/13 19:18:47 INFO mapred.JobClient: Running job: job_200808131916_0001
08/08/13 19:18:49 INFO mapred.JobClient:  map 0% reduce 0%
08/08/13 19:18:51 INFO mapred.JobClient:  map 14% reduce 0%
08/08/13 19:19:05 INFO mapred.JobClient:  map 14% reduce 4%
08/08/13 19:19:05 INFO mapred.JobClient: Task Id :
task_200808131916_0001_m_02_0, Status : FAILED
java.io.IOException: Could not obtain block: blk_-4001091320210055678
file=/user/welling/input/log4j.properties
at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1430)
...

and similar errors on a bunch of other files.  When I look at the status
of a particular file, like log4j.properties, I see:

java.io.IOException: Got error in response to OP_READ_BLOCK for file
muir03.psc.edu/128.182.99.103:50010:-4001091320210055678 for block
-4001091320210055678

However, I do see a block by that name in /current !  Is
this some sort of configuration issue?  Do I need to somehow arrange for
the different datanodes to keep their files in separate directories?

Thanks, I hope,
-Joel Welling
 [EMAIL PROTECTED]