Re: best practice: mapred.local vs dfs drives

2009-04-06 Thread Craig Macdonald

Thanks for the heads-up.

C

Owen O'Malley wrote:

We always share the drives.

-- Owen

On Apr 5, 2009, at 0:52, zsongbo zson...@gmail.com wrote:

I usually set mapred.local.dir to share disk space with DFS, since some
MapReduce jobs need a lot of temporary space.



On Fri, Apr 3, 2009 at 8:36 PM, Craig Macdonald 
cra...@dcs.gla.ac.uk wrote:



Hello all,

Following recent hardware discussions, I thought I'd ask a related
question. Our cluster nodes have 3 drives: 1x 160GB system/scratch and 2x
500GB DFS drives.

The 160GB system drive is partitioned such that 100GB is for job
mapred.local space. However, we find that for our application, free
mapred.local space for map output is the limiting parameter on the number
of reducers we can have (our application prefers fewer reducers).

How do people normally arrange DFS vs mapred.local space? Do you (a) share
the DFS drives with the TaskTracker temporary files, or do you (b) keep
them on separate partitions or drives?

We originally went with (b) because it prevented a runaway job from eating
all the DFS space on the machine; however, I'm beginning to realise the
disadvantages.

Any comments?

Thanks

Craig






best practice: mapred.local vs dfs drives

2009-04-03 Thread Craig Macdonald

Hello all,

Following recent hardware discussions, I thought I'd ask a related 
question. Our cluster nodes have 3 drives: 1x 160GB system/scratch and 
2x 500GB DFS drives.


The 160GB system drive is partitioned such that 100GB is for job
mapred.local space. However, we find that for our application, free
mapred.local space for map output is the limiting parameter on the number
of reducers we can have (our application prefers fewer reducers).


How do people normally arrange DFS vs mapred.local space? Do you (a) share
the DFS drives with the TaskTracker temporary files, or do you (b) keep
them on separate partitions or drives?
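For clarity, option (a) amounts to listing the same drives under both
dfs.data.dir and mapred.local.dir. A rough sketch via the Configuration API
(these values would normally live in hadoop-site.xml, and the paths are
hypothetical):

import org.apache.hadoop.conf.Configuration;

public class SharedDriveLayout {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Option (a): DFS blocks and TaskTracker local files share the two
    // 500GB drives; both properties take a comma-separated directory list.
    conf.set("dfs.data.dir", "/disk1/dfs/data,/disk2/dfs/data");
    conf.set("mapred.local.dir", "/disk1/mapred/local,/disk2/mapred/local");
    System.out.println("mapred.local.dir = " + conf.get("mapred.local.dir"));
  }
}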


We originally went with (b) because it prevented a runaway job from eating
all the DFS space on the machine; however, I'm beginning to realise the
disadvantages.


Any comments?

Thanks

Craig



Re: how to mount specification-path of hdfs with fuse-dfs

2009-03-26 Thread Craig Macdonald

Hi Jacky,

Pleased to hear that fuse-dfs is working for you.

Do you mean that you want to mount dfs://localhost:9000/users at /mnt/hdfs ?

If so, fuse-dfs doesn't currently support this, but it would be a good
idea for a future improvement.


Craig

jacky_ji wrote:

I can use fuse-dfs to mount HDFS, like this: ./fuse-dfs
dfs://localhost:9000 /mnt/hdfs -d
But I want to mount a specific path in HDFS now, and I have no idea
about it; any advice will be appreciated.




Re: Can never restart HDFS after a day or two

2009-02-17 Thread Craig Macdonald
Yes, tmpwatch may cause problems by deleting files that haven't been
accessed for a while.


On Fedora/Red Hat, run chmod -x /etc/cron.daily/tmpwatch to disable it completely.

C

Mark Kerzner wrote:

Indeed, this was the right answer, and in the morning the file system was as
fresh as in the evening. Somebody already told me to move out of /tmp, but I
didn't believe him then. Sorry.
Mark

On Tue, Feb 17, 2009 at 7:57 AM, Rasit OZDAS rasitoz...@gmail.com wrote:

  

I agree with Amandeep, and results will remain forever, unless you manually
delete them.

If we are on the right track, changing the hadoop.tmp.dir property to point
outside of /tmp, or changing dfs.name.dir and dfs.data.dir, should be enough
for basic use (I didn't have to change anything else).
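For illustration, the equivalent settings expressed through the
Configuration API (normally they would go in hadoop-site.xml; the paths are
hypothetical):

import org.apache.hadoop.conf.Configuration;

public class NonTmpLayout {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Keep HDFS state out of /tmp so tmpwatch or a reboot cannot remove it.
    conf.set("hadoop.tmp.dir", "/var/hadoop/tmp");
    conf.set("dfs.name.dir", "/var/hadoop/dfs/name");
    conf.set("dfs.data.dir", "/var/hadoop/dfs/data");
    System.out.println("dfs.name.dir = " + conf.get("dfs.name.dir"));
  }
}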

Cheers,
Rasit

2009/2/17 Amandeep Khurana ama...@gmail.com



Where are your namenode and datanode storing the data? By default, it goes
into the /tmp directory. You might want to move that out of there.

Amandeep


Amandeep Khurana
Computer Science Graduate Student
University of California, Santa Cruz


On Mon, Feb 16, 2009 at 8:11 PM, Mark Kerzner markkerz...@gmail.com
wrote:

  

Hi all,

I consistently have this problem that I can run HDFS and restart it after
short breaks of a few hours, but the next day I always have to reformat HDFS
before the daemons begin to work.

Is that normal? Maybe this is treated as temporary data, and the results
need to be copied out of HDFS and not stored for long periods of time? I
verified that the files in /tmp related to hadoop are seemingly intact.

Thank you,
Mark




--
M. Raşit ÖZDAŞ






Re: HDFS Appends in 0.19.0

2009-02-02 Thread Craig Macdonald

Hi Arifa,

The O_APPEND flag is the subject of
https://issues.apache.org/jira/browse/HADOOP-4494
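For comparison, the append operation that HADOOP-1700 introduced is
reachable from the Java FileSystem API; a minimal sketch (the path is
hypothetical, and the cluster must be configured to permit appends):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class AppendSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    // Append to an existing HDFS file (hypothetical path).
    FSDataOutputStream out = fs.append(new Path("/user/test/append.log"));
    out.writeBytes("appended line\n");
    out.close();
  }
}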

Craig

Arifa Nisar wrote:

Hello All,

I am using hadoop 0.19.0, whose release notes include HADOOP-1700,
"Introduced append operation for HDFS files". I am trying to test this new
feature using my test program. I have found that the O_APPEND flag added
in hdfsOpenFile() is ignored by libhdfs. Also, only WRONLY and RDONLY are
defined in hdfs.h. Please let me know how to use the append functionality
in this release.

Thanks,
Arifa.

  




Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Craig Macdonald

Hi Roopa,

I can't comment on the S3 specifics. However, fuse-dfs is based on a C
interface called libhdfs which allows C programs (such as fuse-dfs) to 
connect to the Hadoop file system Java API. This being the case, 
fuse-dfs should (theoretically) be able to connect to any file system 
that Hadoop can. Your mileage may vary, but if you find issues, please 
do report them through the normal channels.


Craig


Roopa Sudheendra wrote:
I am experimenting with Hadoop backed by the Amazon S3 filesystem as one
of our backup storage solutions. Just Hadoop and S3 (block-based, since it
overcomes the 5 GB limit) seems to be fine so far.
My problem is that I want to mount this filesystem using fuse-dfs (since I
then don't have to worry about how the file is written on the system).
Since the namenode does not get started with an S3-backed Hadoop system,
how can I connect fuse-dfs to this setup?


Appreciate your help.
Thanks,
Roopa




Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Craig Macdonald

In theory, yes.
On inspection of libhdfs, which underlies fuse-dfs, I note that:

* libhdfs takes a host and port number as input when connecting, but not a
scheme (hdfs, etc.). The easiest option would be to set S3 as your default
file system in your hadoop-site.xml (sketched below), then use "default" as
the host. That should get libhdfs to use the S3 file system, i.e. set
fuse-dfs to mount dfs://default:0/ and all should work as planned.


* libhdfs also casts the FileSystem to a DistributedFileSystem for the 
df command. This would fail in your case. This issue is currently being 
worked on - see HADOOP-4368

https://issues.apache.org/jira/browse/HADOOP-4368.
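To make the first point concrete, here is roughly what "set S3 as your
default file system" amounts to, shown via the Configuration API (the same
entries would normally go in hadoop-site.xml; the bucket name and
credentials below are placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class S3DefaultFs {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder bucket and credentials; substitute your own.
    conf.set("fs.default.name", "s3://example-backup-bucket");
    conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY_ID");
    conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_ACCESS_KEY");
    // libhdfs (and hence fuse-dfs mounting dfs://default:0/) would then
    // pick up this default file system.
    FileSystem fs = FileSystem.get(conf);
    System.out.println("Default file system: " + fs.getUri());
  }
}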

C


Roopa Sudheendra wrote:

Thanks for the response, Craig.
I looked at the fuse-dfs C code and it looks like it does not like anything
other than dfs://, so given that Hadoop can connect to the S3 file system,
allowing the s3 scheme should solve my problem?


Roopa

On Jan 28, 2009, at 1:03 PM, Craig Macdonald wrote:


Hi Roopa,

I can't comment on the S3 specifics. However, fuse-dfs is based on a C
interface called libhdfs which allows C programs (such as fuse-dfs) 
to connect to the Hadoop file system Java API. This being the case, 
fuse-dfs should (theoretically) be able to connect to any file system 
that Hadoop can. Your mileage may vary, but if you find issues, 
please do report them through the normal channels.


Craig


Roopa Sudheendra wrote:
I am experimenting with Hadoop backed by the Amazon S3 filesystem as one
of our backup storage solutions. Just Hadoop and S3 (block-based, since it
overcomes the 5 GB limit) seems to be fine so far.
My problem is that I want to mount this filesystem using fuse-dfs (since I
then don't have to worry about how the file is written on the system).
Since the namenode does not get started with an S3-backed Hadoop system,
how can I connect fuse-dfs to this setup?


Appreciate your help.
Thanks,
Roopa








Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Craig Macdonald

Hi Roopa,

Firstly, can you get fuse-dfs working for an instance of HDFS?
There is also a debug mode for fuse: enable this by adding -d on the 
command line.


C

Roopa Sudheendra wrote:

Hey Craig,
I tried the way you suggested, but I get this "transport endpoint not
connected" error. Can I see the logs anywhere? I don't see anything in
/var/log/messages either.
It looks like it tries to create the file system in hdfs.c, but I'm not
sure where it fails.

I have the Hadoop home set, so I believe it gets the config info.

Any idea?

Thanks,
Roopa
On Jan 28, 2009, at 1:59 PM, Craig Macdonald wrote:


In theory, yes.
On inspection of libhdfs, which underlies fuse-dfs, I note that:

* libhdfs takes a host and port number as input when connecting, but 
not a scheme (hdfs etc). The easiest option would be to set the S3 as 
your default file system in your hadoop-site.xml, then use the host 
of default. That should get libhdfs to use the S3 file system. i.e. 
set fuse-dfs to mount dfs://default:0/ and all should work as planned.


* libhdfs also casts the FileSystem to a DistributedFileSystem for 
the df command. This would fail in your case. This issue is currently 
being worked on - see HADOOP-4368

https://issues.apache.org/jira/browse/HADOOP-4368.

C


Roopa Sudheendra wrote:

Thanks for the response, Craig.
I looked at the fuse-dfs C code and it looks like it does not like anything
other than dfs://, so given that Hadoop can connect to the S3 file system,
allowing the s3 scheme should solve my problem?


Roopa

On Jan 28, 2009, at 1:03 PM, Craig Macdonald wrote:


Hi Roopa,

I can't comment on the S3 specifics. However, fuse-dfs is based on a
C interface called libhdfs which allows C programs (such as 
fuse-dfs) to connect to the Hadoop file system Java API. This being 
the case, fuse-dfs should (theoretically) be able to connect to any 
file system that Hadoop can. Your mileage may vary, but if you find 
issues, please do report them through the normal channels.


Craig


Roopa Sudheendra wrote:
I am experimenting with Hadoop backed by the Amazon S3 filesystem as one
of our backup storage solutions. Just Hadoop and S3 (block-based, since it
overcomes the 5 GB limit) seems to be fine so far.
My problem is that I want to mount this filesystem using fuse-dfs (since I
then don't have to worry about how the file is written on the system).
Since the namenode does not get started with an S3-backed Hadoop system,
how can I connect fuse-dfs to this setup?


Appreciate your help.
Thanks,
Roopa












Re: Hadoop+s3 fuse-dfs

2009-01-28 Thread Craig Macdonald

Hi Roopa,

Glad it worked :-)

Please file JIRA issues against the fuse-dfs / libhdfs components for
anything that would have made it easier to mount the S3 filesystem.


Craig

Roopa Sudheendra wrote:
Thanks. Yes, a setup with fuse-dfs and HDFS works fine. I think the mount
point was bad for whatever reason and was failing with that error. I
created another mount point for mounting, which resolved the transport
endpoint error.

Also, I had the -d option on my command. :)


Roopa


On Jan 28, 2009, at 6:35 PM, Craig Macdonald wrote:


Hi Roopa,

Firstly, can you get fuse-dfs working for an instance of HDFS?
There is also a debug mode for fuse: enable this by adding -d on the 
command line.


C

Roopa Sudheendra wrote:

Hey Craig,
I tried the way you suggested, but I get this "transport endpoint not
connected" error. Can I see the logs anywhere? I don't see anything in
/var/log/messages either.
It looks like it tries to create the file system in hdfs.c, but I'm not
sure where it fails.

I have the Hadoop home set, so I believe it gets the config info.

Any idea?

Thanks,
Roopa
On Jan 28, 2009, at 1:59 PM, Craig Macdonald wrote:


In theory, yes.
On inspection of libhdfs, which underlies fuse-dfs, I note that:

* libhdfs takes a host and port number as input when connecting, 
but not a scheme (hdfs etc). The easiest option would be to set the 
S3 as your default file system in your hadoop-site.xml, then use 
the host of default. That should get libhdfs to use the S3 file 
system. i.e. set fuse-dfs to mount dfs://default:0/ and all should 
work as planned.


* libhdfs also casts the FileSystem to a DistributedFileSystem for 
the df command. This would fail in your case. This issue is 
currently being worked on - see HADOOP-4368

https://issues.apache.org/jira/browse/HADOOP-4368.

C


Roopa Sudheendra wrote:

Thanks for the response, Craig.
I looked at the fuse-dfs C code and it looks like it does not like anything
other than dfs://, so given that Hadoop can connect to the S3 file system,
allowing the s3 scheme should solve my problem?


Roopa

On Jan 28, 2009, at 1:03 PM, Craig Macdonald wrote:


Hi Roopa,

I can't comment on the S3 specifics. However, fuse-dfs is based on
a C interface called libhdfs which allows C programs (such as 
fuse-dfs) to connect to the Hadoop file system Java API. This 
being the case, fuse-dfs should (theoretically) be able to 
connect to any file system that Hadoop can. Your mileage may 
vary, but if you find issues, please do report them through the 
normal channels.


Craig


Roopa Sudheendra wrote:
I am experimenting with Hadoop backed by the Amazon S3 filesystem as one
of our backup storage solutions. Just Hadoop and S3 (block-based, since it
overcomes the 5 GB limit) seems to be fine so far.
My problem is that I want to mount this filesystem using fuse-dfs (since I
then don't have to worry about how the file is written on the system).
Since the namenode does not get started with an S3-backed Hadoop system,
how can I connect fuse-dfs to this setup?


Appreciate your help.
Thanks,
Roopa
















Re: Hadoop 0.19 over OS X : dfs error

2009-01-24 Thread Craig Macdonald

Hi,

I guess that the java on your PATH is different from the setting of your 
$JAVA_HOME env variable.

Try $JAVA_HOME/bin/java -version?

Also, there is a program called Java Preferences on OS X for changing the
default Java version used.
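If it helps, here is a small stand-alone check (not Hadoop-specific) of
which JVM is actually being picked up and which class-file version it
supports; compile and run it with the same java that launches Hadoop:

public class JavaVersionCheck {
  public static void main(String[] args) {
    // "Bad version number in .class file" means java.class.version here is
    // lower than the version the Hadoop classes were compiled for.
    System.out.println("java.home          = " + System.getProperty("java.home"));
    System.out.println("java.version       = " + System.getProperty("java.version"));
    System.out.println("java.class.version = " + System.getProperty("java.class.version"));
  }
}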


Craig

nitesh bhatia wrote:

Hi
I am trying to set up Hadoop 0.19 on OS X. The current Java version is

java version 1.6.0_07
Java(TM) SE Runtime Environment (build 1.6.0_07-b06-153)
Java HotSpot(TM) 64-Bit Server VM (build 1.6.0_07-b06-57, mixed mode)

When I try to format DFS using the bin/hadoop dfs -format command, I get
the following errors:

nMac:hadoop-0.19.0 Aryan$ bin/hadoop dfs -format
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad
version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)
Exception in thread "main" java.lang.UnsupportedClassVersionError: Bad
version number in .class file
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:675)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:124)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:260)
at java.net.URLClassLoader.access$100(URLClassLoader.java:56)
at java.net.URLClassLoader$1.run(URLClassLoader.java:195)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:188)
at java.lang.ClassLoader.loadClass(ClassLoader.java:316)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:280)
at java.lang.ClassLoader.loadClass(ClassLoader.java:251)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:374)


I am not sure why this error is occurring, since I have the latest Java
version. Can anyone help me out with this?

Thanks
Nitesh

  




FileOutputFormat.getWorkOutputPath and map-to-reduce-only side-effect files

2009-01-22 Thread Craig Macdonald

Hello Hadoop Core,

I have a very brief question: our map tasks create side-effect files in
the directory returned by FileOutputFormat.getWorkOutputPath().

This works fine for producing side-effect files that can then be accessed
by the reducers.
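For context, the pattern looks roughly like this under the old mapred API
(a sketch; the side-file naming is made up):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class SideFileMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private FSDataOutputStream side;

  public void configure(JobConf job) {
    try {
      // Files created under the work output path are promoted to the job
      // output directory only if this task attempt succeeds.
      Path workDir = FileOutputFormat.getWorkOutputPath(job);
      FileSystem fs = workDir.getFileSystem(job);
      side = fs.create(new Path(workDir, "side-" + job.get("mapred.task.id")));
    } catch (IOException e) {
      throw new RuntimeException(e);
    }
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    // Side data is written alongside the normal map output.
    side.writeBytes(value.toString() + "\n");
    output.collect(new Text("key"), value);
  }

  public void close() throws IOException {
    side.close();
  }
}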


However, as these map-generated side-effect files are only of use to the
reducers, it would be nice to have them deleted from the output directory.
We can't delete them in a reducer's close(), as this would prevent them
being accessible to other reduce tasks (speculative or otherwise).


Any suggestions, short of deleting them after the job completes?

Craig


Re: libhdfs append question

2009-01-14 Thread Craig Macdonald

Tamas,

There is a patch attached to the issue, which you should be able to apply
to get O_APPEND.


https://issues.apache.org/jira/browse/HADOOP-4494

Craig

Tamás Szokol wrote:


Hi!


I'm using the latest stable 0.19.0 version of hadoop. I'd like to try the
new append functionality. Is it available from libhdfs? I didn't find it
in the hdfs.h interface nor in the hdfs.c implementation.


I saw the hdfsOpenFile's new flag O_APPEND in: HADOOP-4494
(http://mail-archives.apache.org/mod_mbox/hadoop-core-dev/200810.mbox/%3c270645717.1224722204314.javamail.j...@brutus%3e),
but I still don't find it in the latest release.

Is it available as a patch, or maybe only available in the svn repository?

Could you please give me some pointers how to use the append 
functionality from libhdfs?

Thank you in advance!

Cheers,
Tamas



 





Re: HOD questions

2008-12-19 Thread Craig Macdonald

Hi Hemanth,

While HOD does not do this automatically, please note that since you 
are bringing up a Map/Reduce cluster on the allocated nodes, you can 
submit map/reduce parameters with which to bring up the cluster when 
allocating jobs. The relevant options are 
--gridservice-mapred.server-params (or -M in shorthand). Please 
refer to
http://hadoop.apache.org/core/docs/r0.19.0/hod_user_guide.html#Options+for+Configuring+Hadoop 
for details.
I was aware of this, but the issue is that unless you obtain 
dedicated nodes (as above), this option is not suitable, as it isn't 
set on a per-node basis. I think it would be /fairly/ straightforward
to add to HOD, as I detailed in my initial email, so that it does
the correct thing out of the box.
True, I did assume you obtained dedicated nodes. It has been fairly
simple to operate HOD in this manner, and if I understand correctly, it
would help to solve the requirement you have as well.
I think it's a Maui change (or qos directive) to obtain dedicated nodes -
I'm looking into it presently, but I'm not sure what the exact incantation
is.

-W x=NACCESSPOLICY=SINGLETASK

For mixed job environments [e.g. universities], where users have jobs
which aren't HOD jobs, often using single CPUs, it can mean that a job
with more complicated requirements will hence take longer to reach the
head of the queue.


According to hadoop-default.xml, the number of maps is "Typically set to
a prime several times greater than number of available hosts." Say that we
relax this recommendation to read "Typically set to a NUMBER several times
greater than number of available hosts"; then it should be straightforward
for HOD to set it automatically?
Actually, AFAIK, the number of maps for a job is determined more or 
less exclusively by the M/R framework based on the number of splits. 
I've seen messages on this list before about how the documentation for 
this configuration item is misleading. So, this might actually not 
make a difference at all, whatever is specified.
The reason we were asking is that mapred.map.tasks is provided as the
numSplits hint to the InputFormat when computing input splits.
We were using this number to generate the number of maps. I think it's
just that FileInputFormat doesn't exactly honour the hint, from what I
can see. Pig's InputFormat ignores the hint.
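For reference, this is where the hint arrives in the old mapred API: the
framework calls getSplits(job, job.getNumMapTasks()), i.e. it passes
mapred.map.tasks as numSplits, but FileInputFormat treats it only as a
hint (a sketch):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class HintedInputFormat extends FileInputFormat<LongWritable, Text> {

  public InputSplit[] getSplits(JobConf job, int numSplits) throws IOException {
    // numSplits == job.getNumMapTasks(); FileInputFormat may still return
    // more or fewer splits, depending on block and minimum split sizes.
    return super.getSplits(job, numSplits);
  }

  public RecordReader<LongWritable, Text> getRecordReader(InputSplit split,
      JobConf job, Reporter reporter) throws IOException {
    return new LineRecordReader(job, (FileSplit) split);
  }
}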




Craig


Re: HOD questions

2008-12-18 Thread Craig Macdonald

Hemanth,
snip
Just FYI, at Yahoo! we've set torque to allocate separate nodes for 
the number specified to HOD. In other words, the number corresponds to 
the number of nodes, not processors. This has proved simpler to 
manage. I forget right now, but I think you can make Torque behave 
like this (to not treat processors as individual nodes).
Thanks  - I think it's a Maui directive, either on the job level or 
globally. I'm looking into this currently.
However, on inspection of the JobTracker UI, it tells us that node19 has
"Max Map Tasks" and "Max Reduce Tasks" both set to 2, when node19 should
only be allowed one map task.
While HOD does not do this automatically, please note that since you 
are bringing up a Map/Reduce cluster on the allocated nodes, you can 
submit map/reduce parameters with which to bring up the cluster when 
allocating jobs. The relevant options are 
--gridservice-mapred.server-params (or -M in shorthand). Please refer to
http://hadoop.apache.org/core/docs/r0.19.0/hod_user_guide.html#Options+for+Configuring+Hadoop 
for details.
I was aware of this, but the issue is that unless you obtain dedicated 
nodes (as above), this option is not suitable, as it isn't set on a 
per-node basis. I think it would be /fairly/ straightforward to add to
HOD, as I detailed in my initial email, so that it does the correct
thing out of the box.
(2) In our InputFormat, we use the numSplits to tell us how many map 
tasks the job's files should be split into. However, HOD does not 
override the mapred.map.tasks property (nor the mapred.reduce.tasks), 
while they should be set dependent on the number of available task 
trackers and/or nodes in the HOD session.
Can this not be submitted via the Hadoop job's configuration? Again,
HOD cannot do this automatically currently. But you could use the
hod.client-params to set up a client-side hadoop-site.xml that would
work like this for all jobs submitted to the cluster.
According to hadoop-default.xml, the number of maps is "Typically set to
a prime several times greater than number of available hosts." Say that we
relax this recommendation to read "Typically set to a NUMBER several times
greater than number of available hosts"; then it should be straightforward
for HOD to set it automatically?
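As a job-submission-side illustration of sizing these from the allocated
cluster (a sketch; the multipliers are arbitrary and the input/output
setup is omitted):

import org.apache.hadoop.mapred.ClusterStatus;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SizeJobToCluster {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(SizeJobToCluster.class);
    ClusterStatus cluster = new JobClient(job).getClusterStatus();
    // Ask for several waves of maps per tasktracker and slightly fewer
    // reduces than there are reduce slots in the HOD-allocated cluster.
    job.setNumMapTasks(cluster.getTaskTrackers() * 10);
    job.setNumReduceTasks((int) (cluster.getMaxReduceTasks() * 0.9));
    // ... set input/output formats and paths, then JobClient.runJob(job) ...
  }
}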


Craig


HOD questions

2008-12-17 Thread Craig Macdonald

Hello,

We have two HOD questions:

(1) For our current Torque PBS setup, the number of nodes requested by
HOD (-l nodes=X) corresponds to the number of CPUs allocated; however,
these can be spread across various partially used or empty nodes.
Unfortunately, HOD does not appear to honour the number of processors
actually allocated by Torque PBS to that job.


For example, a current running HOD session can be viewed in qstat as:
104544.trmaster  user parallel HOD   4178 8  ----  288:0 R 01:48
  node29/2+node29/1+node29/0+node17/2+node17/1+node18/2+node18/1
  +node19/1

However, on inspection of the JobTracker UI, it tells us that node19 has
"Max Map Tasks" and "Max Reduce Tasks" both set to 2, when I think that
node19 should only be allowed one map task.


I believe that for each node, HOD should determine (using the information
in the $PBS_NODEFILE) how many CPUs on that node are allocated to the HOD
job, and then set mapred.tasktracker.map.tasks.maximum appropriately on
each node.
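To illustrate the per-node calculation, here is the counting logic in
isolation (illustration only; HOD itself is written in Python, and the
hostname matching may need adjusting for short vs fully-qualified names):

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.net.InetAddress;

public class PbsCpuCount {
  public static void main(String[] args) throws IOException {
    // $PBS_NODEFILE lists one line per allocated CPU (vnode).
    String host = InetAddress.getLocalHost().getHostName();
    int cpus = 0;
    BufferedReader in =
        new BufferedReader(new FileReader(System.getenv("PBS_NODEFILE")));
    for (String line; (line = in.readLine()) != null; ) {
      if (line.trim().equals(host)) {
        cpus++;
      }
    }
    in.close();
    // HOD could then start this node's TaskTracker with
    // mapred.tasktracker.map.tasks.maximum set to this value.
    System.out.println(host + ": " + cpus + " allocated CPU(s)");
  }
}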


(2) In our InputFormat, we use the numSplits hint to tell us how many map
tasks the job's files should be split into. However, HOD does not override
the mapred.map.tasks property (nor mapred.reduce.tasks), whereas these
should be set depending on the number of available task trackers and/or
nodes in the HOD session.


Craig


Re: getting Configuration object in mapper

2008-12-05 Thread Craig Macdonald
I have a related question: I have a class which is both a mapper and a
reducer. How can I tell in configure() whether the current task is a map
task or a reduce task? Parse the task id?
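For reference, a sketch of the task-id parsing approach (the attempt id in
the comment is made up, but map attempts contain _m_ and reduce attempts
contain _r_):

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class MapOrReduceAware extends MapReduceBase {
  private boolean isMapTask;

  public void configure(JobConf job) {
    // e.g. attempt_200812051234_0001_m_000003_0 (hypothetical id)
    String taskId = job.get("mapred.task.id");
    isMapTask = taskId != null && taskId.contains("_m_");
  }
}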


C

Owen O'Malley wrote:


On Dec 4, 2008, at 9:19 PM, abhinit wrote:


I have set some variables using the JobConf object:

jobConf.set("Operator", operator), etc.

How can I get an instance of the Configuration/JobConf object inside
a map method so that I can retrieve these variables?


In your Mapper class, implement a method like:
 public void configure(JobConf job) { ... }

This will be called when the object is created with the job conf.
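A minimal sketch of that pattern, reusing the "Operator" property name
from the question above (illustrative only):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class OperatorMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  private String operator;

  public void configure(JobConf job) {
    // Values set on the JobConf at submission time are available here.
    operator = job.get("Operator");
  }

  public void map(LongWritable key, Text value,
      OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    output.collect(new Text(operator), value);
  }
}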

-- Owen