Re: Hadoop Installation

2008-11-25 Thread Steve Loughran

Mithila Nagendra wrote:

Hey Steve
I deleted whatever I needed to... still no luck.

You said that the classpath might be messed up. Is there some way I can
reset it? For the root user? What path do I set it to?



Let's start with what kind of machine is this? Windows? or Linux. If 
Linux, which one?


Datanode log for errors

2008-11-25 Thread Taeho Kang
Hi,

I have encountered some IOExceptions in the Datanode while some
intermediate/temporary map-reduce data is being written to HDFS.

2008-11-25 18:27:08,070 INFO org.apache.hadoop.dfs.DataNode: writeBlock
blk_-460494523413678075 received exception java.io.IOException: Block
blk_-460494523413678075 is valid, and cannot be written to.
2008-11-25 18:27:08,070 ERROR org.apache.hadoop.dfs.DataNode:
10.31.xx.xxx:50010:DataXceiver: java.io.IOException: Block
blk_-460494523413678075 is valid, and cannot be written to.
at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:616)
at
org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1995)
at
org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
at java.lang.Thread.run(Thread.java:619)
It looks like one of the HDD partitions has a problem with being written to,
but the log doesn't show which partition.
Is there a way to find out?

(Or it could be a new feature for the next version...)

Thanks in advance,

/Taeho


Re: Getting Reduce Output Bytes

2008-11-25 Thread Sharad Agarwal



Is there an easy way to get Reduce Output Bytes?
  
Reduce output bytes are not available directly, but they can perhaps be
inferred from the filesystem read/write bytes counters.
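
For what it's worth, here is a minimal sketch of dumping those counters with the old mapred API (0.19-era classes assumed; the exact group and counter names for the filesystem bytes differ between versions, so this just prints everything and lets you pick out the reduce-side bytes written):

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RunningJob;

public class DumpJobCounters {
  public static void runAndDump(JobConf conf) throws Exception {
    RunningJob job = JobClient.runJob(conf);   // blocks until the job completes
    Counters counters = job.getCounters();
    for (Counters.Group group : counters) {          // every counter group
      for (Counters.Counter counter : group) {       // every counter in the group
        System.out.println(group.getDisplayName() + "\t"
            + counter.getDisplayName() + " = " + counter.getCounter());
      }
    }
  }
}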


  




java.lang.OutOfMemoryError: Direct buffer memory

2008-11-25 Thread tim robertson
Hi all,

I am doing a very simple Map that determines an integer value (1-64000) to
assign to each input record.
The reduce does nothing, but I then use this output format to
put the data in a file per key.

public class CellBasedOutputFormat extends
    MultipleTextOutputFormat<WritableComparable, Writable> {
  @Override
  protected String generateFileNameForKeyValue(WritableComparable key,
      Writable value, String name) {
    return "cell_" + key.toString();
  }
}

I get an out of memory error:
java.lang.OutOfMemoryError: Direct buffer memory
at java.nio.Bits.reserveMemory(Bits.java:633)
at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:95)
at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
at 
org.apache.hadoop.io.compress.zlib.ZlibCompressor.<init>(ZlibCompressor.java:198)
at 
org.apache.hadoop.io.compress.zlib.ZlibCompressor.<init>(ZlibCompressor.java:211)
at 
org.apache.hadoop.io.compress.zlib.ZlibFactory.getZlibCompressor(ZlibFactory.java:83)
at 
org.apache.hadoop.io.compress.DefaultCodec.createCompressor(DefaultCodec.java:59)
at 
org.apache.hadoop.io.compress.DefaultCodec.createOutputStream(DefaultCodec.java:43)
at 
org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:131)
at 
org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:44)
at 
org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:99)
at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:300)
at 
org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
at 
org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)

I will keep this alive for about 24 hours but you can see the errors
here: 
http://ec2-67-202-42-36.compute-1.amazonaws.com:50030/jobtasks.jsp?jobid=job_200811250345_0001&type=reduce&pagenum=1

Please can you offer some advice?
Are my tuning parameters (Map tasks, Reduce tasks) perhaps wrong?

My configuration is:
JobConf conf = new JobConf();
conf.setJobName("OccurrenceByCellSplitter");
conf.setNumMapTasks(10);
conf.setNumReduceTasks(5);

conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(OccurrenceBy1DegCellMapper.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(CellBasedOutputFormat.class);

FileInputFormat.setInputPaths(conf, inputFile);
FileOutputFormat.setOutputPath(conf, outputDirectory);

long time = System.currentTimeMillis();
conf.setJarByClass(OccurrenceBy1DegCellMapper.class);
JobClient.runJob(conf);


Many thanks for any advice,

Tim
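
Not a definitive fix, but the stack trace shows a ZlibCompressor grabbing a direct buffer every time MultipleTextOutputFormat opens another per-key file, so with thousands of keys the direct-buffer space runs out. Two things that may be worth trying (a hedged sketch; the heap and direct-memory sizes below are illustrative, not recommendations):

import org.apache.hadoop.mapred.JobConf;

public class OutputTuning {
  static void applyWorkarounds(JobConf conf) {
    // 1. Switch off compressed job output so TextOutputFormat never creates
    //    a ZlibCompressor (and its direct buffer) for each opened file.
    conf.setBoolean("mapred.output.compress", false);

    // 2. And/or give the child task JVMs more direct-buffer headroom.
    conf.set("mapred.child.java.opts", "-Xmx512m -XX:MaxDirectMemorySize=256m");
  }
}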


Re: Getting Reduce Output Bytes

2008-11-25 Thread Paco NATHAN
Hi Lohit,

Our team collects those kinds of measurements using this patch:
   https://issues.apache.org/jira/browse/HADOOP-4559

Some example Java code in the comments shows how to access the data,
which is serialized as JSON.  Looks like the red_hdfs_bytes_written
value would give you that.

Best,
Paco


On Tue, Nov 25, 2008 at 00:28, lohit [EMAIL PROTECTED] wrote:
 Hello,

 Is there an easy way to get Reduce Output Bytes?

 Thanks,
 Lohit




Re: Block placement in HDFS

2008-11-25 Thread Dhruba Borthakur
Hi Dennis,

There were some discussions on this topic earlier:

http://issues.apache.org/jira/browse/HADOOP-3799

Do you have any specific use-case for this feature?

thanks,
dhruba

On Mon, Nov 24, 2008 at 10:22 PM, Owen O'Malley [EMAIL PROTECTED] wrote:


 On Nov 24, 2008, at 8:44 PM, Mahadev Konar wrote:

  Hi Dennis,
  I don't think that is possible to do.


 No, it is not possible.

   The block placement is determined
 by HDFS internally (which is local, rack local and off rack).


 Actually, it was changed in 0.17 or so to be node-local, off-rack, and a
 second node off rack.

 -- Owen



Hadoop complex calculations

2008-11-25 Thread Chris Quach
Hi,

I'm testing Hadoop to see if we could use it for complex calculations alongside
the 'standard' implementation. I've set up a grid with 10 nodes, and if I run
the RandomTextWriter example only 2 nodes are used as mappers, even though I
specified that 10 mappers should be used. The other nodes are used for storage,
but I want them to execute the map function as well. (I've seen the same
behaviour with my own test program.)

Is there a way to tell the framework to use all available nodes as mappers?
Thanks in advance,

Chris


Re: Getting Reduce Output Bytes

2008-11-25 Thread Lohit
Thanks Sharad and Paco.

Lohit

On Nov 25, 2008, at 5:34 AM, Paco NATHAN [EMAIL PROTECTED] wrote:

Hi Lohit,

Our teams collects those kinds of measurements using this patch:
  https://issues.apache.org/jira/browse/HADOOP-4559

Some example Java code in the comments shows how to access the data,
which is serialized as JSON.  Looks like the red_hdfs_bytes_written
value would give you that.

Best,
Paco


On Tue, Nov 25, 2008 at 00:28, lohit [EMAIL PROTECTED] wrote:
Hello,

Is there an easy way to get Reduce Output Bytes?

Thanks,
Lohit





Re: Hadoop Installation

2008-11-25 Thread Mithila Nagendra
Hey steve
The version is: Linux enpc3740.eas.asu.edu 2.6.9-67.0.20.EL #1 Wed Jun 18
12:23:46 EDT 2008 i686 i686 i386 GNU/Linux, this is what I got when I used
the command uname -a

On Tue, Nov 25, 2008 at 1:50 PM, Steve Loughran [EMAIL PROTECTED] wrote:

 Mithila Nagendra wrote:

 Hey Steve
 I deleted what ever I needed to.. still no luck..

 You said that the classpath might be messed up.. Is there some way I can
 reset it? For the root user? What path do I set it to.


 Let's start with what kind of machine is this? Windows? or Linux. If Linux,
 which one?



Re: Hadoop Installation

2008-11-25 Thread Steve Loughran

Mithila Nagendra wrote:

Hey steve
The version is: Linux enpc3740.eas.asu.edu 2.6.9-67.0.20.EL #1 Wed Jun 18
12:23:46 EDT 2008 i686 i686 i386 GNU/Linux, this is what I got when I used
the command uname -a

On Tue, Nov 25, 2008 at 1:50 PM, Steve Loughran [EMAIL PROTECTED] wrote:


Mithila Nagendra wrote:


Hey Steve
I deleted what ever I needed to.. still no luck..

You said that the classpath might be messed up.. Is there some way I can
reset it? For the root user? What path do I set it to.



Let's start with what kind of machine is this? Windows? or Linux. If Linux,
which one?





OK

1. In yum (Red Hat) or the Synaptic package manager, is there any package
called log4j installed? Or liblog4j?

2. Install Ant, run

  ant -diagnostics

and email us the results.


Question about ChainMapper and ChainReducer

2008-11-25 Thread Tarandeep Singh
Hi,

I would like to know how ChainMapper and ChainReducer save IO.

The doc says the output of the first mapper becomes the input of the second, and
so on. Does this mean the output of the first map is *not* written to HDFS, and
that a second map process is started that operates only on the data generated by
the first map?

In other words, is it safe to assume that if map1 ran on node1 and
produced output D1, then D1 is stored locally on node1 and a second map
process (from the chained map job) operates only on this local D1?

Thanks,
Taran
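
For reference, a rough sketch of how a chained job is wired up with the 0.19 ChainMapper/ChainReducer API. FirstMap, SecondMap and MyReduce are toy placeholders written only so the wiring compiles; they are not part of Hadoop. The chained mappers run inside the same map task, so records flow from one mapper to the next in memory; only the final output of the chain goes through the normal map-output/shuffle path:

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.ChainMapper;
import org.apache.hadoop.mapred.lib.ChainReducer;

public class ChainJobSetup {

  // Toy mapper that reads the job input (LongWritable/Text).
  public static class FirstMap extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, Text> {
    public void map(LongWritable k, Text v, OutputCollector<Text, Text> out,
                    Reporter r) throws IOException {
      out.collect(new Text("k"), v);
    }
  }

  // Toy mapper that consumes the first mapper's output.
  public static class SecondMap extends MapReduceBase
      implements Mapper<Text, Text, Text, Text> {
    public void map(Text k, Text v, OutputCollector<Text, Text> out,
                    Reporter r) throws IOException {
      out.collect(k, v);
    }
  }

  // Toy reducer at the end of the chain.
  public static class MyReduce extends MapReduceBase
      implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text k, Iterator<Text> vals,
                       OutputCollector<Text, Text> out, Reporter r)
        throws IOException {
      while (vals.hasNext()) {
        out.collect(k, vals.next());
      }
    }
  }

  static void configureChain(JobConf job) {
    // First mapper in the chain consumes the job input.
    ChainMapper.addMapper(job, FirstMap.class,
        LongWritable.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // Second mapper is fed the first mapper's records directly, in memory,
    // inside the same map task; nothing is written to HDFS in between.
    ChainMapper.addMapper(job, SecondMap.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));

    // The reducer of the chain; more mappers can be chained after it with
    // ChainReducer.addMapper(...).
    ChainReducer.setReducer(job, MyReduce.class,
        Text.class, Text.class, Text.class, Text.class,
        true, new JobConf(false));
  }
}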


Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
Hi all,

If I want to have an in-memory lookup HashMap that is available in
my Map class, where is the best place to initialise it, please?

I have a shapefile with polygons, and I wish to create the polygon
objects in memory on each node's JVM and have the map able to pull
back the objects by id from some HashMap<Integer, Geometry>.

Is the best way perhaps to just have a static initialiser that is
synchronised so that it only gets run once, called from
Map.configure()? This feels a little dirty.

Thanks for advice on this,

Tim


Re: Lookup HashMap available within the Map

2008-11-25 Thread Alex Loddengaard
You should use the DistributedCache:

http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/


and


http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache


Hope this helps!

Alex
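
A small sketch of the two halves with the old mapred API (the HDFS path is made up for illustration): the driver registers the file, and each task picks up its node-local copy in configure():

import java.net.URI;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;

public class ShapefileCacheExample {

  // Driver side: register an HDFS file so it is copied to every task node.
  static void addShapefileToCache(JobConf conf) throws Exception {
    DistributedCache.addCacheFile(new URI("/user/tim/cells.shp"), conf);
  }

  // Mapper side: resolve the node-local copy once, when the task starts.
  public static class CachedShapeMapper extends MapReduceBase {
    private Path localShapefile;

    public void configure(JobConf conf) {
      try {
        Path[] localFiles = DistributedCache.getLocalCacheFiles(conf);
        localShapefile = localFiles[0];  // local path of cells.shp on this node
        // ... parse the shapefile here and build the in-memory lookup ...
      } catch (Exception e) {
        throw new RuntimeException(e);
      }
    }
  }
}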

On Tue, Nov 25, 2008 at 11:09 AM, tim robertson
[EMAIL PROTECTED]wrote:

 Hi all,

 If I want to have an in memory lookup Hashmap that is available in
 my Map class, where is the best place to initialise this please?

 I have a shapefile with polygons, and I wish to create the polygon
 objects in memory on each node's JVM and have the map able to pull
 back the objects by id from some HashMap<Integer, Geometry>.

 Is perhaps the best way to just have a static initialiser that is
 synchronised so that it only gets run once and called during the
 Map.configure() ?   This feels a little dirty.

 Thanks for advice on this,

 Tim



Re: Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
Hi

Thanks Alex - this will allow me to share the shapefile, but I need to
read it, parse it and store the objects in the index only once per job
per JVM.
Is Mapper.configure() the best place to do this? E.g. will it
only be called once per job?

Thanks

Tim


On Tue, Nov 25, 2008 at 8:12 PM, Alex Loddengaard [EMAIL PROTECTED] wrote:
 You should use the DistributedCache:
 
 http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/


 and

 
 http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache


 Hope this helps!

 Alex

 On Tue, Nov 25, 2008 at 11:09 AM, tim robertson
 [EMAIL PROTECTED]wrote:

 Hi all,

 If I want to have an in memory lookup Hashmap that is available in
 my Map class, where is the best place to initialise this please?

 I have a shapefile with polygons, and I wish to create the polygon
 objects in memory on each node's JVM and have the map able to pull
 back the objects by id from some HashMap<Integer, Geometry>.

 Is perhaps the best way to just have a static initialiser that is
 synchronised so that it only gets run once and called during the
 Map.configure() ?   This feels a little dirty.

 Thanks for advice on this,

 Tim




Re: Lookup HashMap available within the Map

2008-11-25 Thread Doug Cutting

tim robertson wrote:

Thanks Alex - this will allow me to share the shapefile, but I need to
one time only per job per jvm read it, parse it and store the
objects in the index.
Is the Mapper.configure() the best place to do this?  E.g. will it
only be called once per job?


In 0.19, with HADOOP-249, all tasks from a job can be run in a single 
JVM.  So, yes, you could access a static cache from Mapper.configure().


Doug
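
To make that concrete, a hedged sketch of the pattern: the static map is shared by every task that reuses the JVM and is built at most once. The Geometry type is only stood in for by Object here, and loadPolygons is a hypothetical helper, shown as a comment:

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class PolygonLookupMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, Text, Text> {

  // Shared per-JVM cache; built at most once even when tasks reuse the JVM.
  private static Map<Integer, Object> polygonsById;  // Object stands in for your Geometry type

  public void configure(JobConf conf) {
    synchronized (PolygonLookupMapper.class) {
      if (polygonsById == null) {
        polygonsById = new HashMap<Integer, Object>();
        // loadPolygons(conf, polygonsById);  // hypothetical: parse the shapefile once
      }
    }
  }

  public void map(LongWritable key, Text value,
                  OutputCollector<Text, Text> output, Reporter reporter)
      throws IOException {
    // look up polygonsById.get(someId) here and emit whatever you need
  }
}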



Re: Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
Hi Doug,

Thanks - it is not so much that I want to run in a single JVM - I do want a
bunch of machines doing the work; it is just that I want them all to have
this in-memory lookup index, configured once per job.  Is
there some hook somewhere from which I can trigger a read from the
distributed cache, or is Mapper.configure() the best place for this?
Can it be called multiple times per job, meaning I need to keep some
static synchronised indicator flag?

Thanks again,

Tim


On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote:
 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 one time only per job per jvm read it, parse it and store the
 objects in the index.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM.
  So, yes, you could access a static cache from Mapper.configure().

 Doug




Re: Block placement in HDFS

2008-11-25 Thread Pete Wyckoff

Fyi - Owen is referring to:

https://issues.apache.org/jira/browse/HADOOP-2559


On 11/24/08 10:22 PM, Owen O'Malley [EMAIL PROTECTED] wrote:



On Nov 24, 2008, at 8:44 PM, Mahadev Konar wrote:

 Hi Dennis,
  I don't think that is possible to do.

No, it is not possible.

  The block placement is determined
 by HDFS internally (which is local, rack local and off rack).

Actually, it was changed in 0.17 or so to be node-local, off-rack, and
a second node off rack.

-- Owen




Re: Lookup HashMap available within the Map

2008-11-25 Thread tim robertson
Thanks Chris,

I have a different test running, then will implement that.  Might give
cascading a shot for what I am doing.

Cheers

Tim


On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote:
 Hey Tim

 The .configure() method is what you are looking for i believe.

 It is called once per task, which in the default case, is once per jvm.

 Note Jobs are broken into parallel tasks, each task handles a portion of the
 input data. So you may create your map 100 times, because there are 100
 tasks, it will only be created once per jvm.

 I hope this makes sense.

 chris

 On Nov 25, 2008, at 11:46 AM, tim robertson wrote:

 Hi Doug,

 Thanks - it is not so much I want to run in a single JVM - I do want a
 bunch of machines doing the work, it is just I want them all to have
 this in-memory lookup index, that is configured once per job.  Is
 there some hook somewhere that I can trigger a read from the
 distributed cache, or is a Mapper.configure() the best place for this?
 Can it be called multiple times per Job meaning I need to keep some
 static synchronised indicator flag?

 Thanks again,

 Tim


 On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote:

 tim robertson wrote:

 Thanks Alex - this will allow me to share the shapefile, but I need to
 one time only per job per jvm read it, parse it and store the
 objects in the index.
 Is the Mapper.configure() the best place to do this?  E.g. will it
 only be called once per job?

 In 0.19, with HADOOP-249, all tasks from a job can be run in a single
 JVM.
 So, yes, you could access a static cache from Mapper.configure().

 Doug



 --
 Chris K Wensel
 [EMAIL PROTECTED]
 http://chris.wensel.net/
 http://www.cascading.org/




Re: Lookup HashMap available within the Map

2008-11-25 Thread Chris K Wensel
Cool. If you need a hand with Cascading stuff, feel free to ping me on
the mailing list or #cascading IRC. Lots of other friendly folk there
already.


ckw

On Nov 25, 2008, at 12:35 PM, tim robertson wrote:


Thanks Chris,

I have a different test running, then will implement that.  Might give
cascading a shot for what I am doing.

Cheers

Tim


On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED]  
wrote:

Hey Tim

The .configure() method is what you are looking for i believe.

It is called once per task, which in the default case, is once per  
jvm.


Note Jobs are broken into parallel tasks, each task handles a  
portion of the
input data. So you may create your map 100 times, because there are  
100

tasks, it will only be created once per jvm.

I hope this makes sense.

chris

On Nov 25, 2008, at 11:46 AM, tim robertson wrote:


Hi Doug,

Thanks - it is not so much I want to run in a single JVM - I do  
want a

bunch of machines doing the work, it is just I want them all to have
this in-memory lookup index, that is configured once per job.  Is
there some hook somewhere that I can trigger a read from the
distributed cache, or is a Mapper.configure() the best place for  
this?

Can it be called multiple times per Job meaning I need to keep some
static synchronised indicator flag?

Thanks again,

Tim


On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED]  
wrote:


tim robertson wrote:


Thanks Alex - this will allow me to share the shapefile, but I  
need to

one time only per job per jvm read it, parse it and store the
objects in the index.
Is the Mapper.configure() the best place to do this?  E.g. will it
only be called once per job?


In 0.19, with HADOOP-249, all tasks from a job can be run in a  
single

JVM.
So, yes, you could access a static cache from Mapper.configure().

Doug




--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/






Problems running TestDFSIO to a non-default directory

2008-11-25 Thread Joel Welling
Hi Konstantin (et al.);
  A while ago you gave me the following trick to run TestDFSIO to an
output directory other than the default- just use
-Dtest.build.data=/output/dir to pass the new directory to the
executable.  I recall this working, but it is failing now under 0.18.1,
and looking at it I can't see how it ever worked.  The -D option will
set the property on the Java virtual machine which runs as a direct
child of /bin/hadoop, but I see no way the property would get set on the
mapper virtual machines.  Should this still work?  

Thanks,
-Joel

On Thu, 2008-09-04 at 13:05 -0700, Konstantin Shvachko wrote:
 Sure.
 
 bin/hadoop
-Dtest.build.data=/bessemer/welling/hadoop_test/benchmarks/TestDFSIO/
org.apache.hadoop.fs.TestDFSIO -write -nrFiles 2*N -fileSize 360
 
 --Konst
 
 Joel Welling wrote:
  With my setup, I need to change the file directory
  from /benchmarks/TestDFSIO/io_control to something
  like /bessemer/welling/hadoop_test/benchmarks/TestDFSIO/io_control .
Is
  there a command line argument or parameter that will do this?
  Basically, I have to point it explicitly into my Lustre filesystem.
  
  -Joel
  



64 bit namenode and secondary namenode 32 bit datanode

2008-11-25 Thread Sagar Naik

I am trying to migrate from a 32 bit JVM to a 64 bit JVM for the namenode only.
*setup*
NN - 64 bit
Secondary namenode (instance 1) - 64 bit
Secondary namenode (instance 2)  - 32 bit
datanode- 32 bit

From the mailing list I deduced that the NN 64 bit and datanode 32 bit
combo works, but I am not sure if an S-NN (instance 1, 64 bit) and an S-NN
(instance 2, 32 bit) will work with this setup.


Also, should I be aware of any other issues when migrating over to a 64
bit namenode?


Thanks in advance for all the suggestions


-Sagar


Re: 64 bit namenode and secondary namenode 32 bit datanode

2008-11-25 Thread Allen Wittenauer
On 11/25/08 3:58 PM, Sagar Naik [EMAIL PROTECTED] wrote:

 I am trying to migrate from 32 bit jvm and 64 bit for namenode only.
 *setup*
 NN - 64 bit
 Secondary namenode (instance 1) - 64 bit
 Secondary namenode (instance 2)  - 32 bit
 datanode- 32 bit
 
  From the mailing list I deduced that NN-64 bit and Datanode -32 bit
 combo works

Yup.  That's how we run it.

 But, I am not sure if S-NN-(instance 1--- 64 bit ) and S-NN (instance 2
 -- 32 bit) will work with this setup.

Considering that the primary and secondary process essentially the same
data, they should have the same memory requirements.  In other words, if you
need 64-bit for the name node, your secondary is going to require it too.

I'm also not sure if you can have two secondaries.  I'll let someone else
chime in on that. :)



Re: 64 bit namenode and secondary namenode 32 bit datanode

2008-11-25 Thread lohit

I might be wrong, but my assumption is that running the SNN in either 64 or 32
bit shouldn't matter.
But I am curious how two instances of the secondary namenode are set up; will
both of them talk to the same NN and run in parallel?
What are the advantages here?
I am wondering if there are chances of image corruption.

Thanks,
lohit

- Original Message 
From: Sagar Naik [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Tuesday, November 25, 2008 3:58:53 PM
Subject: 64 bit namenode and secondary namenode  32 bit datanode

I am trying to migrate from 32 bit jvm and 64 bit for namenode only.
*setup*
NN - 64 bit
Secondary namenode (instance 1) - 64 bit
Secondary namenode (instance 2)  - 32 bit
datanode- 32 bit

From the mailing list I deduced that NN-64 bit and Datanode -32 bit combo works
But, I am not sure if S-NN-(instance 1--- 64 bit ) and S-NN (instance 2 -- 32 
bit) will work with this setup.

Also, do shud I be aware of any other issues for migrating over to 64 bit 
namenode

Thanks in advance for all the suggestions


-Sagar



Re: 64 bit namenode and secondary namenode 32 bit datanode

2008-11-25 Thread Sagar Naik



lohit wrote:
I might be wrong, but my assumption is running SN either in 64/32 shouldn't matter. 
But I am curious how two instances of Secondary namenode is setup, will both of them talk to same NN and running in parallel? 
what are the advantages here.
  
I just have multiple entries in the masters file. I am not aware of the image
corruption issue (I did not look into it). I did it for SNN redundancy.

Please correct me if I am wrong.
Thanks
Sagar

Wondering if there are chances of image corruption.

Thanks,
lohit

- Original Message 
From: Sagar Naik [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Tuesday, November 25, 2008 3:58:53 PM
Subject: 64 bit namenode and secondary namenode  32 bit datanode

I am trying to migrate from 32 bit jvm and 64 bit for namenode only.
*setup*
NN - 64 bit
Secondary namenode (instance 1) - 64 bit
Secondary namenode (instance 2)  - 32 bit
datanode- 32 bit

From the mailing list I deduced that NN-64 bit and Datanode -32 bit combo works
But, I am not sure if S-NN-(instance 1--- 64 bit ) and S-NN (instance 2 -- 32 
bit) will work with this setup.

Also, do shud I be aware of any other issues for migrating over to 64 bit 
namenode

Thanks in advance for all the suggestions


-Sagar

  




Re: 64 bit namenode and secondary namenode 32 bit datanode

2008-11-25 Thread lohit
Well, if I think about it, image corruption might not happen, since each
checkpoint initiation would have a unique number.

I was just wondering what would happen in this case
Consider this scenario.
Time 1 -- SN1 asks NN image and edits to merge
Time 2 -- SN2 asks NN image and edits to merge
Time 2 -- SN2 returns new image
Time 3 -- SN1 returns new image. 
I am not sure what happens here, but it's best to test it out before setting up
something like this.

And if you have multiple entries in the NN file, then one SNN checkpoint would
update all NN entries, so a redundant SNN isn't buying you much.

Thanks,
Lohit



- Original Message 
From: Sagar Naik [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Tuesday, November 25, 2008 4:32:26 PM
Subject: Re: 64 bit namenode and secondary namenode  32 bit datanode



lohit wrote:
 I might be wrong, but my assumption is running SN either in 64/32 shouldn't 
 matter. 
 But I am curious how two instances of Secondary namenode is setup, will both 
 of them talk to same NN and running in parallel? 
 what are the advantages here.
  
I just have multiple entries master file. I am not aware of image 
corruption (did not take look into it). I did for SNN redundancy
Pl correct me if I am wrong
Thanks
Sagar
 Wondering if there are chances of image corruption.

 Thanks,
 lohit

 - Original Message 
 From: Sagar Naik [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Tuesday, November 25, 2008 3:58:53 PM
 Subject: 64 bit namenode and secondary namenode  32 bit datanode

 I am trying to migrate from 32 bit jvm and 64 bit for namenode only.
 *setup*
 NN - 64 bit
 Secondary namenode (instance 1) - 64 bit
 Secondary namenode (instance 2)  - 32 bit
 datanode- 32 bit

 From the mailing list I deduced that NN-64 bit and Datanode -32 bit combo 
 works
 But, I am not sure if S-NN-(instance 1--- 64 bit ) and S-NN (instance 2 -- 32 
 bit) will work with this setup.

 Also, do shud I be aware of any other issues for migrating over to 64 bit 
 namenode

 Thanks in advance for all the suggestions


 -Sagar

  


Filesystem closed errors

2008-11-25 Thread Bryan Duxbury
I have an app that runs for a long time with no problems, but when I  
signal it to shut down, I get errors like this:


java.io.IOException: Filesystem closed
at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:196)
at org.apache.hadoop.dfs.DFSClient.rename(DFSClient.java:502)
	at org.apache.hadoop.dfs.DistributedFileSystem.rename 
(DistributedFileSystem.java:176)


The problems occur when I am trying to close open HDFS files. Any  
ideas why I might be seeing this? I though it was because I was  
abruptly shutting down without giving the streams a chance to get  
closed, but after some refactoring, that's not the case.


-Bryan


Re: Block placement in HDFS

2008-11-25 Thread Hyunsik Choi
Hi All,

I am trying to divide some data into partitions explicitly (like the regions
of HBase), and I wonder whether the following way of doing it is the best
method.

For example, assuming a block size of 64MB, is the file portion
corresponding to 0~63MB allocated to the first block?

I have three questions:

Is the above method valid?
Is it the best method?
Is there an alternative method?

Thanks in advance.

-- 
Hyunsik Choi
Database & Information Systems Group
Dept. of Computer Science & Engineering, Korea University


On Mon, 2008-11-24 at 20:44 -0800, Mahadev Konar wrote:
 Hi Dennis,
   I don't think that is possible to do.  The block placement is determined
 by HDFS internally (which is local, rack local and off rack).
 
 
 mahadev
 
 
 On 11/24/08 6:59 PM, dennis81 [EMAIL PROTECTED] wrote:
 
  
  Hi everyone,
  
  I was wondering whether it is possible to control the placement of the
  blocks of a file in HDFS. Is it possible to instruct HDFS about which nodes
  will hold the block replicas?
  
  Thanks!




HDFS directory listing from the Java API?

2008-11-25 Thread Shane Butler
Hi all,

Can someone please guide me on how to get a directory listing of files on
HDFS using the Java API (0.19.0)?

Regards,
Shane
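
A minimal sketch (0.19 API assumed; the directory to list is taken from the command line):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ListHdfsDir {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();  // picks up hadoop-site.xml on the classpath
    FileSystem fs = FileSystem.get(conf);      // the configured default (HDFS) filesystem
    FileStatus[] entries = fs.listStatus(new Path(args[0]));
    for (FileStatus status : entries) {
      System.out.println((status.isDir() ? "d " : "- ")
          + status.getPath() + " " + status.getLen());
    }
  }
}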


Re: Filesystem closed errors

2008-11-25 Thread David B. Ritch
Do you have speculative execution enabled?  I've seen error messages
like this caused by speculative execution.

David

Bryan Duxbury wrote:
 I have an app that runs for a long time with no problems, but when I
 signal it to shut down, I get errors like this:

 java.io.IOException: Filesystem closed
 at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:196)
 at org.apache.hadoop.dfs.DFSClient.rename(DFSClient.java:502)
 at
 org.apache.hadoop.dfs.DistributedFileSystem.rename(DistributedFileSystem.java:176)


 The problems occur when I am trying to close open HDFS files. Any
 ideas why I might be seeing this? I though it was because I was
 abruptly shutting down without giving the streams a chance to get
 closed, but after some refactoring, that's not the case.

 -Bryan




Re: Filesystem closed errors

2008-11-25 Thread Hong Tang
Does your code ever call fs.close()? If so,
https://issues.apache.org/jira/browse/HADOOP-4655 might be relevant to your problem.
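
For context, FileSystem.get() hands back a cached instance that is shared across the process, so a close() anywhere closes it for every other holder. A rough sketch of the failure mode (the paths are made up):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedFsClose {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs1 = FileSystem.get(conf);
    FileSystem fs2 = FileSystem.get(conf);   // same cached instance as fs1

    fs1.close();                             // closes the shared client

    // Any later use of the "other" handle now fails with
    // java.io.IOException: Filesystem closed
    fs2.rename(new Path("/tmp/a"), new Path("/tmp/b"));
  }
}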


On Nov 25, 2008, at 9:07 PM, David B. Ritch wrote:


Do you have speculative execution enabled?  I've seen error messages
like this caused by speculative execution.

David

Bryan Duxbury wrote:

I have an app that runs for a long time with no problems, but when I
signal it to shut down, I get errors like this:

java.io.IOException: Filesystem closed
at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:196)
at org.apache.hadoop.dfs.DFSClient.rename(DFSClient.java:502)
at
org.apache.hadoop.dfs.DistributedFileSystem.rename 
(DistributedFileSystem.java:176)



The problems occur when I am trying to close open HDFS files. Any
ideas why I might be seeing this? I though it was because I was
abruptly shutting down without giving the streams a chance to get
closed, but after some refactoring, that's not the case.

-Bryan







How to retrieve rack ID of a datanode

2008-11-25 Thread Ramya R
Hi all,

 

I want to retrieve the Rack ID of every datanode. How can I do this?

I tried using getNetworkLocation() in
org.apache.hadoop.hdfs.protocol.DatanodeInfo. I am getting /default-rack
as the output for all datanodes.

 

Any advice?

 

Thanks in advance

Ramya

 



Re: How to retrieve rack ID of a datanode

2008-11-25 Thread Amar Kamat

Ramya R wrote:

Hi all,

 


I want to retrieve the Rack ID of every datanode. How can I do this?

I tried using getNetworkLocation() in
org.apache.hadoop.hdfs.protocol.DatanodeInfo. I am getting /default-rack
as the output for all datanodes.

  
Have you set up the cluster to be rack-aware? At least in MR we have to
provide a script that resolves the rack for a given node. It might be
similar for DFS too. See the topology.script.file.name parameter in
hadoop-default.xml for more details.

Amar
 


Any advice?

 


Thank in advance

Ramya

 



  




Re: 64 bit namenode and secondary namenode 32 bit datanode

2008-11-25 Thread Dhruba Borthakur
The design is such that running multiple secondary namenodes should not
corrupt the image (modulo any bugs). Are you seeing image corruptions when
this happens?

You can run all or any daemons in 32-bit mode or 64-bit mode. You can
mix-and-match. If you have many millions of files, then you might want to
allocate more than 3GB of heap space to the namenode and secondary namenode. In
that case, you will have to run the namenode and secondary namenode using a
64-bit JVM.

dhruba


On Tue, Nov 25, 2008 at 4:39 PM, lohit [EMAIL PROTECTED] wrote:

 Well, if I think about,  image corruption might not happen, since each
 checkpoint initiation would have unique number.

 I was just wondering what would happen in this case
 Consider this scenario.
 Time 1 -- SN1 asks NN image and edits to merge
 Time 2 -- SN2 asks NN image and edits to merge
 Time 2 -- SN2 returns new image
 Time 3 -- SN1 returns new image.
 I am not sure what happens here, but its best to test it out before setting
 up something like this.

 And if you have multiple entries in NN file, then one SNN checkpoint would
 update all NN entries, so redundant SNN isnt buying you much.

 Thanks,
 Lohit



 - Original Message 
 From: Sagar Naik [EMAIL PROTECTED]
 To: core-user@hadoop.apache.org
 Sent: Tuesday, November 25, 2008 4:32:26 PM
 Subject: Re: 64 bit namenode and secondary namenode  32 bit datanode



 lohit wrote:
  I might be wrong, but my assumption is running SN either in 64/32
 shouldn't matter.
  But I am curious how two instances of Secondary namenode is setup, will
 both of them talk to same NN and running in parallel?
  what are the advantages here.
 
 I just have multiple entries master file. I am not aware of image
 corruption (did not take look into it). I did for SNN redundancy
 Pl correct me if I am wrong
 Thanks
 Sagar
  Wondering if there are chances of image corruption.
 
  Thanks,
  lohit
 
  - Original Message 
  From: Sagar Naik [EMAIL PROTECTED]
  To: core-user@hadoop.apache.org
  Sent: Tuesday, November 25, 2008 3:58:53 PM
  Subject: 64 bit namenode and secondary namenode  32 bit datanode
 
  I am trying to migrate from 32 bit jvm and 64 bit for namenode only.
  *setup*
  NN - 64 bit
  Secondary namenode (instance 1) - 64 bit
  Secondary namenode (instance 2)  - 32 bit
  datanode- 32 bit
 
  From the mailing list I deduced that NN-64 bit and Datanode -32 bit combo
 works
  But, I am not sure if S-NN-(instance 1--- 64 bit ) and S-NN (instance 2
 -- 32 bit) will work with this setup.
 
  Also, do shud I be aware of any other issues for migrating over to 64 bit
 namenode
 
  Thanks in advance for all the suggestions
 
 
  -Sagar
 
 



RE: How to retrieve rack ID of a datanode

2008-11-25 Thread Ramya R
Hi Lohit,
  I have not set up the datanode to tell the namenode which rack it belongs to.
Can you please tell me how I do it? Is it done using setNetworkLocation()?
 
My intention is to kill the datanodes in a given rack. So it would be
useful even if I obtain the subnet each datanode belongs to.

Thanks
Ramya

-Original Message-
From: lohit [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 26, 2008 12:26 PM
To: core-user@hadoop.apache.org
Subject: Re: How to retrieve rack ID of a datanode

/default-rack is set when datanode has not set rackID. It is upto the
datanode to tell namenode which rack it belongs to.
Is your datanode doing that explicitly ?
-Lohit



- Original Message 
From: Ramya R [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Tuesday, November 25, 2008 10:36:46 PM
Subject: How to retrieve rack ID of a datanode

Hi all,



I want to retrieve the Rack ID of every datanode. How can I do this?

I tried using getNetworkLocation() in
org.apache.hadoop.hdfs.protocol.DatanodeInfo. I am getting /default-rack
as the output for all datanodes.



Any advice?



Thank in advance

Ramya


Switching to HBase from HDFS

2008-11-25 Thread Shimi K
I have a system which uses HDFS to store files on multiple nodes. On
each HDFS node machine I have another application which reads the
local files. Until know my system worked only with files, HDFS seemed
like the right solution and everything worked fine. Now I need to save
additional information for every file. I thought that I might create a
central database and in this database I will create a table which will
map file name with the new data. I don't think that this is a good
solution since I will need to query this new data for each file. I
thought that since HBase is built on top of HDFS it might be better to
use it instead of a database. With HBase I will have each file
together with the new data locally on each node. I can read each file
together with any additional information.

Since I have never used HBase, I want to ask the community whether HBase is the
right solution for my case.

--Shimi


Re: How to retrieve rack ID of a datanode

2008-11-25 Thread Yi-Kai Tsai

hi Ramya

Set up topology.script.file.name in your hadoop-site.xml and provide the script.

Check http://hadoop.apache.org/core/docs/current/cluster_setup.html, the
Hadoop Rack Awareness section.
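
Once the script is in place, something like this sketch should print the resolved rack for every datanode. It assumes the default filesystem is HDFS (so the cast works) and recent package names; older releases use org.apache.hadoop.dfs instead of org.apache.hadoop.hdfs:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

public class PrintRacks {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
    for (DatanodeInfo node : dfs.getDataNodeStats()) {
      // With a topology script configured, getNetworkLocation() returns the
      // resolved rack instead of /default-rack.
      System.out.println(node.getName() + " -> " + node.getNetworkLocation());
    }
  }
}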




Hi Lohit,
  I have not set the datanode to tell namenode which rack it belongs to.
Can you please tell me how do I do it? Is it using setNetworkLocation()?
 
My intention is to kill the datanodes in a given rack. So it would be

useful even if I obtain the subnet each datanode belongs to.

Thanks
Ramya

-Original Message-
From: lohit [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, November 26, 2008 12:26 PM

To: core-user@hadoop.apache.org
Subject: Re: How to retrieve rack ID of a datanode

/default-rack is set when datanode has not set rackID. It is upto the
datanode to tell namenode which rack it belongs to.
Is your datanode doing that explicitly ?
-Lohit



- Original Message 
From: Ramya R [EMAIL PROTECTED]
To: core-user@hadoop.apache.org
Sent: Tuesday, November 25, 2008 10:36:46 PM
Subject: How to retrieve rack ID of a datanode

Hi all,



I want to retrieve the Rack ID of every datanode. How can I do this?

I tried using getNetworkLocation() in
org.apache.hadoop.hdfs.protocol.DatanodeInfo. I am getting /default-rack
as the output for all datanodes.



Any advice?



Thank in advance

Ramya
  



--
Yi-Kai Tsai (cuma) [EMAIL PROTECTED], Asia Regional Search Engineering.



Re: Switching to HBase from HDFS

2008-11-25 Thread Yi-Kai Tsai

Hi Shimi

HBase (or BigTable) is a sparse, distributed, persistent,
multidimensional sorted map.

Jim R. Wilson has an excellent article for understanding it:
http://jimbojw.com/wiki/index.php?title=Understanding_HBase_and_BigTable



I have a system which uses HDFS to store files on multiple nodes. On
each HDFS node machine I have another application which reads the
local files. Until know my system worked only with files, HDFS seemed
like the right solution and everything worked fine. Now I need to save
additional information for every file. I thought that I might create a
central database and in this database I will create a table which will
map file name with the new data. I don't think that this is a good
solution since I will need to query this new data for each file. I
thought that since HBase is built on top of HDFS it might be better to
use it instead of a database. With HBase I will have each file
together with the new data locally on each node. I can read each file
together with any additional information.

Since I never used HBase I want to ask the community if HBase is the
right solution for my case?

--Shimi
  



--
Yi-Kai Tsai (cuma) [EMAIL PROTECTED], Asia Regional Search Engineering.



How we get an old version of Hadoop

2008-11-25 Thread Rashid Ahmad
Dear Friends,

How do we get an old version of Hadoop?

-- 
Regards,
Rashid Ahmad


how can I decommission nodes on-the-fly?

2008-11-25 Thread Jeremy Chow
Hi list,

 I added a property dfs.hosts.exclude to my conf/hadoop-site.xml, then
refreshed my cluster with the command
 bin/hadoop dfsadmin -refreshNodes
It showed that this only shuts down the DataNode process, but not the
TaskTracker process, on each slave specified in the excludes file.
The jobtracker web UI still shows that I have not shut down these nodes.
How can I totally decommission these slave nodes on-the-fly? Can this be
achieved only by operating on the master node?

Thanks,
Jeremy
-- 
My research interests are distributed systems, parallel computing and
bytecode based virtual machine.

http://coderplay.javaeye.com



Re: how can I decommission nodes on-the-fly?

2008-11-25 Thread Amareshwari Sriramadasu

Jeremy Chow wrote:

Hi list,

 I added a property dfs.hosts.exclude to my conf/hadoop-site.xml. Then
refreshed my cluster with command
 bin/hadoop dfsadmin -refreshNodes
It showed that it can only shut down the DataNode process but not included
the TaskTracker process on each slaver specified in the excludes file.
  

Presently, decommissioning TaskTracker on-the-fly is not available.

The jobtracker web still show that I hadnot shut down these nodes.
How can i totally decommission these slaver nodes on-the-fly? Is it can be
achieved only by operation on the master node?

  

I think one way to shut down a TaskTracker is to kill it.

Thanks
Amareshwari

Thanks,
Jeremy