Re: Hadoop Installation
Mithila Nagendra wrote: Hey Steve, I deleted whatever I needed to.. still no luck. You said that the classpath might be messed up. Is there some way I can reset it? For the root user? What path do I set it to? Let's start with: what kind of machine is this? Windows or Linux? If Linux, which one?
Datanode log for errors
Hi, I have encountered some IOExceptions in the Datanode while some intermediate/temporary map-reduce data is written to HDFS:

2008-11-25 18:27:08,070 INFO org.apache.hadoop.dfs.DataNode: writeBlock blk_-460494523413678075 received exception java.io.IOException: Block blk_-460494523413678075 is valid, and cannot be written to.
2008-11-25 18:27:08,070 ERROR org.apache.hadoop.dfs.DataNode: 10.31.xx.xxx:50010:DataXceiver: java.io.IOException: Block blk_-460494523413678075 is valid, and cannot be written to.
    at org.apache.hadoop.dfs.FSDataset.writeToBlock(FSDataset.java:616)
    at org.apache.hadoop.dfs.DataNode$BlockReceiver.<init>(DataNode.java:1995)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.writeBlock(DataNode.java:1074)
    at org.apache.hadoop.dfs.DataNode$DataXceiver.run(DataNode.java:938)
    at java.lang.Thread.run(Thread.java:619)

It looks like one of the HDD partitions has a problem with being written to, but the log doesn't show which partition. Is there a way to find it out? (Or it could be a new feature for the next version...) Thanks in advance, /Taeho
Re: Getting Reduce Output Bytes
Is there an easy way to get Reduce Output Bytes? Reduce output bytes are not available directly, but they can perhaps be inferred from the FileSystem read/write byte counters.
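A minimal sketch of reading those counters from a completed job with the old mapred API; the "FileSystemCounters" / "HDFS_BYTES_WRITTEN" group and counter names are an assumption for this era and may differ by version:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    JobConf conf = new JobConf();
    // ... configure the job as usual ...
    RunningJob job = JobClient.runJob(conf);   // blocks until the job completes
    Counters counters = job.getCounters();
    // infer reduce output size from what the reduce tasks wrote to HDFS
    long hdfsBytesWritten = counters.getGroup("FileSystemCounters")
                                    .getCounter("HDFS_BYTES_WRITTEN");
    System.out.println("approx. reduce output bytes: " + hdfsBytesWritten);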
java.lang.OutOfMemoryError: Direct buffer memory
Hi all, I am doing a very simple Map that determines an integer value to assign to an input (1-64000). The reduction does nothing, but I then use this output formatter to put the data in a file per key:

public class CellBasedOutputFormat
    extends MultipleTextOutputFormat<WritableComparable, Writable> {
  @Override
  protected String generateFileNameForKeyValue(WritableComparable key,
      Writable value, String name) {
    return "cell_" + key.toString();
  }
}

I get an out of memory error:

java.lang.OutOfMemoryError: Direct buffer memory
    at java.nio.Bits.reserveMemory(Bits.java:633)
    at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:95)
    at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:288)
    at org.apache.hadoop.io.compress.zlib.ZlibCompressor.<init>(ZlibCompressor.java:198)
    at org.apache.hadoop.io.compress.zlib.ZlibCompressor.<init>(ZlibCompressor.java:211)
    at org.apache.hadoop.io.compress.zlib.ZlibFactory.getZlibCompressor(ZlibFactory.java:83)
    at org.apache.hadoop.io.compress.DefaultCodec.createCompressor(DefaultCodec.java:59)
    at org.apache.hadoop.io.compress.DefaultCodec.createOutputStream(DefaultCodec.java:43)
    at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:131)
    at org.apache.hadoop.mapred.lib.MultipleTextOutputFormat.getBaseRecordWriter(MultipleTextOutputFormat.java:44)
    at org.apache.hadoop.mapred.lib.MultipleOutputFormat$1.write(MultipleOutputFormat.java:99)
    at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:300)
    at org.apache.hadoop.mapred.lib.IdentityReducer.reduce(IdentityReducer.java:39)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:318)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)

I will keep this alive for about 24 hours but you can see the errors here: http://ec2-67-202-42-36.compute-1.amazonaws.com:50030/jobtasks.jsp?jobid=job_200811250345_0001&type=reduce&pagenum=1

Please can you offer some advice? Are my tuning parameters (map tasks, reduce tasks) perhaps wrong? My configuration is:

JobConf conf = new JobConf();
conf.setJobName("OccurrenceByCellSplitter");
conf.setNumMapTasks(10);
conf.setNumReduceTasks(5);
conf.setOutputKeyClass(Text.class);
conf.setOutputValueClass(Text.class);
conf.setMapperClass(OccurrenceBy1DegCellMapper.class);
conf.setInputFormat(TextInputFormat.class);
conf.setOutputFormat(CellBasedOutputFormat.class);
FileInputFormat.setInputPaths(conf, inputFile);
FileOutputFormat.setOutputPath(conf, outputDirectory);
long time = System.currentTimeMillis();
conf.setJarByClass(OccurrenceBy1DegCellMapper.class);
JobClient.runJob(conf);

Many thanks for any advice, Tim
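Worth noting: the trace shows a zlib compressor, backed by a direct ByteBuffer, being allocated for each per-key output file that is opened, so many distinct keys mean many direct buffers. A speculative mitigation, not a confirmed fix:

    // turn off output compression so no direct-buffer-backed compressors
    // are allocated per open output file
    conf.setBoolean("mapred.output.compress", false);
    // or raise the child JVM's direct-memory ceiling (values illustrative)
    conf.set("mapred.child.java.opts", "-Xmx512m -XX:MaxDirectMemorySize=256m");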
Re: Getting Reduce Output Bytes
Hi Lohit, Our team collects those kinds of measurements using this patch: https://issues.apache.org/jira/browse/HADOOP-4559 Some example Java code in the comments shows how to access the data, which is serialized as JSON. It looks like the red_hdfs_bytes_written value would give you that. Best, Paco On Tue, Nov 25, 2008 at 00:28, lohit [EMAIL PROTECTED] wrote: Hello, Is there an easy way to get Reduce Output Bytes? Thanks, Lohit
Re: Block placement in HDFS
Hi Dennis, There were some discussions on this topic earlier: http://issues.apache.org/jira/browse/HADOOP-3799 Do you have any specific use-case for this feature? thanks, dhruba On Mon, Nov 24, 2008 at 10:22 PM, Owen O'Malley [EMAIL PROTECTED] wrote: On Nov 24, 2008, at 8:44 PM, Mahadev Konar wrote: Hi Dennis, I don't think that is possible to do. No, it is not possible. The block placement is determined by HDFS internally (which is local, rack local and off rack). Actually, it was changed in 0.17 or so to be node-local, off-rack, and a second node off rack. -- Owen
Hadoop complex calculations
Hi, I'm testing Hadoop to see if we could use it for complex calculations next to the 'standard' implementation. I've set up a grid with 10 nodes, and if I run the RandomTextWriter example only 2 nodes are used as mappers, while I specified 10 mappers to be used. The other nodes are used for storage, but I want them to also execute the map function. (I've had this same behaviour with my own test program.) Is there a way to tell the framework to use all available nodes as mappers? Thanks in advance, Chris
Re: Getting Reduce Output Bytes
Thanks Sharad and Paco. Lohit On Nov 25, 2008, at 5:34 AM, Paco NATHAN [EMAIL PROTECTED] wrote: Hi Lohit, Our team collects those kinds of measurements using this patch: https://issues.apache.org/jira/browse/HADOOP-4559 Some example Java code in the comments shows how to access the data, which is serialized as JSON. It looks like the red_hdfs_bytes_written value would give you that. Best, Paco On Tue, Nov 25, 2008 at 00:28, lohit [EMAIL PROTECTED] wrote: Hello, Is there an easy way to get Reduce Output Bytes? Thanks, Lohit
Re: Hadoop Installation
Hey Steve, The version is: Linux enpc3740.eas.asu.edu 2.6.9-67.0.20.EL #1 Wed Jun 18 12:23:46 EDT 2008 i686 i686 i386 GNU/Linux. This is what I got when I used the command uname -a. On Tue, Nov 25, 2008 at 1:50 PM, Steve Loughran [EMAIL PROTECTED] wrote: Mithila Nagendra wrote: Hey Steve, I deleted whatever I needed to.. still no luck. You said that the classpath might be messed up. Is there some way I can reset it? For the root user? What path do I set it to? Let's start with: what kind of machine is this? Windows or Linux? If Linux, which one?
Re: Hadoop Installation
Mithila Nagendra wrote: Hey Steve, The version is: Linux enpc3740.eas.asu.edu 2.6.9-67.0.20.EL #1 Wed Jun 18 12:23:46 EDT 2008 i686 i686 i386 GNU/Linux. This is what I got when I used the command uname -a. On Tue, Nov 25, 2008 at 1:50 PM, Steve Loughran [EMAIL PROTECTED] wrote: Mithila Nagendra wrote: Hey Steve, I deleted whatever I needed to.. still no luck. You said that the classpath might be messed up. Is there some way I can reset it? For the root user? What path do I set it to? Let's start with: what kind of machine is this? Windows or Linux? If Linux, which one?

OK.
1. In yum (redhat) or the synaptic package manager, is there any package called log4j installed? Or liblog4j?
2. Install ant, run ant -diagnostics, and email us the results.
Question about ChainMapper and ChainReducer
Hi, I would like to know how ChainMapper and ChainReducer save IO. The doc says the output of the first mapper becomes the input of the second, and so on. So does this mean the output of the first map is *not* written to HDFS, and a second map process is started that operates on the data generated by the first map only? In other words, is it safe to assume that if map1 ran on node1 and produced output D1, then this D1 is stored locally on node1 and a second map process (from the chained map job) operates only on this local D1? Thanks, Taran
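For reference, a minimal sketch of wiring two chained mappers with the 0.19 ChainMapper API; AMapper, BMapper and MyJob are hypothetical classes. Records pass from one mapper to the next inside the same task, rather than through HDFS:

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.ChainMapper;

    JobConf job = new JobConf(MyJob.class);   // MyJob: hypothetical driver class
    // AMapper's output records are handed to BMapper in memory, within the
    // same map task; nothing is written to HDFS between the two
    ChainMapper.addMapper(job, AMapper.class, LongWritable.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));
    ChainMapper.addMapper(job, BMapper.class, Text.class, Text.class,
        Text.class, Text.class, true, new JobConf(false));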
Lookup HashMap available within the Map
Hi all, If I want to have an in-memory lookup HashMap that is available in my Map class, where is the best place to initialise this, please? I have a shapefile with polygons, and I wish to create the polygon objects in memory on each node's JVM and have the map able to pull back the objects by id from some HashMap<Integer, Geometry>. Is perhaps the best way to just have a static initialiser that is synchronised so that it only gets run once, called during Mapper.configure()? This feels a little dirty. Thanks for advice on this, Tim
Re: Lookup HashMap available within the Map
You should use the DistributedCache: http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/ and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache Hope this helps! Alex On Tue, Nov 25, 2008 at 11:09 AM, tim robertson [EMAIL PROTECTED] wrote: Hi all, If I want to have an in-memory lookup HashMap that is available in my Map class, where is the best place to initialise this, please? I have a shapefile with polygons, and I wish to create the polygon objects in memory on each node's JVM and have the map able to pull back the objects by id from some HashMap<Integer, Geometry>. Is perhaps the best way to just have a static initialiser that is synchronised so that it only gets run once, called during Mapper.configure()? This feels a little dirty. Thanks for advice on this, Tim
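For reference, a minimal sketch of the DistributedCache calls the links describe; the file path is an example:

    import java.net.URI;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;

    // when submitting the job: ship an HDFS file to every task node
    DistributedCache.addCacheFile(new URI("/user/tim/shapes.shp"), conf);

    // inside Mapper.configure(JobConf job): locate the node-local copy
    Path[] localFiles = DistributedCache.getLocalCacheFiles(job);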
Re: Lookup HashMap available within the Map
Hi, Thanks Alex - this will allow me to share the shapefile, but I need to read it, parse it and store the objects in the index only once per job per JVM. Is Mapper.configure() the best place to do this? E.g. will it only be called once per job? Thanks, Tim On Tue, Nov 25, 2008 at 8:12 PM, Alex Loddengaard [EMAIL PROTECTED] wrote: You should use the DistributedCache: http://www.cloudera.com/blog/2008/11/14/sending-files-to-remote-task-nodes-with-hadoop-mapreduce/ and http://hadoop.apache.org/core/docs/current/mapred_tutorial.html#DistributedCache Hope this helps! Alex On Tue, Nov 25, 2008 at 11:09 AM, tim robertson [EMAIL PROTECTED] wrote: Hi all, If I want to have an in-memory lookup HashMap that is available in my Map class, where is the best place to initialise this, please? I have a shapefile with polygons, and I wish to create the polygon objects in memory on each node's JVM and have the map able to pull back the objects by id from some HashMap<Integer, Geometry>. Is perhaps the best way to just have a static initialiser that is synchronised so that it only gets run once, called during Mapper.configure()? This feels a little dirty. Thanks for advice on this, Tim
Re: Lookup HashMap available within the Map
tim robertson wrote: Thanks Alex - this will allow me to share the shapefile, but I need to read it, parse it and store the objects in the index only once per job per JVM. Is Mapper.configure() the best place to do this? E.g. will it only be called once per job? In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM. So, yes, you could access a static cache from Mapper.configure(). Doug
Re: Lookup HashMap available within the Map
Hi Doug, Thanks - it is not so much that I want to run in a single JVM - I do want a bunch of machines doing the work; it is just that I want them all to have this in-memory lookup index, configured once per job. Is there some hook somewhere that lets me trigger a read from the distributed cache, or is Mapper.configure() the best place for this? Can it be called multiple times per job, meaning I need to keep some static synchronised indicator flag? Thanks again, Tim On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote: tim robertson wrote: Thanks Alex - this will allow me to share the shapefile, but I need to read it, parse it and store the objects in the index only once per job per JVM. Is Mapper.configure() the best place to do this? E.g. will it only be called once per job? In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM. So, yes, you could access a static cache from Mapper.configure(). Doug
Re: Block placement in HDFS
Fyi - Owen is referring to: https://issues.apache.org/jira/browse/HADOOP-2559 On 11/24/08 10:22 PM, Owen O'Malley [EMAIL PROTECTED] wrote: On Nov 24, 2008, at 8:44 PM, Mahadev Konar wrote: Hi Dennis, I don't think that is possible to do. No, it is not possible. The block placement is determined by HDFS internally (which is local, rack local and off rack). Actually, it was changed in 0.17 or so to be node-local, off-rack, and a second node off rack. -- Owen
Re: Lookup HashMap available within the Map
Thanks Chris, I have a different test running, then will implement that. Might give Cascading a shot for what I am doing. Cheers Tim On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote: Hey Tim, The .configure() method is what you are looking for, I believe. It is called once per task, which in the default case is once per JVM. Note that jobs are broken into parallel tasks; each task handles a portion of the input data. So while you may create your map 100 times because there are 100 tasks, it will only be created once per JVM. I hope this makes sense. chris On Nov 25, 2008, at 11:46 AM, tim robertson wrote: Hi Doug, Thanks - it is not so much that I want to run in a single JVM - I do want a bunch of machines doing the work; it is just that I want them all to have this in-memory lookup index, configured once per job. Is there some hook somewhere that lets me trigger a read from the distributed cache, or is Mapper.configure() the best place for this? Can it be called multiple times per job, meaning I need to keep some static synchronised indicator flag? Thanks again, Tim On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote: tim robertson wrote: Thanks Alex - this will allow me to share the shapefile, but I need to read it, parse it and store the objects in the index only once per job per JVM. Is Mapper.configure() the best place to do this? E.g. will it only be called once per job? In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM. So, yes, you could access a static cache from Mapper.configure(). Doug -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
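A sketch of the configure()-based pattern discussed in this thread: one static index per JVM, guarded so concurrent tasks initialise it only once. Geometry and loadShapefile() are hypothetical stand-ins for the shapefile parsing:

    import java.io.IOException;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class PolygonLookupMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, Text> {

      // one copy per JVM; tasks that reuse the JVM share it
      private static Map<Integer, Geometry> index;

      public void configure(JobConf job) {
        synchronized (PolygonLookupMapper.class) {
          if (index == null) {
            index = loadShapefile(job); // hypothetical: parse the cached shapefile once
          }
        }
      }

      public void map(LongWritable key, Text value,
          OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // ... look up geometries with index.get(id) and emit results ...
      }
    }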
Re: Lookup HashMap available within the Map
cool. If you need a hand with Cascading stuff, feel free to ping me on the mail list or #cascading irc. Lots of other friendly folk there already. ckw On Nov 25, 2008, at 12:35 PM, tim robertson wrote: Thanks Chris, I have a different test running, then will implement that. Might give Cascading a shot for what I am doing. Cheers Tim On Tue, Nov 25, 2008 at 9:24 PM, Chris K Wensel [EMAIL PROTECTED] wrote: Hey Tim, The .configure() method is what you are looking for, I believe. It is called once per task, which in the default case is once per JVM. Note that jobs are broken into parallel tasks; each task handles a portion of the input data. So while you may create your map 100 times because there are 100 tasks, it will only be created once per JVM. I hope this makes sense. chris On Nov 25, 2008, at 11:46 AM, tim robertson wrote: Hi Doug, Thanks - it is not so much that I want to run in a single JVM - I do want a bunch of machines doing the work; it is just that I want them all to have this in-memory lookup index, configured once per job. Is there some hook somewhere that lets me trigger a read from the distributed cache, or is Mapper.configure() the best place for this? Can it be called multiple times per job, meaning I need to keep some static synchronised indicator flag? Thanks again, Tim On Tue, Nov 25, 2008 at 8:41 PM, Doug Cutting [EMAIL PROTECTED] wrote: tim robertson wrote: Thanks Alex - this will allow me to share the shapefile, but I need to read it, parse it and store the objects in the index only once per job per JVM. Is Mapper.configure() the best place to do this? E.g. will it only be called once per job? In 0.19, with HADOOP-249, all tasks from a job can be run in a single JVM. So, yes, you could access a static cache from Mapper.configure(). Doug -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Problems running TestDFSIO to a non-default directory
Hi Konstantin (et al.); A while ago you gave me the following trick to run TestDFSIO against an output directory other than the default: just use -Dtest.build.data=/output/dir to pass the new directory to the executable. I recall this working, but it is failing now under 0.18.1, and looking at it I can't see how it ever worked. The -D option will set the property on the Java virtual machine which runs as a direct child of /bin/hadoop, but I see no way the property would get set on the mapper virtual machines. Should this still work? Thanks, -Joel On Thu, 2008-09-04 at 13:05 -0700, Konstantin Shvachko wrote: Sure. bin/hadoop -Dtest.build.data=/bessemer/welling/hadoop_test/benchmarks/TestDFSIO/ org.apache.hadoop.fs.TestDFSIO -write -nrFiles 2*N -fileSize 360 --Konst Joel Welling wrote: With my setup, I need to change the file directory from /benchmarks/TestDFSIO/io_control to something like /bessemer/welling/hadoop_test/benchmarks/TestDFSIO/io_control . Is there a command line argument or parameter that will do this? Basically, I have to point it explicitly into my Lustre filesystem. -Joel
64 bit namenode and secondary namenode 32 bit datanode
I am trying to migrate from a 32-bit JVM to 64-bit for the namenode only.

*setup*
NN - 64 bit
Secondary namenode (instance 1) - 64 bit
Secondary namenode (instance 2) - 32 bit
datanode - 32 bit

From the mailing list I deduced that the NN 64-bit and datanode 32-bit combo works. But I am not sure if S-NN (instance 1 -- 64 bit) and S-NN (instance 2 -- 32 bit) will work with this setup. Also, should I be aware of any other issues when migrating over to a 64-bit namenode? Thanks in advance for all the suggestions -Sagar
Re: 64 bit namenode and secondary namenode 32 bit datanode
On 11/25/08 3:58 PM, Sagar Naik [EMAIL PROTECTED] wrote: I am trying to migrate from a 32-bit JVM to 64-bit for the namenode only. *setup* NN - 64 bit Secondary namenode (instance 1) - 64 bit Secondary namenode (instance 2) - 32 bit datanode - 32 bit From the mailing list I deduced that the NN 64-bit and datanode 32-bit combo works Yup. That's how we run it. But I am not sure if S-NN (instance 1 -- 64 bit) and S-NN (instance 2 -- 32 bit) will work with this setup. Considering that the primary and secondary process essentially the same data, they should have the same memory requirements. In other words, if you need 64-bit for the name node, your secondary is going to require it too. I'm also not sure if you can have two secondaries. I'll let someone else chime in on that. :)
Re: 64 bit namenode and secondary namenode 32 bit datanode
I might be wrong, but my assumption is that running the SNN in either 64- or 32-bit mode shouldn't matter. But I am curious how two instances of the secondary namenode are set up; will both of them talk to the same NN and run in parallel? What are the advantages here? Wondering if there are chances of image corruption. Thanks, lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, November 25, 2008 3:58:53 PM Subject: 64 bit namenode and secondary namenode 32 bit datanode I am trying to migrate from a 32-bit JVM to 64-bit for the namenode only. *setup* NN - 64 bit Secondary namenode (instance 1) - 64 bit Secondary namenode (instance 2) - 32 bit datanode - 32 bit From the mailing list I deduced that the NN 64-bit and datanode 32-bit combo works. But I am not sure if S-NN (instance 1 -- 64 bit) and S-NN (instance 2 -- 32 bit) will work with this setup. Also, should I be aware of any other issues when migrating over to a 64-bit namenode? Thanks in advance for all the suggestions -Sagar
Re: 64 bit namenode and secondary namenode 32 bit datanode
lohit wrote: I might be wrong, but my assumption is that running the SNN in either 64- or 32-bit mode shouldn't matter. But I am curious how two instances of the secondary namenode are set up; will both of them talk to the same NN and run in parallel? What are the advantages here? I just have multiple entries in the masters file. I am not aware of image corruption (I did not look into it). I did it for SNN redundancy. Please correct me if I am wrong. Thanks Sagar Wondering if there are chances of image corruption. Thanks, lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, November 25, 2008 3:58:53 PM Subject: 64 bit namenode and secondary namenode 32 bit datanode I am trying to migrate from a 32-bit JVM to 64-bit for the namenode only. *setup* NN - 64 bit Secondary namenode (instance 1) - 64 bit Secondary namenode (instance 2) - 32 bit datanode - 32 bit From the mailing list I deduced that the NN 64-bit and datanode 32-bit combo works. But I am not sure if S-NN (instance 1 -- 64 bit) and S-NN (instance 2 -- 32 bit) will work with this setup. Also, should I be aware of any other issues when migrating over to a 64-bit namenode? Thanks in advance for all the suggestions -Sagar
Re: 64 bit namenode and secondary namenode 32 bit datanode
Well, if I think about it, image corruption might not happen, since each checkpoint initiation would have a unique number. I was just wondering what would happen in this case. Consider this scenario:

Time 1 -- SN1 asks NN for the image and edits to merge
Time 2 -- SN2 asks NN for the image and edits to merge
Time 2 -- SN2 returns a new image
Time 3 -- SN1 returns a new image

I am not sure what happens here, but it's best to test it out before setting up something like this. And if you have multiple entries in the masters file, then one SNN checkpoint would update all NN entries, so a redundant SNN isn't buying you much. Thanks, Lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, November 25, 2008 4:32:26 PM Subject: Re: 64 bit namenode and secondary namenode 32 bit datanode lohit wrote: I might be wrong, but my assumption is that running the SNN in either 64- or 32-bit mode shouldn't matter. But I am curious how two instances of the secondary namenode are set up; will both of them talk to the same NN and run in parallel? What are the advantages here? I just have multiple entries in the masters file. I am not aware of image corruption (I did not look into it). I did it for SNN redundancy. Please correct me if I am wrong. Thanks Sagar Wondering if there are chances of image corruption. Thanks, lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, November 25, 2008 3:58:53 PM Subject: 64 bit namenode and secondary namenode 32 bit datanode I am trying to migrate from a 32-bit JVM to 64-bit for the namenode only. *setup* NN - 64 bit Secondary namenode (instance 1) - 64 bit Secondary namenode (instance 2) - 32 bit datanode - 32 bit From the mailing list I deduced that the NN 64-bit and datanode 32-bit combo works. But I am not sure if S-NN (instance 1 -- 64 bit) and S-NN (instance 2 -- 32 bit) will work with this setup. Also, should I be aware of any other issues when migrating over to a 64-bit namenode? Thanks in advance for all the suggestions -Sagar
Filesystem closed errors
I have an app that runs for a long time with no problems, but when I signal it to shut down, I get errors like this:

java.io.IOException: Filesystem closed
    at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:196)
    at org.apache.hadoop.dfs.DFSClient.rename(DFSClient.java:502)
    at org.apache.hadoop.dfs.DistributedFileSystem.rename(DistributedFileSystem.java:176)

The problems occur when I am trying to close open HDFS files. Any ideas why I might be seeing this? I thought it was because I was abruptly shutting down without giving the streams a chance to get closed, but after some refactoring, that's not the case. -Bryan
Re: Block placement in HDFS
Hi All, I am trying to divide some data into partitions explicitly (like the regions of HBase), and I wonder whether the following way to do it is the best method. For example, if we assume a block size of 64MB, is the file portion corresponding to 0~63MB allocated to the first block? I have three questions: Is the above method valid? Is it the best method? Is there an alternative method? Thanks in advance. -- Hyunsik Choi Database Information Systems Group Dept. of Computer Science Engineering, Korea University On Mon, 2008-11-24 at 20:44 -0800, Mahadev Konar wrote: Hi Dennis, I don't think that is possible to do. The block placement is determined by HDFS internally (which is local, rack local and off rack). mahadev On 11/24/08 6:59 PM, dennis81 [EMAIL PROTECTED] wrote: Hi everyone, I was wondering whether it is possible to control the placement of the blocks of a file in HDFS. Is it possible to instruct HDFS about which nodes will hold the block replicas? Thanks!
HDFS directory listing from the Java API?
Hi all, Can someone please guide me on how to get a directory listing of files on HDFS using the Java API (0.19.0)? Regards, Shane
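A minimal sketch using FileSystem.listStatus(); the directory path is an example:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    Configuration conf = new Configuration();   // picks up hadoop-site.xml
    FileSystem fs = FileSystem.get(conf);
    // list the entries directly under one directory
    for (FileStatus status : fs.listStatus(new Path("/user/shane"))) {
      System.out.println(status.getPath() + (status.isDir() ? " <dir>" : ""));
    }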
Re: Filesystem closed errors
Do you have speculative execution enabled? I've seen error messages like this caused by speculative execution. David Bryan Duxbury wrote: I have an app that runs for a long time with no problems, but when I signal it to shut down, I get errors like this:

java.io.IOException: Filesystem closed
    at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:196)
    at org.apache.hadoop.dfs.DFSClient.rename(DFSClient.java:502)
    at org.apache.hadoop.dfs.DistributedFileSystem.rename(DistributedFileSystem.java:176)

The problems occur when I am trying to close open HDFS files. Any ideas why I might be seeing this? I thought it was because I was abruptly shutting down without giving the streams a chance to get closed, but after some refactoring, that's not the case. -Bryan
Re: Filesystem closed errors
Does your code ever call fs.close()? If so, https://issues.apache.org/jira/browse/HADOOP-4655 might be relevant to your problem. On Nov 25, 2008, at 9:07 PM, David B. Ritch wrote: Do you have speculative execution enabled? I've seen error messages like this caused by speculative execution. David Bryan Duxbury wrote: I have an app that runs for a long time with no problems, but when I signal it to shut down, I get errors like this:

java.io.IOException: Filesystem closed
    at org.apache.hadoop.dfs.DFSClient.checkOpen(DFSClient.java:196)
    at org.apache.hadoop.dfs.DFSClient.rename(DFSClient.java:502)
    at org.apache.hadoop.dfs.DistributedFileSystem.rename(DistributedFileSystem.java:176)

The problems occur when I am trying to close open HDFS files. Any ideas why I might be seeing this? I thought it was because I was abruptly shutting down without giving the streams a chance to get closed, but after some refactoring, that's not the case. -Bryan
How to retrieve rack ID of a datanode
Hi all, I want to retrieve the rack ID of every datanode. How can I do this? I tried using getNetworkLocation() in org.apache.hadoop.hdfs.protocol.DatanodeInfo. I am getting /default-rack as the output for all datanodes. Any advice? Thanks in advance, Ramya
Re: How to retrieve rack ID of a datanode
Ramya R wrote: Hi all, I want to retrieve the rack ID of every datanode. How can I do this? I tried using getNetworkLocation() in org.apache.hadoop.hdfs.protocol.DatanodeInfo. I am getting /default-rack as the output for all datanodes. Have you set up the cluster to be rack-aware? At least in MR we have to provide a script that resolves the rack for a given node. It might be similar for DFS too. See the topology.script.file.name parameter in hadoop-default.xml for more details. Amar Any advice? Thanks in advance, Ramya
Re: 64 bit namenode and secondary namenode 32 bit datanod
The design is such that running multiple secondary namenodes should not corrupt the image (modulo any bugs). Are you seeing image corruption when this happens? You can run all or any daemons in 32-bit mode or 64-bit mode; you can mix and match. If you have many millions of files, then you might want to allocate more than 3GB of heap space to the namenode and secondary namenode. In that case, you will have to run the namenode and secondary namenode using a 64-bit JVM. dhruba On Tue, Nov 25, 2008 at 4:39 PM, lohit [EMAIL PROTECTED] wrote: Well, if I think about it, image corruption might not happen, since each checkpoint initiation would have a unique number. I was just wondering what would happen in this case. Consider this scenario: Time 1 -- SN1 asks NN for the image and edits to merge Time 2 -- SN2 asks NN for the image and edits to merge Time 2 -- SN2 returns a new image Time 3 -- SN1 returns a new image. I am not sure what happens here, but it's best to test it out before setting up something like this. And if you have multiple entries in the masters file, then one SNN checkpoint would update all NN entries, so a redundant SNN isn't buying you much. Thanks, Lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, November 25, 2008 4:32:26 PM Subject: Re: 64 bit namenode and secondary namenode 32 bit datanode lohit wrote: I might be wrong, but my assumption is that running the SNN in either 64- or 32-bit mode shouldn't matter. But I am curious how two instances of the secondary namenode are set up; will both of them talk to the same NN and run in parallel? What are the advantages here? I just have multiple entries in the masters file. I am not aware of image corruption (I did not look into it). I did it for SNN redundancy. Please correct me if I am wrong. Thanks Sagar Wondering if there are chances of image corruption. Thanks, lohit - Original Message From: Sagar Naik [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, November 25, 2008 3:58:53 PM Subject: 64 bit namenode and secondary namenode 32 bit datanode I am trying to migrate from a 32-bit JVM to 64-bit for the namenode only. *setup* NN - 64 bit Secondary namenode (instance 1) - 64 bit Secondary namenode (instance 2) - 32 bit datanode - 32 bit From the mailing list I deduced that the NN 64-bit and datanode 32-bit combo works. But I am not sure if S-NN (instance 1 -- 64 bit) and S-NN (instance 2 -- 32 bit) will work with this setup. Also, should I be aware of any other issues when migrating over to a 64-bit namenode? Thanks in advance for all the suggestions -Sagar
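A sketch of where the heap allocation dhruba mentions is usually made, assuming the HADOOP_HEAPSIZE and per-daemon *_OPTS hooks present in this era's conf/hadoop-env.sh (values illustrative):

    # conf/hadoop-env.sh
    export HADOOP_HEAPSIZE=2000   # MB; default heap for all daemons
    # per-daemon overrides; a later -Xmx on the command line wins
    export HADOOP_NAMENODE_OPTS="-Xmx4g $HADOOP_NAMENODE_OPTS"
    export HADOOP_SECONDARYNAMENODE_OPTS="-Xmx4g $HADOOP_SECONDARYNAMENODE_OPTS"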
RE: How to retrieve rack ID of a datanode
Hi Lohit, I have not set the datanode to tell the namenode which rack it belongs to. Can you please tell me how I do it? Is it using setNetworkLocation()? My intention is to kill the datanodes in a given rack, so it would be useful even if I obtain the subnet each datanode belongs to. Thanks Ramya -Original Message- From: lohit [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 26, 2008 12:26 PM To: core-user@hadoop.apache.org Subject: Re: How to retrieve rack ID of a datanode /default-rack is set when the datanode has not set a rackID. It is up to the datanode to tell the namenode which rack it belongs to. Is your datanode doing that explicitly? -Lohit - Original Message From: Ramya R [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, November 25, 2008 10:36:46 PM Subject: How to retrieve rack ID of a datanode Hi all, I want to retrieve the rack ID of every datanode. How can I do this? I tried using getNetworkLocation() in org.apache.hadoop.hdfs.protocol.DatanodeInfo. I am getting /default-rack as the output for all datanodes. Any advice? Thanks in advance, Ramya
Switching to HBase from HDFS
I have a system which uses HDFS to store files on multiple nodes. On each HDFS node machine I have another application which reads the local files. Until now my system worked only with files; HDFS seemed like the right solution and everything worked fine. Now I need to save additional information for every file. I thought that I might create a central database, and in this database I would create a table which maps file names to the new data. I don't think that this is a good solution, since I will need to query this new data for each file. I thought that since HBase is built on top of HDFS, it might be better to use it instead of a database. With HBase I will have each file together with the new data locally on each node; I can read each file together with any additional information. Since I have never used HBase, I want to ask the community: is HBase the right solution for my case? --Shimi
Re: How to retrieve rack ID of a datanode
hi Ramya, Set up topology.script.file.name in your hadoop-site.xml and the script; check http://hadoop.apache.org/core/docs/current/cluster_setup.html, Hadoop Rack Awareness section. Hi Lohit, I have not set the datanode to tell the namenode which rack it belongs to. Can you please tell me how I do it? Is it using setNetworkLocation()? My intention is to kill the datanodes in a given rack, so it would be useful even if I obtain the subnet each datanode belongs to. Thanks Ramya -Original Message- From: lohit [mailto:[EMAIL PROTECTED] Sent: Wednesday, November 26, 2008 12:26 PM To: core-user@hadoop.apache.org Subject: Re: How to retrieve rack ID of a datanode /default-rack is set when the datanode has not set a rackID. It is up to the datanode to tell the namenode which rack it belongs to. Is your datanode doing that explicitly? -Lohit - Original Message From: Ramya R [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Tuesday, November 25, 2008 10:36:46 PM Subject: How to retrieve rack ID of a datanode Hi all, I want to retrieve the rack ID of every datanode. How can I do this? I tried using getNetworkLocation() in org.apache.hadoop.hdfs.protocol.DatanodeInfo. I am getting /default-rack as the output for all datanodes. Any advice? Thanks in advance, Ramya -- Yi-Kai Tsai (cuma) [EMAIL PROTECTED], Asia Regional Search Engineering.
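For reference, a sketch of the property Yi-Kai mentions; the script path is an example. The script receives datanode IPs or hostnames as arguments and must print one rack path (e.g. /rack1) per argument:

    <!-- conf/hadoop-site.xml -->
    <property>
      <name>topology.script.file.name</name>
      <value>/path/to/rack-map.sh</value>
    </property>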
Re: Switching to HBase from HDFS
Hi Shimi, HBase (or BigTable) is a sparse, distributed, persistent multidimensional sorted map; Jim R. Wilson has an excellent article for understanding it: http://jimbojw.com/wiki/index.php?title=Understanding_HBase_and_BigTable I have a system which uses HDFS to store files on multiple nodes. On each HDFS node machine I have another application which reads the local files. Until now my system worked only with files; HDFS seemed like the right solution and everything worked fine. Now I need to save additional information for every file. I thought that I might create a central database, and in this database I would create a table which maps file names to the new data. I don't think that this is a good solution, since I will need to query this new data for each file. I thought that since HBase is built on top of HDFS, it might be better to use it instead of a database. With HBase I will have each file together with the new data locally on each node; I can read each file together with any additional information. Since I have never used HBase, I want to ask the community: is HBase the right solution for my case? --Shimi -- Yi-Kai Tsai (cuma) [EMAIL PROTECTED], Asia Regional Search Engineering.
How we get an old version of Hadoop
Dear Friends, How do we get an old version of Hadoop? -- Regards, Rashid Ahmad
how can I decommission nodes on-the-fly?
Hi list, I added a property dfs.hosts.exclude to my conf/hadoop-site.xml, then refreshed my cluster with the command bin/hadoop dfsadmin -refreshNodes. It showed that it can only shut down the DataNode process, not the TaskTracker process, on each slave specified in the excludes file. The jobtracker web UI still shows that I had not shut down these nodes. How can I totally decommission these slave nodes on-the-fly? Can it be achieved only by operation on the master node? Thanks, Jeremy -- My research interests are distributed systems, parallel computing and bytecode based virtual machine. http://coderplay.javaeye.com
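For reference, a sketch of the HDFS-side configuration Jeremy describes; the excludes file path is an example, and the file lists one datanode hostname per line:

    <!-- conf/hadoop-site.xml on the namenode -->
    <property>
      <name>dfs.hosts.exclude</name>
      <value>/path/to/conf/excludes</value>
    </property>

After editing the excludes file, bin/hadoop dfsadmin -refreshNodes is run on the namenode, as in Jeremy's message; this decommissions the listed DataNodes only.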
Re: how can I decommission nodes on-the-fly?
Jeremy Chow wrote: Hi list, I added a property dfs.hosts.exclude to my conf/hadoop-site.xml, then refreshed my cluster with the command bin/hadoop dfsadmin -refreshNodes. It showed that it can only shut down the DataNode process, not the TaskTracker process, on each slave specified in the excludes file. Presently, decommissioning a TaskTracker on-the-fly is not available. The jobtracker web UI still shows that I had not shut down these nodes. How can I totally decommission these slave nodes on-the-fly? Can it be achieved only by operation on the master node? I think one way to shut down a TaskTracker is to kill it. Thanks Amareshwari Thanks, Jeremy