Re: hbase and hypertable comparison
Thanks for the clear answer, Andy. The comparison was actually conducted by the Hypertable dev team, so I guess it wasn't all that fair to HBase. I have regained my confidence in HBase once more :) Ed From mp2893's iPhone On 2011. 5. 26., at 12:03 AM, Andrew Purtell apurt...@apache.org wrote: I think I can speak for all of the HBase devs that in our opinion this vendor benchmark was designed by hypertable to demonstrate a specific feature of their system -- autotuning -- in such a way that HBase was, obviously, not tuned. Nobody from the HBase project was consulted on the results or asked to do such tuning, as is common courtesy when running a competitive benchmark, if the goal is a fair test. Furthermore, the benchmark code was not a community-accepted benchmark such as YCSB. I do not think the results are valid beyond being some vendor FUD and do not warrant much comment beyond this. Best regards, - Andy Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White) --- On Wed, 5/25/11, edward choi mp2...@gmail.com wrote: From: edward choi mp2...@gmail.com Subject: hbase and hypertable comparison To: u...@hbase.apache.org, common-user@hadoop.apache.org Date: Wednesday, May 25, 2011, 12:47 AM I'm planning to use a NoSQL distributed database. I did some searching and came across a lot of database systems such as MongoDB, CouchDB, HBase, Cassandra, Hypertable, etc. Since what I'll be doing is frequently reading a varying amount of data, and less frequently writing a massive amount of data, I thought HBase or Hypertable was the way to go. I did some searching on the internet and found a performance comparison between HBase and Hypertable. Obviously HT dominated HBase in every aspect (random read/write and a couple more). But the comparison was made with HBase 0.20.4, and HBase has seen many improvements since then; the current version is 0.90.3. I am curious whether the performance gap between HBase and HT is still large. I am already running Hadoop, so I wanted to go with HBase, but the performance gap was so big that it made me reconsider. Any opinions, please?
Re: Sorting ...
On May 25, 2011 22:15:50 Mark question wrote: I'm using SequenceFileInputFormat, but then what do I write in my mappers? Each mapper takes a split from the sequence file and then sorts its own split?! I don't want that.. Thanks, Mark On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote: On May 25, 2011 01:43:22 Mark question wrote: Thanks Luca, but what other way is there to sort a directory of sequence files? I don't plan to write a sorting algorithm in mappers/reducers, but was hoping to use the SequenceFile.Sorter instead. Any ideas? Mark If you want to achieve a global sort, then look at how TeraSort does it: http://sortbenchmark.org/YahooHadoop.pdf The idea is to partition the data so that all keys in part[i] are <= all keys in part[i+1]. Each partition is individually sorted, so to read the data in globally sorted order you simply have to traverse it starting from the first partition and working your way to the last one. If your keys are already what you want to sort by, then you don't even need a mapper (just use the default identity map). -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
Re: one question about hadoop
Hadoop embeds Jetty directly into the Hadoop servers with the org.apache.hadoop.http.HttpServer class for servlets. For JSP, web.xml is auto-generated with the Jasper compiler during the build phase. The new web framework for MapReduce 2.0 (MAPREDUCE-2399) wraps the Hadoop HttpServer and doesn't need web.xml or JSP support either. On Thu, May 26, 2011 at 12:14 AM, 王晓峰 sanlang2...@gmail.com wrote: Hi, admin: I'm a newcomer from China. I want to know how Jetty is integrated with Hadoop. I can't find the web.xml file that would normally exist in a system that uses Jetty. I'll be very happy to receive your answer. If you have any questions, please feel free to contact me. Best Regards, Jack
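To illustrate the embedding pattern Luke describes (a servlet container started inside the server process, with servlets registered programmatically rather than through a web.xml), here is a rough, hypothetical sketch using the Jetty 6 (org.mortbay) API that Hadoop 0.20 bundles. It is not Hadoop's actual HttpServer code, and the servlet and port are made up for illustration.

import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
import org.mortbay.jetty.Server;
import org.mortbay.jetty.servlet.Context;
import org.mortbay.jetty.servlet.ServletHolder;

public class EmbeddedJettyExample {
  // Hypothetical servlet standing in for Hadoop's status pages.
  public static class StatusServlet extends HttpServlet {
    protected void doGet(HttpServletRequest req, HttpServletResponse resp)
        throws java.io.IOException {
      resp.getWriter().println("ok");
    }
  }

  public static void main(String[] args) throws Exception {
    Server server = new Server(50030);             // port chosen only for illustration
    Context ctx = new Context(server, "/", Context.SESSIONS);
    ctx.addServlet(new ServletHolder(new StatusServlet()), "/status");
    server.start();                                // no web.xml involved
    server.join();
  }
}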
can our problem be handled by hadoop
Hello, we are working on a scientific project to analyze information spread in networks. Our simulations are independent of each other, but we need a large number of runs and we have to collect all the data for the interpretation of results by our reporting tools. So my idea was to use Hadoop as a base, with its distributed filesystem. We could start independent runs on each node of the cluster and at the end collect the data for the calculation of average values. The simulation tool is written in Java and consists of about 2 MB of jar files. Is this a situation where Hadoop can help us? One fact is that we want to parallelize the production of large data sets. Best wishes Mirko
Re: can our problem be handled by hadoop
Hi, seems like the perfect use case for MapReduce, yep. 2011/5/26 Mirko Kämpf mirko.kae...@googlemail.com: Hello, we are working on a scientific project to analyze information spread in networks. Our simulations are independent of each other, but we need a large number of runs and we have to collect all the data for the interpretation of results by our reporting tools. So my idea was to use Hadoop as a base, with its distributed filesystem. We could start independent runs on each node of the cluster and at the end collect the data for the calculation of average values. The simulation tool is written in Java and consists of about 2 MB of jar files. Is this a situation where Hadoop can help us? One fact is that we want to parallelize the production of large data sets. Best wishes Mirko
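To make that concrete, a minimal sketch of how one run could look as a MapReduce job: each input line carries the parameters (here just a seed) for one independent simulation, the mapper calls into the project's existing simulation code, and a single reducer can then average the emitted values. The Simulation class and the one-seed-per-line input layout are assumptions made for illustration, not part of Hadoop.

import java.io.IOException;
import java.util.Random;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// One map() call = one independent simulation run.
public class SimulationMapper
    extends Mapper<LongWritable, Text, Text, DoubleWritable> {

  // Stand-in for the project's own ~2 MB of simulation code (hypothetical).
  static class Simulation {
    static double run(long seed) {
      return new Random(seed).nextDouble();   // placeholder computation
    }
  }

  @Override
  protected void map(LongWritable offset, Text paramLine, Context context)
      throws IOException, InterruptedException {
    long seed = Long.parseLong(paramLine.toString().trim());
    double spread = Simulation.run(seed);
    context.write(new Text("avg-spread"), new DoubleWritable(spread));
  }
}

The simulation jar itself can be shipped to the nodes together with the job (for example via -libjars), and the reduce side then only has to sum and divide to get the averages.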
Re: Comparing
Harsh, Thanks for your response, it was very helpful. There are still a couple of things which are not really clear to me though. You say that keys have got to be compared by the MR framework. But I'm still not 100% sure why keys are sorted. I thought that what Hadoop did during shuffling was choose which keys went to which reducer: for each key/value it checked the key and sent it to the correct node. If that were the case then a good equals implementation could be enough. So why, instead of just *shuffling*, does the MR framework *sort* the items? Also, you were very clear about the use of RawComparator, thank you. Do you know how RawComparable works though? Again, thanks for your help! Cheers, Pony On Thu, May 26, 2011 at 1:58 AM, Harsh J ha...@cloudera.com wrote: Pony, Keys have got to be compared by the MR framework somehow, and the way it does so when you use Writables is by ensuring that your Key is of a Writable + Comparable type (WritableComparable). If you specify a specific comparator class, then that will be used; else the default WritableComparator will get asked if it can supply a comparator for use with your key type. AFAIK, the default WritableComparator wraps around RawComparator and does indeed deserialize the writables before applying the compare operation. The RawComparator's primary idea is to give you a pair of raw byte sequences to compare directly. Certain other serialization libraries (Apache Avro is one) provide ways to compare using the bytes themselves (across different types), which can end up being faster when used in jobs. Hope this clears up your confusion. On Tue, May 24, 2011 at 2:06 AM, Juan P. gordoslo...@gmail.com wrote: Hi guys, I wanted to get your help with a couple of questions which came up while looking at the Hadoop Comparator/Comparable architecture. As I see it, before each reducer operates on its keys, a sorting algorithm is applied to them. *Why does Hadoop need to do that?* If I implement my own class and I intend to use it as a Key, I must allow for instances of my class to be compared. So I have 2 choices: I can implement WritableComparable or I can register a WritableComparator for my class. Should I fail to do either, would the Job fail? If I register my WritableComparator which does not use the Comparable interface at all, does my Key need to implement WritableComparable? If I don't implement my Comparator and my Key implements WritableComparable, does it mean that Hadoop will deserialize my Keys twice? (once for sorting, and once for reducing) What is RawComparable used for? Thanks for your help! Pony -- Harsh J
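A minimal sketch of the pattern Harsh describes: a hypothetical key type that implements WritableComparable, plus a registered comparator that works on the serialized bytes directly, so the framework does not need to deserialize two keys just to order them during the sort.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class IntKey implements WritableComparable<IntKey> {
  private int value;

  public void write(DataOutput out) throws IOException { out.writeInt(value); }
  public void readFields(DataInput in) throws IOException { value = in.readInt(); }

  // Object-level comparison: only used once both keys have been deserialized.
  public int compareTo(IntKey other) {
    return value < other.value ? -1 : (value == other.value ? 0 : 1);
  }

  // Raw comparison on the serialized form: the 4 bytes written by write() above.
  public static class Comparator extends WritableComparator {
    public Comparator() { super(IntKey.class); }
    @Override
    public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
      int left = readInt(b1, s1);
      int right = readInt(b2, s2);
      return left < right ? -1 : (left == right ? 0 : 1);
    }
  }

  // Register the raw comparator so the framework picks it for this key type.
  static { WritableComparator.define(IntKey.class, new Comparator()); }
}

If no raw comparator is registered (and none is set on the job), the framework falls back to the generic WritableComparator behaviour Harsh mentions: deserialize both keys and call compareTo, which still works but costs extra object creation.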
Re: Sorting ...
Also, if you want something that is fairly fast and a lot less dev work to get going, you might want to look at Pig. It can do a distributed order-by that is fairly good. --Bobby Evans On 5/26/11 2:45 AM, Luca Pireddu pire...@crs4.it wrote: On May 25, 2011 22:15:50 Mark question wrote: I'm using SequenceFileInputFormat, but then what do I write in my mappers? Each mapper takes a split from the sequence file and then sorts its own split?! I don't want that.. Thanks, Mark On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote: On May 25, 2011 01:43:22 Mark question wrote: Thanks Luca, but what other way is there to sort a directory of sequence files? I don't plan to write a sorting algorithm in mappers/reducers, but was hoping to use the SequenceFile.Sorter instead. Any ideas? Mark If you want to achieve a global sort, then look at how TeraSort does it: http://sortbenchmark.org/YahooHadoop.pdf The idea is to partition the data so that all keys in part[i] are <= all keys in part[i+1]. Each partition is individually sorted, so to read the data in globally sorted order you simply have to traverse it starting from the first partition and working your way to the last one. If your keys are already what you want to sort by, then you don't even need a mapper (just use the default identity map). -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
Help with pigsetup
I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
Re: Help with pigsetup
I think Jonathan Coveney's reply on user@pig answered your question. Its basically an issue of hadoop version differences between the one Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which is newer. On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) -- Harsh J
Re: Help with pigsetup
For some reason I don't see that reply from Jonathan in my Inbox. I'll try to google it. What should be my next step in that case? I can't use pig then? On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote: I think Jonathan Coveney's reply on user@pig answered your question. Its basically an issue of hadoop version differences between the one Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which is newer. On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) -- Harsh J
Re: Help with pigsetup
I'll repost it here then :) Here is what I had to do to get pig running with a different version of Hadoop (in my case, the cloudera build but I'd try this as well): build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when you run pig, put the pig-withouthadoop.jar on your classpath as well as your hadoop jar. In my case, I found that scripts only worked if I additionally manually registered the antlr jar: register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar; 2011/5/26 Mohit Anchlia mohitanch...@gmail.com For some reason I don't see that reply from Jonathan in my Inbox. I'll try to google it. What should be my next step in that case? I can't use pig then? On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote: I think Jonathan Coveney's reply on user@pig answered your question. Its basically an issue of hadoop version differences between the one Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which is newer. On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 
9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) -- Harsh J
Re: Help with pigsetup
On Thu, May 26, 2011 at 10:06 AM, Jonathan Coveney jcove...@gmail.com wrote: I'll repost it here then :) Here is what I had to do to get pig running with a different version of Hadoop (in my case, the cloudera build but I'd try this as well): build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when you run pig, put the pig-withouthadoop.jar on your classpath as well as your hadoop jar. In my case, I found that scripts only worked if I additionally manually registered the antlr jar: Thanks Jonathan! I will give it a shot. register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar; Is this a windows command? Sorry, have not used this before. 2011/5/26 Mohit Anchlia mohitanch...@gmail.com For some reason I don't see that reply from Jonathan in my Inbox. I'll try to google it. What should be my next step in that case? I can't use pig then? On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote: I think Jonathan Coveney's reply on user@pig answered your question. Its basically an issue of hadoop version differences between the one Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which is newer. On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 
9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) -- Harsh J
Re: Help with pigsetup
I've built pig-withouthadoop.jar and have copied it to my linux box. Now how do I put hadoop-core-0.20.203.0.jar and pig-withouthadoop.jar in the classpath. Is it by using CLASSPATH variable? On Thu, May 26, 2011 at 10:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, May 26, 2011 at 10:06 AM, Jonathan Coveney jcove...@gmail.com wrote: I'll repost it here then :) Here is what I had to do to get pig running with a different version of Hadoop (in my case, the cloudera build but I'd try this as well): build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when you run pig, put the pig-withouthadoop.jar on your classpath as well as your hadoop jar. In my case, I found that scripts only worked if I additionally manually registered the antlr jar: Thanks Jonathan! I will give it a shot. register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar; Is this a windows command? Sorry, have not used this before. 2011/5/26 Mohit Anchlia mohitanch...@gmail.com For some reason I don't see that reply from Jonathan in my Inbox. I'll try to google it. What should be my next step in that case? I can't use pig then? On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote: I think Jonathan Coveney's reply on user@pig answered your question. Its basically an issue of hadoop version differences between the one Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which is newer. On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. 
Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446) -- Harsh J
Re: Help with pigsetup
I added to PIG_CLASSPATH and went past the error but now I get a different error. Looks like I need to add some other jars but not sure which one. export PIG_CLASSPATH=$HADOOP_CONF_DIR:$HADOOP_HOME/hadoop-core-0.20.203.0.jar:$PIG_HOME/../pig-withouthadoop.jar ERROR 2998: Unhandled internal error. org/apache/commons/configuration/Configuration java.lang.NoClassDefFoundError: org/apache/commons/configuration/Configuration at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.init(DefaultMetricsSystem.java:37) at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.clinit(DefaultMetricsSystem.java:34) at org.apache.hadoop.security.UgiInstrumentation.create(UgiInstrumentation.java:51) at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:196) at org.apache.hadoop.security.UserGroupInformation.ensureInitialized(UserGroupInformation.java:159) at org.apache.hadoop.security.UserGroupInformation.isSecurityEnabled(UserGroupInformation.java:216) at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:409) at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:395) at org.apache.hadoop.fs.FileSystem$Cache$Key.init(FileSystem.java:1418) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1319) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:226) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:109) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:196) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:116) at org.apache.pig.impl.PigContext.connect(PigContext.java:187) at org.apache.pig.PigServer.init(PigServer.java:243) at org.apache.pig.PigServer.init(PigServer.java:228) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:46) at org.apache.pig.Main.run(Main.java:484) at org.apache.pig.Main.main(Main.java:108) Caused by: java.lang.ClassNotFoundException: org.apache.commons.configuration.Configuration at java.net.URLClassLoader$1.run(URLClassLoader.java:202) at java.security.AccessController.doPrivileged(Native Method) at java.net.URLClassLoader.findClass(URLClassLoader.java:190) at java.lang.ClassLoader.loadClass(ClassLoader.java:307) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301) at java.lang.ClassLoader.loadClass(ClassLoader.java:248) On Thu, May 26, 2011 at 10:55 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I've built pig-withouthadoop.jar and have copied it to my linux box. Now how do I put hadoop-core-0.20.203.0.jar and pig-withouthadoop.jar in the classpath. Is it by using CLASSPATH variable? On Thu, May 26, 2011 at 10:18 AM, Mohit Anchlia mohitanch...@gmail.com wrote: On Thu, May 26, 2011 at 10:06 AM, Jonathan Coveney jcove...@gmail.com wrote: I'll repost it here then :) Here is what I had to do to get pig running with a different version of Hadoop (in my case, the cloudera build but I'd try this as well): build pig-withouthadoop.jar by running ant jar-withouthadoop. Then, when you run pig, put the pig-withouthadoop.jar on your classpath as well as your hadoop jar. In my case, I found that scripts only worked if I additionally manually registered the antlr jar: Thanks Jonathan! I will give it a shot. register /path/to/pig/build/ivy/lib/Pig/antlr-runtime-3.2.jar; Is this a windows command? 
Sorry, have not used this before. 2011/5/26 Mohit Anchlia mohitanch...@gmail.com For some reason I don't see that reply from Jonathan in my Inbox. I'll try to google it. What should be my next step in that case? I can't use pig then? On Thu, May 26, 2011 at 10:00 AM, Harsh J ha...@cloudera.com wrote: I think Jonathan Coveney's reply on user@pig answered your question. Its basically an issue of hadoop version differences between the one Pig 0.8.1 release got bundled with vs. Hadoop 0.20.203 release which is newer. On Thu, May 26, 2011 at 10:26 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create
java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal
Hello Geeks, I am new to Hadoop and have currently installed hadoop-0.20.203.0. I am running the sample programs that are part of this package but am getting this error. Any pointers to fix this? ~/Hadoop/hadoop-0.20.203.0 788 bin/hadoop jar hadoop-examples-0.20.203.0.jar sort java.lang.NoClassDefFoundError: com.sun.security.auth.UnixPrincipal at org.apache.hadoop.security.UserGroupInformation.clinit(UserGroupInformation.java:246) at java.lang.J9VMInternals.initializeImpl(Native Method) at java.lang.J9VMInternals.initialize(J9VMInternals.java:200) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:449) at org.apache.hadoop.mapred.JobClient.init(JobClient.java:437) at org.apache.hadoop.examples.Sort.run(Sort.java:82) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.examples.Sort.main(Sort.java:187) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:611) at org.apache.hadoop.util.ProgramDriver$ProgramDescription.invoke(ProgramDriver.java:68) at org.apache.hadoop.util.ProgramDriver.driver(ProgramDriver.java:139) at org.apache.hadoop.examples.ExampleDriver.main(ExampleDriver.java:64) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:60) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:37) at java.lang.reflect.Method.invoke(Method.java:611) at org.apache.hadoop.util.RunJar.main(RunJar.java:156) Caused by: java.lang.ClassNotFoundException: com.sun.security.auth.UnixPrincipal at java.net.URLClassLoader.findClass(URLClassLoader.java:434) at java.lang.ClassLoader.loadClass(ClassLoader.java:653) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:358) at java.lang.ClassLoader.loadClass(ClassLoader.java:619) ... 20 more Also, if there is some doc/steps/thread showing how to run a Hello World Hadoop program, please send it; it would be a great help.
Re: Help with pigsetup
I added all the jars in the classpath in HADOOP_HOME/lib and now I get to the grunt prompt. Will try the tutorials and see how it behaves :) Thanks for your help! On Thu, May 26, 2011 at 9:56 AM, Mohit Anchlia mohitanch...@gmail.com wrote: I sent this to pig apache user mailing list but have got no response. Not sure if that list is still active. thought I will post here if someone is able to help me. I am in process of installing and learning pig. I have a hadoop cluster and when I try to run pig in mapreduce mode it errors out: Hadoop version is hadoop-0.20.203.0 and pig version is pig-0.8.1 Error before Pig is launched ERROR 2999: Unexpected internal error. Failed to create DataStorage java.lang.RuntimeException: Failed to create DataStorage at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:75) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:58) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:214) at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.init(HExecutionEngine.java:134) at org.apache.pig.impl.PigContext.connect(PigContext.java:183) at org.apache.pig.PigServer.init(PigServer.java:226) at org.apache.pig.PigServer.init(PigServer.java:215) at org.apache.pig.tools.grunt.Grunt.init(Grunt.java:55) at org.apache.pig.Main.run(Main.java:452) at org.apache.pig.Main.main(Main.java:107) Caused by: java.io.IOException: Call to dsdb1/172.18.60.96:54310 failed on local exception: java.io.EOFException at org.apache.hadoop.ipc.Client.wrapException(Client.java:775) at org.apache.hadoop.ipc.Client.call(Client.java:743) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:220) at $Proxy0.getProtocolVersion(Unknown Source) at org.apache.hadoop.ipc.RPC.getProxy(RPC.java:359) at org.apache.hadoop.hdfs.DFSClient.createRPCNamenode(DFSClient.java:106) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:207) at org.apache.hadoop.hdfs.DFSClient.init(DFSClient.java:170) at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:82) at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:1378) at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:66) at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1390) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:196) at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:95) at org.apache.pig.backend.hadoop.datastorage.HDataStorage.init(HDataStorage.java:72) ... 9 more Caused by: java.io.EOFException at java.io.DataInputStream.readInt(DataInputStream.java:375) at org.apache.hadoop.ipc.Client$Connection.receiveResponse(Client.java:501) at org.apache.hadoop.ipc.Client$Connection.run(Client.java:446)
Re: Sorting ...
Well, I want something like TeraSort but for SequenceFiles instead of lines of text. My goal is efficiency and I'm currently working with Hadoop only. Thanks for your suggestions, Mark On Thu, May 26, 2011 at 8:34 AM, Robert Evans ev...@yahoo-inc.com wrote: Also, if you want something that is fairly fast and a lot less dev work to get going, you might want to look at Pig. It can do a distributed order-by that is fairly good. --Bobby Evans On 5/26/11 2:45 AM, Luca Pireddu pire...@crs4.it wrote: On May 25, 2011 22:15:50 Mark question wrote: I'm using SequenceFileInputFormat, but then what do I write in my mappers? Each mapper takes a split from the sequence file and then sorts its own split?! I don't want that.. Thanks, Mark On Wed, May 25, 2011 at 2:09 AM, Luca Pireddu pire...@crs4.it wrote: On May 25, 2011 01:43:22 Mark question wrote: Thanks Luca, but what other way is there to sort a directory of sequence files? I don't plan to write a sorting algorithm in mappers/reducers, but was hoping to use the SequenceFile.Sorter instead. Any ideas? Mark If you want to achieve a global sort, then look at how TeraSort does it: http://sortbenchmark.org/YahooHadoop.pdf The idea is to partition the data so that all keys in part[i] are <= all keys in part[i+1]. Each partition is individually sorted, so to read the data in globally sorted order you simply have to traverse it starting from the first partition and working your way to the last one. If your keys are already what you want to sort by, then you don't even need a mapper (just use the default identity map). -- Luca Pireddu CRS4 - Distributed Computing Group Loc. Pixina Manna Edificio 1 Pula 09010 (CA), Italy Tel: +39 0709250452
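For reference, a rough sketch of a TeraSort-style total-order sort over SequenceFiles using the old (0.20) mapred API's TotalOrderPartitioner and InputSampler. It assumes Text keys and values; the paths, reducer count, and sampling parameters are placeholders, and depending on the exact version you may also need to ship the partition file to tasks via the DistributedCache.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.hadoop.mapred.SequenceFileOutputFormat;
import org.apache.hadoop.mapred.lib.IdentityMapper;
import org.apache.hadoop.mapred.lib.IdentityReducer;
import org.apache.hadoop.mapred.lib.InputSampler;
import org.apache.hadoop.mapred.lib.TotalOrderPartitioner;

public class SeqFileTotalSort {
  public static void main(String[] args) throws Exception {
    JobConf job = new JobConf(new Configuration(), SeqFileTotalSort.class);
    job.setJobName("seqfile-total-sort");

    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputFormat(SequenceFileOutputFormat.class);
    job.setOutputKeyClass(Text.class);           // assumed key/value types
    job.setOutputValueClass(Text.class);
    job.setMapperClass(IdentityMapper.class);    // keys are already the sort field
    job.setReducerClass(IdentityReducer.class);
    job.setNumReduceTasks(8);                    // number of sorted output partitions

    FileInputFormat.setInputPaths(job, new Path("/data/seqfiles"));
    FileOutputFormat.setOutputPath(job, new Path("/data/seqfiles-sorted"));

    // Sample the input to choose partition boundaries, so that all keys in
    // part[i] are <= all keys in part[i+1]; each reducer then sorts its part.
    Path partitionFile = new Path("/tmp/seqfile-sort-partitions");
    TotalOrderPartitioner.setPartitionFile(job, partitionFile);
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10));
    job.setPartitionerClass(TotalOrderPartitioner.class);

    JobClient.runJob(job);
  }
}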
Re: one question about hadoop
web.xml is in: hadoop-releaseNo/webapps/job/WEB-INF/web.xml Mark On Thu, May 26, 2011 at 1:29 AM, Luke Lu l...@vicaya.com wrote: Hadoop embeds Jetty directly into the Hadoop servers with the org.apache.hadoop.http.HttpServer class for servlets. For JSP, web.xml is auto-generated with the Jasper compiler during the build phase. The new web framework for MapReduce 2.0 (MAPREDUCE-2399) wraps the Hadoop HttpServer and doesn't need web.xml or JSP support either. On Thu, May 26, 2011 at 12:14 AM, 王晓峰 sanlang2...@gmail.com wrote: Hi, admin: I'm a newcomer from China. I want to know how Jetty is integrated with Hadoop. I can't find the web.xml file that would normally exist in a system that uses Jetty. I'll be very happy to receive your answer. If you have any questions, please feel free to contact me. Best Regards, Jack
No. of Map and reduce tasks
How can I tell how the map and reduce tasks were spread across the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand, it's based on the number of input files passed to Hadoop. So if I have 4 files there will be 4 map tasks launched, and the reducers depend on the hash partitioner.
How to debug why I don't get hadoop logs?
Hello, I'm running Nutch on a Hadoop cluster, but unfortunately under hadoop_home/logs I don't find datanode logs, only a jobtracker log. I haven't modified Nutch's log4j.properties nor Hadoop's. On the console I get mapred.JobClient output printed, and also the Nutch output that the Nutch classes log directly before running as a job. -- Regards, K. Gabriele
Re: No. of Map and reduce tasks
Hi Mohit, No. of maps - it depends on the total file size / block size. No. of reducers - you can specify it. Regards, Jagaran From: Mohit Anchlia mohitanch...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 2:48:20 PM Subject: No. of Map and reduce tasks How can I tell how the map and reduce tasks were spread across the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand, it's based on the number of input files passed to Hadoop. So if I have 4 files there will be 4 map tasks launched, and the reducers depend on the hash partitioner.
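A small illustration of the difference in a job driver (0.20-era API; the numbers and the property name are only for illustration, and mapred.map.tasks is a hint rather than a hard setting):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TaskCountDemo {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Map tasks are derived from the input splits (roughly total input size /
    // block size, and at least one per file); this property is only a hint.
    conf.setInt("mapred.map.tasks", 16);

    Job job = new Job(conf, "task-count-demo");
    // The number of reduce tasks, by contrast, is entirely yours to choose.
    job.setNumReduceTasks(8);

    System.out.println("configured reduces = " + job.getNumReduceTasks());
  }
}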
web site doc link broken
The Hadoop Common home page: http://hadoop.apache.org/common/ has a broken link (Learn About) to the docs. It tries to use: http://hadoop.apache.org/common/docs/stable/ which doesn't exist (404). It should probably be: http://hadoop.apache.org/common/docs/current/ Or, someone has deleted the stable docs, which I can't help you with. :-) Thanks.
Re: No. of Map and reduce tasks
I ran a simple pig script on this file: -rw-r--r-- 1 root root 208348 May 26 13:43 excite-small.log that orders the contents by name. But it only created one mapper. How can I change this to distribute across multiple machines? On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi Mohit, No. of maps - it depends on the total file size / block size. No. of reducers - you can specify it. Regards, Jagaran From: Mohit Anchlia mohitanch...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 2:48:20 PM Subject: No. of Map and reduce tasks How can I tell how the map and reduce tasks were spread across the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand, it's based on the number of input files passed to Hadoop. So if I have 4 files there will be 4 map tasks launched, and the reducers depend on the hash partitioner.
Re: No. of Map and reduce tasks
have more data for it to process :) On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote: I ran a simple pig script on this file: -rw-r--r-- 1 root root 208348 May 26 13:43 excite-small.log that orders the contents by name. But it only created one mapper. How can I change this to distribute accross multiple machines? On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi Mohit, No of Maps - It depends on what is the Total File Size / Block Size No of Reducers - You can specify. Regards, Jagaran From: Mohit Anchlia mohitanch...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 2:48:20 PM Subject: No. of Map and reduce tasks How can I tell how the map and reduce tasks were spread accross the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand is that it's based on the number of input files passed to hadoop. So if I have 4 files there will be 4 Map taks that will be launced and reducer is dependent on the hashpartitioner.
Re: No. of Map and reduce tasks
I think I understand that by last 2 replies :) But my question is can I change this configuration to say split file into 250K so that multiple mappers can be invoked? On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote: have more data for it to process :) On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote: I ran a simple pig script on this file: -rw-r--r-- 1 root root 208348 May 26 13:43 excite-small.log that orders the contents by name. But it only created one mapper. How can I change this to distribute accross multiple machines? On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi Mohit, No of Maps - It depends on what is the Total File Size / Block Size No of Reducers - You can specify. Regards, Jagaran From: Mohit Anchlia mohitanch...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 2:48:20 PM Subject: No. of Map and reduce tasks How can I tell how the map and reduce tasks were spread accross the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand is that it's based on the number of input files passed to hadoop. So if I have 4 files there will be 4 Map taks that will be launced and reducer is dependent on the hashpartitioner.
Re: No. of Map and reduce tasks
Set the input split size really low and you might get something. I'd rather you fire up some *nix commands, pack that file together onto itself a bunch of times, then put it back into HDFS and let 'er rip. Sent from my mobile. Please excuse the typos. On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I think I understand that by the last 2 replies :) But my question is: can I change this configuration to, say, split the file into 250K chunks so that multiple mappers can be invoked? On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote: have more data for it to process :) On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote: I ran a simple pig script on this file: -rw-r--r-- 1 root root 208348 May 26 13:43 excite-small.log that orders the contents by name. But it only created one mapper. How can I change this to distribute across multiple machines? On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi Mohit, No. of maps - it depends on the total file size / block size. No. of reducers - you can specify it. Regards, Jagaran From: Mohit Anchlia mohitanch...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 2:48:20 PM Subject: No. of Map and reduce tasks How can I tell how the map and reduce tasks were spread across the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand, it's based on the number of input files passed to Hadoop. So if I have 4 files there will be 4 map tasks launched, and the reducers depend on the hash partitioner.
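If you do want to force more splits for a plain MapReduce job (for experiments only; tiny splits hurt real workloads), the new-API FileInputFormat lets you cap the split size. A minimal sketch; whether Pig's loader honors the equivalent property depends on the Pig version, so treat that part as an assumption:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SmallSplitsDemo {
  public static void main(String[] args) throws Exception {
    Job job = new Job(new Configuration(), "small-splits-demo");
    // Cap each input split at ~250 KB, so a ~200 KB file still yields one map
    // but a few MB of input would fan out across several mappers.
    FileInputFormat.setMaxInputSplitSize(job, 250L * 1024);
    // The equivalent raw property in 0.20-era Hadoop is mapred.max.split.size.
  }
}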
Unable to start hadoop-0.20.2 but able to start hadoop-0.20.203 cluster
Hi Folks, We try to get hbase and hadoop running on clusters, take 2 Solaris servers for now. Because of the incompatibility issue between hbase and hadoop, we have to stick with hadoop 0.20.2-append release. It is very straight forward to make hadoop-0.20.203 running, but stuck for several days with hadoop-0.20.2, even the official release, not the append version. 1. Once try to run start-mapred.sh(hadoop-daemon.sh --config $HADOOP_CONF_DIR start jobtracker), following errors shown in namenode and jobtracker logs: 2011-05-26 12:30:29,169 WARN org.apache.hadoop.hdfs.server.namenode.FSNamesystem: Not able to place enough replicas, still in need of 1 2011-05-26 12:30:29,175 INFO org.apache.hadoop.ipc.Server: IPC Server handler 4 on 9000, call addBlock(/tmp/hadoop-cfadm/mapred/system/jobtracker.info, DFSCl ient_2146408809) from 169.193.181.212:55334: error: java.io.IOException: File /tmp/hadoop-cfadm/mapred/system/jobtracker.info could only be replicated to 0 n odes, instead of 1 java.io.IOException: File /tmp/hadoop-cfadm/mapred/system/jobtracker.info could only be replicated to 0 nodes, instead of 1 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1271) at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:422) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:597) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:508) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:959) at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:955) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:396) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:953) 2. Also, Configured Capacity is 0, cannot put any file to HDFS. 3. in datanode server, no error in logs, but tasktracker logs has the following suspicious thing: 2011-05-25 23:36:10,839 INFO org.apache.hadoop.ipc.Server: IPC Server Responder: starting 2011-05-25 23:36:10,839 INFO org.apache.hadoop.ipc.Server: IPC Server listener on 41904: starting 2011-05-25 23:36:10,852 INFO org.apache.hadoop.ipc.Server: IPC Server handler 0 on 41904: starting 2011-05-25 23:36:10,853 INFO org.apache.hadoop.ipc.Server: IPC Server handler 1 on 41904: starting 2011-05-25 23:36:10,853 INFO org.apache.hadoop.ipc.Server: IPC Server handler 2 on 41904: starting 2011-05-25 23:36:10,853 INFO org.apache.hadoop.ipc.Server: IPC Server handler 3 on 41904: starting . 2011-05-25 23:36:10,855 INFO org.apache.hadoop.ipc.Server: IPC Server handler 63 on 41904: starting 2011-05-25 23:36:10,950 INFO org.apache.hadoop.mapred.TaskTracker: TaskTracker up at: localhost/127.0.0.1:41904 2011-05-25 23:36:10,950 INFO org.apache.hadoop.mapred.TaskTracker: Starting tracker tracker_loanps3d:localhost/127.0.0.1:41904 I have tried all suggestions found so far, including 1) remove hadoop-name and hadoop-data folders and reformat namenode; 2) clean up all temp files/folders under /tmp; But nothing works. Your help is greatly appreciated. Thanks, RX
Re: No. of Map and reduce tasks
If you give really low size files, then the use of Big Block Size of Hadoop goes away. Instead try merging files. Hope that helps From: James Seigel ja...@tynt.com To: common-user@hadoop.apache.org common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 6:04:07 PM Subject: Re: No. of Map and reduce tasks Set input split size really low, you might get something. I'd rather you fire up some nix commands and pack together that file onto itself a bunch if times and the put it back into hdfs and let 'er rip Sent from my mobile. Please excuse the typos. On 2011-05-26, at 4:56 PM, Mohit Anchlia mohitanch...@gmail.com wrote: I think I understand that by last 2 replies :) But my question is can I change this configuration to say split file into 250K so that multiple mappers can be invoked? On Thu, May 26, 2011 at 3:41 PM, James Seigel ja...@tynt.com wrote: have more data for it to process :) On 2011-05-26, at 4:30 PM, Mohit Anchlia wrote: I ran a simple pig script on this file: -rw-r--r-- 1 root root 208348 May 26 13:43 excite-small.log that orders the contents by name. But it only created one mapper. How can I change this to distribute accross multiple machines? On Thu, May 26, 2011 at 3:08 PM, jagaran das jagaran_...@yahoo.co.in wrote: Hi Mohit, No of Maps - It depends on what is the Total File Size / Block Size No of Reducers - You can specify. Regards, Jagaran From: Mohit Anchlia mohitanch...@gmail.com To: common-user@hadoop.apache.org Sent: Thu, 26 May, 2011 2:48:20 PM Subject: No. of Map and reduce tasks How can I tell how the map and reduce tasks were spread accross the cluster? I looked at the jobtracker web page but can't find that info. Also, can I specify how many map or reduce tasks I want to be launched? From what I understand is that it's based on the number of input files passed to hadoop. So if I have 4 files there will be 4 Map taks that will be launced and reducer is dependent on the hashpartitioner.
Re: Are hadoop fs commands serial or parallel
Hi guys, Another related question: when you do hadoop fs -copyFromLocal or use the API to call fs.write(), does it write to the local filesystem first before writing to HDFS? I read that it writes to the local filesystem until the block size is reached and then writes to HDFS. Wouldn't the HDFS client choke, if it writes to the local filesystem, when multiple such fs -copyFromLocal commands are running? I thought that at least with fs.write(), if you provide a byte array, it should not write to the local filesystem? Could somebody explain how fs -copyFromLocal and fs.write() work? Do they write to the local filesystem before the block size is reached and then write to HDFS, or write directly to HDFS? Thanks in advance, -JJ On Wed, May 18, 2011 at 9:39 AM, Patrick Angeles patr...@cloudera.com wrote: kinda clunky but you could do this via shell: for $FILE in $LIST_OF_FILES ; do hadoop fs -copyFromLocal $FILE $DEST_PATH done If doing this via the Java API, then, yes you will have to use multiple threads. On Wed, May 18, 2011 at 1:04 AM, Mapred Learn mapred.le...@gmail.com wrote: Thanks Harsh ! That means basically both the APIs as well as the hadoop client commands allow only serial writes. I was wondering what other ways there could be to write data in parallel to HDFS, other than using multiple parallel threads. Thanks, JJ Sent from my iPhone On May 17, 2011, at 10:59 PM, Harsh J ha...@cloudera.com wrote: Hello, Adding to Joey's response, copyFromLocal's current implementation is serial given a list of files. On Wed, May 18, 2011 at 9:57 AM, Mapred Learn mapred.le...@gmail.com wrote: Thanks Joey ! I will try to find out about copyFromLocal. Looks like the Hadoop APIs write serially, as you pointed out. Thanks, -JJ On May 17, 2011, at 8:32 PM, Joey Echeverria j...@cloudera.com wrote: The sequence file writer definitely does it serially, as you can only ever write to the end of a file in Hadoop. Doing copyFromLocal could write multiple files in parallel (I'm not sure if it does or not), but a single file would be written serially. -Joey On Tue, May 17, 2011 at 5:44 PM, Mapred Learn mapred.le...@gmail.com wrote: Hi, My question is: when I run a command from an hdfs client, e.g. hadoop fs -copyFromLocal, or create a sequence file writer in Java code and append key/values to it through the Hadoop APIs, does it internally transfer/write data to HDFS serially or in parallel ? Thanks in advance, -JJ -- Joseph Echeverria Cloudera, Inc. 443.305.9434 -- Harsh J
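On the parallelism point: a single copyFromLocal, like a single SequenceFile.Writer, writes one file serially, so the usual way to get parallelism from one client is simply to upload several files at once. A minimal sketch of Patrick's shell loop done with threads and the FileSystem API; the destination path is a placeholder.

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ParallelUpload {
  public static void main(String[] args) throws Exception {
    final Configuration conf = new Configuration();
    final Path dest = new Path("/user/jj/incoming");      // placeholder target dir
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (final String local : args) {                     // local files to upload
      pool.submit(new Runnable() {
        public void run() {
          try {
            FileSystem fs = FileSystem.get(conf);
            // Each individual file is still written serially through the pipeline.
            fs.copyFromLocalFile(new Path(local), dest);
          } catch (Exception e) {
            e.printStackTrace();
          }
        }
      });
    }
    pool.shutdown();   // lets queued uploads finish, then the pool's threads exit
  }
}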