Re: Basic code organization questions + scheduling
Hi Alex (and others).

> You should take a look at Nutch. It's a search engine built on Lucene, though it can be set up on top of Hadoop. Take a look:

This didn't help me much. Although the description I gave of the basic flow of the app seems to be close to what Nutch is doing (and I've been looking at the Nutch code), the questions are more general and not related to indexing as such, but about code organization. If someone has more input on those, feel free to add it.

On Mon, Sep 8, 2008 at 2:54 AM, Tarjei Huse [EMAIL PROTECTED] wrote:

Hi, I'm planning to use Hadoop for a set of typical crawler/indexer tasks. The basic flow is:

    input: array of urls
    actions:
      1. get pages
      2. extract new urls from pages  -> start new job
         extract text                 -> index / filter (as new jobs)

What I'm considering is how I should build this application to fit into the map/reduce context. I'm thinking that steps 1 and 2 should be separate map/reduce tasks that then pipe things on to the next step. This is where I am a bit at a loss to see how it is smart to organize the code in logical units, and also how to spawn new tasks when an old one is over.

Is the usual way to control the flow of a set of tasks to have an external application running that listens for jobs ending via the endNotificationUri and then spawns new tasks, or should the job itself contain code to create new jobs? Would it be a good idea to use Cascading here?

I'm also considering how I should do job scheduling (I have a lot of recurring tasks). Has anyone found a good framework for job control of recurring tasks, or should I plan to build my own using Quartz?

Any tips/best practices with regard to the issues described above are most welcome. Feel free to ask further questions if you find my descriptions of the issues lacking.

Kind regards, Tarjei

Kind regards, Tarjei
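For the external-scheduler approach mentioned above: the end-of-job callback corresponds to the job.end.notification.url configuration property, into which Hadoop substitutes $jobId and $jobStatus when the job finishes. A sketch of the hadoop-site.xml entry, with a hypothetical scheduler endpoint:

```xml
<!-- Hypothetical endpoint; Hadoop fills in $jobId and $jobStatus on completion -->
<property>
  <name>job.end.notification.url</name>
  <value>http://scheduler.example.com/notify?jobid=$jobId&amp;status=$jobStatus</value>
</property>
```

The external application behind that URL can then decide which follow-up job to submit, which keeps the flow-control logic out of the jobs themselves.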
RE: task assignment management.
How about just specifying machines to run the task on? I haven't seen that anywhere.

-----Original Message-----
From: Devaraj Das [mailto:[EMAIL PROTECTED]]
Sent: Sunday, September 07, 2008 9:55 PM
To: core-user@hadoop.apache.org
Subject: Re: task assignment management.

No, that is not possible today. However, you might want to look at the TaskScheduler to see if you can implement a scheduler that provides this kind of task scheduling. One point regarding computationally intensive tasks in current Hadoop: if a machine is not able to keep up with the rest (and the task on that machine is running slower than the others), speculative execution, if enabled, can help a lot. Also, implicitly, faster/better machines get more work than slower machines.

On 9/8/08 3:27 AM, Dmitry Pushkarev [EMAIL PROTECTED] wrote:

Dear Hadoop users, is it possible, without using Java, to manage task assignment and implement some simple rules? For example: do not launch more than one instance of a crawling task on a machine, do not run data-intensive tasks on remote machines, and do not run computationally intensive tasks on single-core machines, etc. Right now it's done by failing tasks that decided to run on a wrong machine, but I hope to find some solution on the jobtracker side.

---
Dmitry
Re: HBase, Hive, Pig and other Hadoop based technologies
Both Pig and Hive are written on top of Hadoop, is that correct? Does this mean, for instance, that they would work with any implementation of the Hadoop File System interface, whether it is HDFS, KFS, LocalFS, or any future implementation? Is that true for HBase as well? I am assuming Pig and Hive use the Hadoop standard M/R framework and API, is that right?

Thanks, Naama

On Wed, Sep 3, 2008 at 3:04 PM, Naama Kraus [EMAIL PROTECTED] wrote:

Hi, there are various technologies on top of Hadoop such as HBase, Hive, Pig and more. I was wondering what the differences between them are, and what usage scenarios fit each of them. For instance, is it true to say that Pig and Hive belong to the same family? Or is Hive closer to HBase? My understanding is that HBase allows direct lookup and low-latency queries, while Pig and Hive provide batch processing operations which are M/R based. Both define a data model and an SQL-like query language. Is this true? Could anyone shed light on when to use each technology? Main differences? Pros and cons? Information on other technologies such as Jaql is also welcome.

Thanks, Naama

--
oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo 00 oo
If you want your children to be intelligent, read them fairy tales. If you want them to be more intelligent, read them more fairy tales. (Albert Einstein)
Re: Getting started questions
Dennis Kubes wrote:

> John Howland wrote: I've been reading up on Hadoop for a while now and I'm excited that I'm finally getting my feet wet with the examples + my own variations. If anyone could answer any of the following questions, I'd greatly appreciate it.
>
> 1. I'm processing document collections, with the number of documents ranging from 10,000 to 10,000,000. What is the best way to store this data for effective processing?

AFAIK Hadoop doesn't do well with (although it can handle) a large number of small files. So it would be better to read in the documents and store them in SequenceFile or MapFile format. This would be similar to the way the Fetcher works in Nutch. 10M documents in a sequence/map file on DFS is comparatively small and can be handled efficiently.

> - The bodies of the documents usually range from 1K-100KB in size, but some outliers can be as big as 4-5GB.

I would say store your document objects as Text objects. I'm not sure if Text has a max size; I think it does, but I'm not sure what it is. If it does, you can always store them as a BytesWritable, which is just an array of bytes. But you are going to have memory issues reading in and writing out records that large.

> - I will also need to store some metadata for each document which I figure could be stored as JSON or XML.
> - I'll typically filter on the metadata and then do standard operations on the bodies, like word frequency and searching.

It is possible to create an OutputFormat that writes out multiple files. You could also use a MapWritable as the value to store the document and associated metadata.

> Is there a canned FileInputFormat that makes sense? Should I roll my own? How can I access the bodies as streams so I don't have to read them into RAM all at once?

A Writable is read into RAM, so even treating it like a stream doesn't get around that. One thing you might want to consider is to tar up, say, X documents at a time and store that as a file in DFS. You would have many of these files. Then have an index that has the offsets of the files and their keys (document ids). That index can be passed as input into an MR job that can then go to DFS and stream out each file as you need it. The job will be slower done this way, but it is a solution to handling such large documents as streams.

> Am I right in thinking that I should treat each document as a record and map across them, or do I need to be more creative in what I'm mapping across?
>
> 2. Some of the tasks I want to run are pure map operations (no reduction), where I'm calculating new metadata fields on each document. To end up with a good result set, I'll need to copy the entire input record + new fields into another set of output files. Is there a better way? I haven't wanted to go down the HBase road because it can't handle very large values (for the bodies), and it seems to make the most sense to keep the document bodies together with the metadata, to allow for the greatest locality of reference on the datanodes.

If you don't specify a reducer, the IdentityReducer is run, which simply passes through output. One can set the number of reducers to zero, and the reduce phase will not take place.

> 3. I'm sure this is not a new idea, but I haven't seen anything regarding it... I'll need to run several MR jobs as a pipeline... is there any way for the map tasks in a subsequent stage to begin processing data from a previous stage's reduce task before that reducer has fully finished?

Yup, just use FileOutputFormat.getOutputPath(previousJobConf);

Dennis

> Whatever insight folks could lend me would be a big help in crossing the chasm from the Word Count and associated examples to something more real. A whole heap of thanks in advance, John
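Dennis's tar-plus-index suggestion can be sketched in plain Java against the local filesystem. This is only an illustration of the idea (a real version would write the pack file to DFS and persist the index; the class and method names here are invented):

```java
import java.io.*;
import java.util.*;

// Sketch: concatenate document bodies into one pack file and record each
// document's (offset, length) in an index keyed by document id. An MR job
// could take the index as input and seek/stream each body on demand instead
// of materializing a multi-GB record in RAM.
class DocPack {
    static class Entry {
        final long offset;
        final int length;
        Entry(long offset, int length) { this.offset = offset; this.length = length; }
    }

    private final File packFile;
    private final Map<String, Entry> index = new LinkedHashMap<>();

    DocPack(File packFile) { this.packFile = packFile; }

    // Append each document body to the pack, remembering where it starts.
    void write(Map<String, byte[]> docs) throws IOException {
        try (FileOutputStream out = new FileOutputStream(packFile)) {
            long offset = 0;
            for (Map.Entry<String, byte[]> d : docs.entrySet()) {
                out.write(d.getValue());
                index.put(d.getKey(), new Entry(offset, d.getValue().length));
                offset += d.getValue().length;
            }
        }
    }

    // Seek straight to one document without reading the rest of the pack.
    byte[] read(String docId) throws IOException {
        Entry e = index.get(docId);
        byte[] buf = new byte[e.length];
        try (RandomAccessFile raf = new RandomAccessFile(packFile, "r")) {
            raf.seek(e.offset);
            raf.readFully(buf);
        }
        return buf;
    }
}
```

The same offset/length pairs, emitted as MR input records, are what lets each map task open only the byte range it needs.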
Re: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
There's a JIRA on this already: https://issues.apache.org/jira/browse/HADOOP-3831

Setting dfs.datanode.socket.write.timeout=0 in hadoop-site.xml seems to do the trick for now.

Espen

On Mon, Sep 8, 2008 at 11:24 AM, Espen Amble Kolstad [EMAIL PROTECTED] wrote:

Hi, thanks for the tip! I tried revision 692572 of the 0.18 branch, but I still get the same errors.

On Sunday 07 September 2008 09:42:43 Dhruba Borthakur wrote:

The DFS errors might have been caused by http://issues.apache.org/jira/browse/HADOOP-4040
thanks, dhruba

On Sat, Sep 6, 2008 at 6:59 AM, Devaraj Das [EMAIL PROTECTED] wrote:

These exceptions are apparently coming from the DFS side of things. Could someone from the DFS side please look at these?

On 9/5/08 3:04 PM, Espen Amble Kolstad [EMAIL PROTECTED] wrote:

Hi, thanks! The patch applies without change to hadoop-0.18.0 and should be included in 0.18.1. However, I'm still seeing:

in hadoop.log:
2008-09-05 11:13:54,805 WARN dfs.DFSClient - Exception while reading from blk_3428404120239503595_2664 of /user/trank/segments/20080905102650/crawl_generate/part-00010 from somehost:50010: java.io.IOException: Premeture EOF from inputStream

in datanode.log:
2008-09-05 11:15:09,554 WARN dfs.DataNode - DatanodeRegistration(somehost:50010, storageID=DS-751763840-somehost-50010-1219931304453, infoPort=50075, ipcPort=50020): Got exception while serving blk_-4682098638573619471_2662 to /somehost: java.net.SocketTimeoutException: 48 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/somehost:50010 remote=/somehost:45244]

These entries in datanode.log happen a few minutes apart, repeatedly. I've reduced the number of map tasks so load on this node is below 1.0, with 5GB of free memory (so it's not resource starvation).

Espen

On Thu, Sep 4, 2008 at 3:33 PM, Devaraj Das [EMAIL PROTECTED] wrote:

"I started a profile of the reduce task. I've attached the profiling output. It seems from the samples that ramManager.waitForDataToMerge() doesn't actually wait. Has anybody seen this behavior?"

This has been fixed in HADOOP-3940.

On 9/4/08 6:36 PM, Espen Amble Kolstad [EMAIL PROTECTED] wrote:

I have the same problem on our cluster. It seems the reducer tasks are using all CPU long before there's anything to shuffle. I started a profile of the reduce task; I've attached the profiling output. It seems from the samples that ramManager.waitForDataToMerge() doesn't actually wait. Has anybody seen this behavior?

Espen

On Thursday 28 August 2008 06:11:42 wangxu wrote:

Hi, all. I am using hadoop-0.18.0-core.jar and nutch-2008-08-18_04-01-55.jar, running Hadoop with one namenode and 4 slaves. Attached is my hadoop-site.xml; I didn't change the file hadoop-default.xml. When the data in segments is large, this kind of error occurs:

java.io.IOException: Could not obtain block: blk_-2634319951074439134_1129 file=/user/root/crawl_debug/segments/20080825053518/content/part-00002/data
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.chooseDataNode(DFSClient.java:1462)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.blockSeekTo(DFSClient.java:1312)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.read(DFSClient.java:1417)
        at java.io.DataInputStream.readFully(DataInputStream.java:178)
        at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
        at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
        at org.apache.hadoop.io.SequenceFile$Reader.readBuffer(SequenceFile.java:1646)
        at org.apache.hadoop.io.SequenceFile$Reader.seekToCurrentValue(SequenceFile.java:1712)
        at org.apache.hadoop.io.SequenceFile$Reader.getCurrentValue(SequenceFile.java:1787)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.getCurrentValue(SequenceFileRecordReader.java:104)
        at org.apache.hadoop.mapred.SequenceFileRecordReader.next(SequenceFileRecordReader.java:79)
        at org.apache.hadoop.mapred.join.WrappedRecordReader.next(WrappedRecordReader.java:112)
        at org.apache.hadoop.mapred.join.WrappedRecordReader.accept(WrappedRecordReader.java:130)
        at org.apache.hadoop.mapred.join.CompositeRecordReader.fillJoinCollector(CompositeRecordReader.java:398)
        at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:56)
        at org.apache.hadoop.mapred.join.JoinRecordReader.next(JoinRecordReader.java:33)
        at org.apache.hadoop.mapred.MapTask$TrackedRecordReader.next(MapTask.java:165)
        at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:45)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2209)

How can I correct this? Thanks.

Xu
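For reference, the workaround above goes into hadoop-site.xml like this (a value of 0 disables the datanode write timeout entirely, so slow clients can hold connections open indefinitely):

```xml
<property>
  <name>dfs.datanode.socket.write.timeout</name>
  <value>0</value>
</property>
```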
Re: critical name node problem
Allen Wittenauer wrote:

On 9/5/08 5:53 AM, Andreas Kostyrka [EMAIL PROTECTED] wrote:

Another idea would be a tool or namenode startup mode that would make it ignore EOFExceptions, to recover as much of the edits as possible.

We clearly need to change the how-to-configure docs to make sure people put at least two directories, on two different storage systems, in dfs.name.dir. This problem seems to happen quite often, and having two or more dirs helps protect against it. We recently had one of the disks holding one of our copies go bad; the system kept going just fine until we had a chance to reconfigure the namenode. That said, I've just filed HADOOP-4080 to help alert admins in these situations.

...that, and HADOOP-4081. Apache Axis has this production/development switch; in development mode it sends stack traces over the wire and is generally more forgiving. By default it assumes you are in production rather than development, so you have to explicitly flip the switch to get slightly reduced security. Hadoop could have something similar, where if the production flag is set, the cluster would simply refuse to come up if it felt the configuration wasn't robust enough.

-steve
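The two-directories advice above looks like this in hadoop-site.xml: dfs.name.dir takes a comma-separated list, and the namenode writes its image and edit log to every listed directory, so losing one disk does not lose the filesystem metadata (the paths below are placeholders):

```xml
<property>
  <name>dfs.name.dir</name>
  <!-- two copies on different physical storage; paths are examples only -->
  <value>/disk1/hadoop/name,/mnt/remote/hadoop/name</value>
</property>
```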
Re: DFS temporary files?
Owen O'Malley wrote:

Currently there isn't a way to do that. In Hadoop 0.19, there will be a way to have a clean-up method that runs at the end of the job. See HADOOP-3150 (https://issues.apache.org/jira/browse/HADOOP-3150).

Another bit of feature creep would be an expires: attribute on files, and something to purge expired files every so often, which ensures that even if a job dies or the entire cluster is reset, stuff gets cleaned up.

Before someone rushes to implement this: I've been burned in the past by differences between a cluster's machines and clocks. Even if everything really is in sync with NTP, and not configured to talk to an NTP server that the production site can't see, you still need to be 100% sure that all your boxes are in the same time zone.

-steve
Race condition in SequenceFileOutputFormat.getReaders()?
Hi,

Using the Nutch tools I see fairly frequent crashes in SequenceFileOutputFormat.getReaders(), with stack traces like the one below. What appears to be happening is that there's a temporary file inside generate-temp-1220879127849 which exists when getReaders() lists the contents of the directory, but has been deleted by the time it goes to examine the contents. Since I'm using Nutch, this is Hadoop version 0.15, but the code for getReaders() doesn't seem to have changed in 0.18. Is this a known problem?

regards
Barry

2008-09-08 14:07:04,429 FATAL crawl.Generator - Generator: org.apache.hadoop.ipc.RemoteException: java.io.IOException: Cannot open filename /tmp/hadoop-bhaddow/mapred/temp/generate-temp-1220879127849/_task_200809081337_0019_r_04_1
        at org.apache.hadoop.dfs.NameNode.open(NameNode.java:238)
        at sun.reflect.GeneratedMethodAccessor8.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)

        at org.apache.hadoop.ipc.Client.call(Client.java:482)
        at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
        at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59)
        at org.apache.hadoop.dfs.$Proxy0.open(Unknown Source)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.openInfo(DFSClient.java:848)
        at org.apache.hadoop.dfs.DFSClient$DFSInputStream.<init>(DFSClient.java:840)
        at org.apache.hadoop.dfs.DFSClient.open(DFSClient.java:285)
        at org.apache.hadoop.dfs.DistributedFileSystem.open(DistributedFileSystem.java:114)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1356)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1349)
        at org.apache.hadoop.io.SequenceFile$Reader.<init>(SequenceFile.java:1344)
        at org.apache.hadoop.mapred.SequenceFileOutputFormat.getReaders(SequenceFileOutputFormat.java:87)
        at org.apache.nutch.crawl.Generator.generate(Generator.java:443)
        at org.apache.nutch.crawl.Generator.run(Generator.java:580)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:54)
        at org.apache.nutch.crawl.Generator.main(Generator.java:543)
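The list-then-open race described above can be narrowed by tolerating files that vanish between the directory listing and the open. The real getReaders() works against the Hadoop FileSystem API, so this is only a local-filesystem sketch of the defensive pattern, with an invented class name:

```java
import java.io.*;
import java.util.*;

// List a directory, then open each entry, skipping any file that was deleted
// in between (e.g. another task cleaning up its temporary output).
class TolerantReaderFactory {
    static List<FileInputStream> openAll(File dir) throws IOException {
        List<FileInputStream> readers = new ArrayList<>();
        File[] entries = dir.listFiles();
        if (entries == null) {
            throw new IOException(dir + " is not a listable directory");
        }
        for (File f : entries) {
            try {
                readers.add(new FileInputStream(f));
            } catch (FileNotFoundException e) {
                // Deleted after listing but before opening: skip it rather than
                // failing the whole job, mirroring the race seen in getReaders().
            }
        }
        return readers;
    }
}
```

A variant could also filter out names that look like in-flight task output (e.g. a leading underscore) before opening anything.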
About the name of the secondary namenode!!
Because the name "secondary namenode" causes so much confusion, would the Hadoop team consider changing it?

--
Sorry for my english!!
明
Re: Basic code organization questions + scheduling
If you wrote a simple URL fetcher function for Cascading, you would have a very powerful web crawler that would dwarf Nutch in flexibility. That said, Nutch is optimized for storage, has supporting tools and ranking algorithms, and has been up against some nasty HTML and other document types. Building a really robust crawler is non-trivial.

If I were just starting out and needed to implement a proprietary process, I would use Nutch for fetching raw content and refreshing it, then use Cascading for parsing, indexing, etc.

cheers,
chris

On Sep 8, 2008, at 12:42 AM, tarjei wrote: [...]

--
Chris K Wensel
[EMAIL PROTECTED]
http://chris.wensel.net/
http://www.cascading.org/
Thinking about retrieving DFS metadata from datanodes!!!
Hi all. We all know the importance of the NameNode to a cluster: when the NameNode, which holds all the metadata of the DFS, breaks down, all of the cluster's data is gone. So, if we could retrieve the DFS metadata from the datanodes, it would add great robustness to Hadoop. Does anyone think this is valuable?

--
Sorry for my english!!
明
Re: public IP for datanode on EC2
I have tried using slave.host.name and gave it the public address of my datanode. I can now see the node listed with its public address on dfshealth.jsp; however, when I try to send a file to the HDFS from my external server I still get:

08/09/08 15:58:41 INFO dfs.DFSClient: Waiting to find target node: 10.251.75.177:50010
08/09/08 15:59:50 INFO dfs.DFSClient: Exception in createBlockOutputStream java.net.SocketTimeoutException
08/09/08 15:59:50 INFO dfs.DFSClient: Abandoning block blk_-8257572465338588575
08/09/08 15:59:50 INFO dfs.DFSClient: Waiting to find target node: 10.251.75.177:50010
08/09/08 15:59:56 WARN dfs.DFSClient: DataStreamer Exception: java.io.IOException: Unable to create new block.
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.nextBlockOutputStream(DFSClient.java:2246)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream.access$1700(DFSClient.java:1702)
        at org.apache.hadoop.dfs.DFSClient$DFSOutputStream$DataStreamer.run(DFSClient.java:1842)
08/09/08 15:59:56 WARN dfs.DFSClient: Error Recovery for block blk_-8257572465338588575 bad datanode[0]

Is there another parameter I could specify to force the address of my datanode? I have been searching the EC2 forums and documentation, and apparently there is no way I can use dfs.datanode.dns.interface or dfs.datanode.dns.nameserver to specify the public IP of my instance. Has anyone else managed to send/retrieve stuff from HDFS on an EC2 cluster from an external machine?

Thanks
Julien

2008/9/5 Julien Nioche [EMAIL PROTECTED]:

Hi guys, I am using Hadoop on an EC2 cluster and am trying to send files onto the HDFS from an external machine. It works up to the point where I get this error message:

Waiting to find target node: 10.250.7.148:50010

I've seen a discussion about a similar issue at http://thread.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/2446/focus=2449 but there are no details on how to fix the problem. Any idea how I can set up my EC2 instances so that they return their public IPs and not the internal Amazon ones? Anything I can specify for the parameters dfs.datanode.dns.interface and dfs.datanode.dns.nameserver?

What I am trying to do is to put my input to be processed onto the HDFS and retrieve the output from there. What I am not entirely sure of is whether I can launch my job from the external machine. Most people seem to SSH to the master to do that.

Thanks
Julien

--
DigitalPebble Ltd
http://www.digitalpebble.com
RE: HBase, Hive, Pig and other Hadoop based technologies
Comments inline below:

-----Original Message-----
From: Naama Kraus [mailto:[EMAIL PROTECTED]]
Sent: Monday, September 08, 2008 1:00 AM
To: hadoop-core; [EMAIL PROTECTED]
Subject: Re: HBase, Hive, Pig and other Hadoop based technologies

> Both Pig and Hive are written on top of Hadoop, is that correct?

Yes.

> Does this mean, for instance, that they would work with any implementation of the Hadoop File System interface, whether it is HDFS, KFS, LocalFS, or any future implementation?

Yes.

> Is that true for HBase as well?

Yes.

> I am assuming Pig and Hive use the Hadoop standard M/R framework and API, is that right?

Yes.

> Thanks, Naama
Re: public IP for datanode on EC2
Julien Nioche wrote:
[...]
> Has anyone else managed to send/retrieve stuff from HDFS on an EC2 cluster from an external machine?

I think most people try to avoid allowing remote access for security reasons. If you can add a file, I can mount your filesystem too, maybe even delete things. Whereas with EC2-only filesystems, your files are only exposed to everyone else who knows, or can scan for, your IP address and ports.
Re: Getting started questions
Dennis, Thanks for the detailed response. I need to play with the SequenceFile format a bit -- I found the documentation for it on the wiki. I think I could build on top of the format to handle storage of very large documents. The vast majority of documents will fit into RAM and in a standard HDFS block (64MB, maybe up it to 128MB). For very large documents, I can split them into consecutive records in the SequenceFile. I can overload the key to be a combination of a real key and a record number... Shouldn't be too hard to extend SequenceFile to do this. Much obliged, John
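The composite key described above (a real key combined with a record number, so a large document's chunks sort together and in order) can be sketched as a small class. This is a dependency-free sketch: the name DocRecordKey and its fields are hypothetical, and it only mirrors the write/readFields/compareTo contract of Hadoop's WritableComparable instead of importing it:

```java
import java.io.*;

// Hypothetical composite key (docId, recordNo) for splitting very large
// documents across consecutive SequenceFile records. In a real job this
// class would declare "implements WritableComparable<DocRecordKey>".
class DocRecordKey implements Comparable<DocRecordKey> {
    long docId;    // the "real" key identifying the document
    int recordNo;  // position of this chunk within the document

    DocRecordKey() {}  // no-arg constructor, required for deserialization
    DocRecordKey(long docId, int recordNo) { this.docId = docId; this.recordNo = recordNo; }

    public void write(DataOutput out) throws IOException {
        out.writeLong(docId);
        out.writeInt(recordNo);
    }

    public void readFields(DataInput in) throws IOException {
        docId = in.readLong();
        recordNo = in.readInt();
    }

    // Sort by document first, then by record number, so a document's
    // chunks come back in order.
    public int compareTo(DocRecordKey o) {
        if (docId != o.docId) return docId < o.docId ? -1 : 1;
        return Integer.compare(recordNo, o.recordNo);
    }

    public static void main(String[] args) throws IOException {
        // Round-trip through the binary format, as a SequenceFile would.
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        new DocRecordKey(42L, 7).write(new DataOutputStream(buf));
        DocRecordKey back = new DocRecordKey();
        back.readFields(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        System.out.println(back.docId + ":" + back.recordNo);  // prints 42:7
    }
}
```

With the Hadoop interface in place, the class would be registered on the job via JobConf as the output key class in the usual way.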
Re: can i run multiple datanode in one pc?
2008-09-05 10:11:20,106 ERROR org.apache.hadoop.dfs.DataNode: java.net.BindException: Problem binding to /0.0.0.0:50010 Could you check whether something is already listening on 50010? I strongly encourage users to go through these logs when something does not work. Regarding running multiple datanodes: along with the config directory, you should change the following env variables: HADOOP_LOG_DIR and HADOOP_PID_DIR (I set both to the same value). The following variables should be different in hadoop-site.xml: 1. dfs.data.dir or hadoop.tmp.dir 2. dfs.datanode.address 3. dfs.datanode.http.address 4. dfs.datanode.ipc.address Raghu. 叶双明 wrote: In addition, I can't start a datanode on another computer using the command bin/hadoop-daemons.sh --config conf start datanode with the default config; the log message is: 2008-09-05 10:11:19,208 INFO org.apache.hadoop.dfs.DataNode: STARTUP_MSG: / STARTUP_MSG: Starting DataNode STARTUP_MSG: host = testcenter/192.168.100.120 STARTUP_MSG: args = [] STARTUP_MSG: version = 0.17.1 STARTUP_MSG: build = http://svn.apache.org/repos/asf/hadoop/core/branches/branch-0.17 -r 669344; compiled by 'hadoopqa' on Thu Jun 19 01:18:25 UTC 2008 / 2008-09-05 10:11:20,097 INFO org.apache.hadoop.dfs.DataNode: Registered FSDatasetStatusMBean 2008-09-05 10:11:20,106 ERROR org.apache.hadoop.dfs.DataNode: java.net.BindException: Problem binding to /0.0.0.0:50010 at org.apache.hadoop.ipc.Server.bind(Server.java:175) at org.apache.hadoop.dfs.DataNode.startDataNode(DataNode.java:264) at org.apache.hadoop.dfs.DataNode.init(DataNode.java:171) at org.apache.hadoop.dfs.DataNode.makeInstance(DataNode.java:2765) at org.apache.hadoop.dfs.DataNode.instantiateDataNode(DataNode.java:2720) at org.apache.hadoop.dfs.DataNode.createDataNode(DataNode.java:2728) at org.apache.hadoop.dfs.DataNode.main(DataNode.java:2850) 2008-09-05 10:11:20,107 INFO org.apache.hadoop.dfs.DataNode: SHUTDOWN_MSG: / SHUTDOWN_MSG: Shutting down DataNode at testcenter/192.168.100.120 / Maybe there is something wrong with the network
config??? 2008/9/5 叶双明 [EMAIL PROTECTED] I can start the first datanode by bin/hadoop-daemons.sh --config conf1 start datanode, then copy conf1 to conf2 and modify the values of the following properties (port = port + 1):

<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50011</value>
  <description>The address where the datanode server will listen to. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50076</value>
  <description>The datanode http server address and port. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.http.address</name>
  <value>0.0.0.0:50071</value>
  <description>The address and the base port where the dfs namenode web ui will listen on. If the port is 0 then the server will start on a free port.</description>
</property>
<property>
  <name>dfs.datanode.https.address</name>
  <value>0.0.0.0:50476</value>
</property>

And run bin/hadoop-daemons.sh --config conf2 start datanode, but I get the message: datanode running as process 4137. Stop it first. What other properties could I modify between the two config files? Thanks. 2008/9/5 lohit [EMAIL PROTECTED] A good way is to have different 'conf' dirs. So you would end up with dirs conf1 and conf2, and startup of the datanodes would be ./bin/hadoop-daemons.sh --config conf1 start datanode and ./bin/hadoop-daemons.sh --config conf2 start datanode. Make sure you have different hadoop-site.xml files in the conf1 and conf2 dirs. -Lohit - Original Message From: 叶双明 [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Thursday, September 4, 2008 12:01:48 AM Subject: Re: can i run multiple datanode in one pc? Thanks lohit. I tried to start the datanode with the command bin/hadoop datanode -conf conf/hadoop-site.xml; it doesn't work, but the command bin/hadoop datanode does. Have I done something wrong? 2008/9/4 lohit [EMAIL PROTECTED] Yes, each datanode should point to a different config.
So, if you have conf/hadoop-site.xml, make another conf2/hadoop-site.xml with different ports for the datanode-specific settings, and you should be able to start multiple datanodes on the same node. -Lohit - Original Message From: 叶双明 [EMAIL PROTECTED] To: core-user@hadoop.apache.org Sent: Wednesday, September 3, 2008 8:19:59 PM Subject: can i run multiple datanode in one pc? I know that normally one datanode runs on one computer. I am wondering: can I run multiple datanodes on one PC?
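To make Raghu's checklist from this thread concrete, a second instance's conf2/hadoop-site.xml would override the per-instance storage directory and ports. A sketch with hypothetical paths and port numbers; remember also to point HADOOP_LOG_DIR and HADOOP_PID_DIR somewhere instance-specific before starting, or the start script will see the first datanode's pid file and report "Stop it first":

```xml
<!-- conf2/hadoop-site.xml: per-instance overrides (paths/ports are examples) -->
<property>
  <name>dfs.data.dir</name>
  <value>/var/hadoop/dn2/data</value>
</property>
<property>
  <name>dfs.datanode.address</name>
  <value>0.0.0.0:50011</value>
</property>
<property>
  <name>dfs.datanode.http.address</name>
  <value>0.0.0.0:50076</value>
</property>
<property>
  <name>dfs.datanode.ipc.address</name>
  <value>0.0.0.0:50021</value>
</property>
```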
Re: Thinking about retriving DFS metadata from datanodes!!!
There is certainly value in it, but it cannot be done with the data datanodes currently have. The usual way to protect NameNode metadata in Hadoop is to write the metadata to two different locations. Raghu. 叶双明 wrote: Hi all. We all know the importance of the NameNode for a cluster: when the NameNode with all the metadata of the DFS breaks down, all of the cluster's data is gone. So, if we could retrieve DFS metadata from the datanodes, it would add great robustness to Hadoop. Does anyone think this is valuable?
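Raghu's "two different locations" advice maps to the dfs.name.dir property, which takes a comma-separated list of directories the NameNode writes its image and edit log to in parallel. A sketch with hypothetical paths (one local disk plus, ideally, a remote mount):

```xml
<property>
  <name>dfs.name.dir</name>
  <value>/disk1/hadoop/name,/mnt/nfs/hadoop/name</value>
</property>
```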
Re: Hadoop + Elastic Block Stores
Ryan LeCompte wrote: I'd really love to one day see some scripts under src/contrib/ec2/bin that can setup/mount the EBS volumes automatically. :-) The fastest way might be to write and contribute such scripts! Doug
Simple Survey
Hey all Scale Unlimited is putting together some case studies for an upcoming class and wants to get a snapshot of what the Hadoop user community looks like. If you have 2 minutes, please feel free to take the short anonymous survey below: http://www.scaleunlimited.com/survey.html All results will be public. cheers, chris -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: About the nama of second-namenode!!
Changing it will unfortunately cause confusion too. Sigh. This is why we should take time to name things well the first time. Doug 叶双明 wrote: Because the name of the secondary namenode causes so much confusion, would the Hadoop team consider changing it?
Re: is SecondaryNameNode in support for the NameNode?
NameNodeFailover http://wiki.apache.org/hadoop/NameNodeFailover, with a SecondaryNameNode http://wiki.apache.org/hadoop/SecondaryNameNode hosted - I think this is wrong, please correct it. You are probably looking at cached results; neither page exists. The first one was a cause of confusion and was removed. Regards, --Konstantin 2008/9/6, Jean-Daniel Cryans [EMAIL PROTECTED]: Hi, See http://wiki.apache.org/hadoop/FAQ#7 and http://hadoop.apache.org/core/docs/r0.17.2/hdfs_user_guide.html#Secondary+Namenode Regards, J-D On Sat, Sep 6, 2008 at 5:26 AM, ??? [EMAIL PROTECTED] wrote: Hi all! The NameNode is a Single Point of Failure for the HDFS Cluster. There is support for NameNodeFailover, with a SecondaryNameNode hosted on a separate machine being able to stand in for the original NameNode if it goes down. Is this right? Is the SecondaryNameNode a standby for the NameNode? Sorry for my english!! ?
Re: specifying number of nodes for job
yeah. snickerdoodle. really. I see.. so if I have a cluster with n nodes, there is no way for me to have it spawn on just 2 of those nodes, or just one of those nodes? And furthermore, there is no way for me to have it spawn on just a subset of the processors? Or am I misunderstanding? Also, when you say specify the number of tasks for each node, are you referring to specifying the number of mappers and reducers I can spawn on each node? -SM On Sun, Sep 7, 2008 at 8:29 PM, Mafish Liu [EMAIL PROTECTED] wrote: On Mon, Sep 8, 2008 at 2:25 AM, Sandy [EMAIL PROTECTED] wrote: Hi, This may be a silly question, but I'm strangely having trouble finding an answer for it (perhaps I'm looking in the wrong places?). Suppose I have a cluster with n nodes, each with m processors. I wish to test the performance of, say, the wordcount program on k processors, where k is varied from k = 1 ... n*m. You can specify the number of tasks for each node in your hadoop-site.xml file, so you can get k varied as k = n, 2*n, ..., m*n instead of k = 1 ... n*m. How would I do this? I'm having trouble finding the proper command-line option in the commands manual ( http://hadoop.apache.org/core/docs/current/commands_manual.html ) Thank you very much for your time. -SM -- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
Re: distcp failing
It is likely that your mapred.system.dir and/or fs.default.name settings are incorrect on the non-datanode machine that you are launching the task from. These two settings (in your conf/hadoop-site.xml file) must match the settings on the cluster itself. - Aaron On Sun, Sep 7, 2008 at 8:58 PM, Michael Di Domenico [EMAIL PROTECTED] wrote: I'm attempting to load data into hadoop (version 0.17.1) from a non-datanode machine in the cluster. I can run jobs and copyFromLocal works fine, but when I try to use distcp I get the below. I don't understand the error; can anyone help? Thanks blue:hadoop-0.17.1 mdidomenico$ time bin/hadoop distcp -overwrite file:///Users/mdidomenico/hadoop/1gTestfile /user/mdidomenico/1gTestfile 08/09/07 23:56:06 INFO util.CopyFiles: srcPaths=[file:/Users/mdidomenico/hadoop/1gTestfile] 08/09/07 23:56:06 INFO util.CopyFiles: destPath=/user/mdidomenico/1gTestfile1 08/09/07 23:56:07 INFO util.CopyFiles: srcCount=1 With failures, global counters are inaccurate; consider running with -i Copy failed: org.apache.hadoop.ipc.RemoteException: java.io.IOException: /tmp/hadoop-hadoop/mapred/system/job_200809072254_0005/job.xml: No such file or directory at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:215) at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:149) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1155) at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1136) at org.apache.hadoop.mapred.JobInProgress.init(JobInProgress.java:175) at org.apache.hadoop.mapred.JobTracker.submitJob(JobTracker.java:1755) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source) at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source) at java.lang.reflect.Method.invoke(Unknown Source) at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:446) at org.apache.hadoop.ipc.Server$Handler.run(Server.java:896) at
org.apache.hadoop.ipc.Client.call(Client.java:557) at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:212) at $Proxy1.submitJob(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25) at java.lang.reflect.Method.invoke(Method.java:585) at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:82) at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:59) at $Proxy1.submitJob(Unknown Source) at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:758) at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:973) at org.apache.hadoop.util.CopyFiles.copy(CopyFiles.java:604) at org.apache.hadoop.util.CopyFiles.run(CopyFiles.java:743) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65) at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:79) at org.apache.hadoop.util.CopyFiles.main(CopyFiles.java:763)
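Aaron's point can be checked directly: the submitting machine's conf/hadoop-site.xml must name the cluster's real settings rather than falling back to defaults (the /tmp/hadoop-hadoop/mapred/system path in the error is the default mapred.system.dir). A sketch with hypothetical hostnames, ports, and paths, each of which must match the values configured on the cluster itself:

```xml
<property>
  <name>fs.default.name</name>
  <value>hdfs://namenode.example.com:9000</value>
</property>
<property>
  <name>mapred.job.tracker</name>
  <value>jobtracker.example.com:9001</value>
</property>
<property>
  <name>mapred.system.dir</name>
  <value>/hadoop/mapred/system</value>
</property>
```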
Re: specifying number of nodes for job
-smiles- It's not nice to poke fun at people's e-mail aliases... and snickerdoodles are delicious cookies. In all seriousness though, why is this not possible? Is there something about the MapReduce model of parallel computation that I am not understanding? Or is this more of an arbitrary implementation choice made by the Hadoop framework? If so, I am curious why this is the case. What are the benefits? What I'm talking about is not uncommon for scalability studies. Is being able to specify the number of processors considered a desirable feature by developers? Just curious, of course. Regards, -SM On Mon, Sep 8, 2008 at 3:36 PM, [EMAIL PROTECTED] wrote: yeah. snickerdoodle. really. I see.. so if I have a cluster with n nodes, there is no way for me to have it spawn on just 2 of those nodes, or just one of those nodes? And furthermore, there is no way for me to have it spawn on just a subset of the processors? Or am I misunderstanding? Also, when you say specify the number of tasks for each node, are you referring to specifying the number of mappers and reducers I can spawn on each node? -SM On Sun, Sep 7, 2008 at 8:29 PM, Mafish Liu [EMAIL PROTECTED] wrote: On Mon, Sep 8, 2008 at 2:25 AM, Sandy [EMAIL PROTECTED] wrote: Hi, This may be a silly question, but I'm strangely having trouble finding an answer for it (perhaps I'm looking in the wrong places?). Suppose I have a cluster with n nodes, each with m processors. I wish to test the performance of, say, the wordcount program on k processors, where k is varied from k = 1 ... n*m. You can specify the number of tasks for each node in your hadoop-site.xml file, so you can get k varied as k = n, 2*n, ..., m*n instead of k = 1 ... n*m. How would I do this? I'm having trouble finding the proper command-line option in the commands manual ( http://hadoop.apache.org/core/docs/current/commands_manual.html ) Thank you very much for your time. -SM -- [EMAIL PROTECTED] Institute of Computing Technology, Chinese Academy of Sciences, Beijing.
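For what Mafish describes (varying concurrent tasks per node rather than per processor), the per-node slot counts live in hadoop-site.xml on each tasktracker, while per-job totals are only hints set on the JobConf. A sketch with hypothetical values, assuming 0.17-era property names:

```xml
<!-- per node: at most m concurrent map/reduce tasks (example: m = 4) -->
<property>
  <name>mapred.tasktracker.map.tasks.maximum</name>
  <value>4</value>
</property>
<property>
  <name>mapred.tasktracker.reduce.tasks.maximum</name>
  <value>4</value>
</property>
```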
Monthly Hadoop User Group Meeting (Bay Area)
The next Hadoop User Group (Bay Area) meeting is scheduled for Wednesday, Sept 17th from 6 - 7:30 pm at Yahoo! Mission College, Santa Clara, CA, Building 2, Training Rooms 34. Agenda: Cloud Computing Testbed - Thomas Sandholm, HP Katta on Hadoop - Stefan Groschupf Registration and directions: http://upcoming.yahoo.com/event/1075456/ Look forward to seeing you there! Ajay
Re: About the nama of second-namenode!!
Hoping for a good way to solve it! 2008/9/9 Doug Cutting [EMAIL PROTECTED] Changing it will unfortunately cause confusion too. Sigh. This is why we should take time to name things well the first time. Doug 叶双明 wrote: Because the name of the secondary namenode causes so much confusion, would the Hadoop team consider changing it? -- Sorry for my english!! 明
Re: is SecondaryNameNode in support for the NameNode?
Thanks for all! 2008/9/9 Konstantin Shvachko [EMAIL PROTECTED] [...] -- Sorry for my english!! 明
Re: Monthly Hadoop User Group Meeting (Bay Area)
doh, conveniently collides with the GridGain and GridDynamics presentations: http://web.meetup.com/66/calendar/8561664/ On Sep 8, 2008, at 4:55 PM, Ajay Anand wrote: The next Hadoop User Group (Bay Area) meeting is scheduled for Wednesday, Sept 17th from 6 - 7:30 pm at Yahoo! Mission College, Santa Clara, CA [...] -- Chris K Wensel [EMAIL PROTECTED] http://chris.wensel.net/ http://www.cascading.org/
Re: Thinking about retriving DFS metadata from datanodes!!!
I am thinking about letting datanodes carry additional information: which file each block belongs to, its sequence number in the order of blocks, and how many blocks the file has. Does that make sense? 2008/9/9 Raghu Angadi [EMAIL PROTECTED] There is certainly value in it. But it can not be done with the data datanodes currently have. [...] -- Sorry for my english!! 明
Re: Logging best practices?
Per Jacobsson wrote: Hi all. I've got a beginner question: are there any best practices for how to do logging from a task? Essentially I want to log warning messages under certain conditions in my map and reduce tasks, and be able to review them later. stdout, stderr and the logs written via commons-logging from the task are stored in the userlogs directory, i.e. ${hadoop.log.dir}/userlogs/taskid. They are also available on the web UI. Is good old commons-logging using the TaskLogAppender the best way to solve this? I think using commons-logging is good. I assume I'd have to configure it to log to stderr to be able to see the log messages in the jobtracker webapp. The Reporter would be useful to track statistics but not for something like this. And the JobHistory class and history logs are intended for internal use only? The JobHistory class is for internal use only, but the history logs can be viewed from the web UI and the HistoryViewer. Thanks a lot, Per Thanks Amareshwari
Re: Failing MR jobs!
Do you have some more detailed information? Logs are helpful. On Mon, Sep 8, 2008 at 3:26 AM, Erik Holstad [EMAIL PROTECTED] wrote: Hi! I'm trying to run an MR job, but it keeps on failing and I can't understand why. Sometimes it shows output at 66% and sometimes 98% or so. I had a couple of exceptions before that I didn't catch that made the job fail. The log file from the task can be found at: http://pastebin.com/m4414d369 and the code looks like:

//Java
import java.io.*;
import java.util.*;
import java.net.*;
//Hadoop
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapred.*;
import org.apache.hadoop.util.*;
//HBase
import org.apache.hadoop.hbase.*;
import org.apache.hadoop.hbase.mapred.*;
import org.apache.hadoop.hbase.io.*;
import org.apache.hadoop.hbase.client.*; // org.apache.hadoop.hbase.client.HTable
//Extra
import org.apache.commons.cli.ParseException;
import org.apache.commons.httpclient.MultiThreadedHttpConnectionManager;
import org.apache.commons.httpclient.*;
import org.apache.commons.httpclient.methods.*;
import org.apache.commons.httpclient.params.HttpMethodParams;

public class SerpentMR1 extends TableMap implements Mapper, Tool {
  //Setting DebugLevel
  private static final int DL = 0;
  //Setting up the variables for the MR job
  private static final String NAME = "SerpentMR1";
  private static final String INPUTTABLE = "sources";
  private final String[] COLS = {"content:feedurl", "content:ttl", "content:updated"};
  private Configuration conf;

  public JobConf createSubmittableJob(String[] args) throws IOException {
    JobConf c = new JobConf(getConf(), SerpentMR1.class);
    String jar = "/home/hbase/SerpentMR/" + NAME + ".jar";
    c.setJar(jar);
    c.setJobName(NAME);
    int mapTasks = 4;
    int reduceTasks = 20;
    c.setNumMapTasks(mapTasks);
    c.setNumReduceTasks(reduceTasks);
    String inputCols = "";
    for (int i = 0; i < COLS.length; i++) { inputCols += COLS[i] + " "; }
    TableMap.initJob(INPUTTABLE, inputCols, this.getClass(), Text.class, BytesWritable.class, c);
    //Classes between:
    c.setOutputFormat(TextOutputFormat.class);
    Path path = new Path("users"); //inserting into a temp table
    FileOutputFormat.setOutputPath(c, path);
    c.setReducerClass(MyReducer.class);
    return c;
  }

  public void map(ImmutableBytesWritable key, RowResult res, OutputCollector output, Reporter reporter) throws IOException {
    Cell cellLast = res.get(COLS[2].getBytes()); //lastupdate
    long oldTime = cellLast.getTimestamp();
    Cell cell_ttl = res.get(COLS[1].getBytes()); //ttl
    long ttl = StreamyUtil.BytesToLong(cell_ttl.getValue());
    byte[] url = null;
    long currTime = time.GetTimeInMillis();
    if (currTime - oldTime > ttl) {
      url = res.get(COLS[0].getBytes()).getValue(); //url
      output.collect(new Text(Base64.encode_strip(res.getRow())), new BytesWritable(url));
    }
  }

  public static class MyReducer implements Reducer { //org.apache.hadoop.mapred.Reducer
    private int timeout = 1000; //Sets the connection timeout in ms
    public void reduce(Object key, Iterator values, OutputCollector output, Reporter rep) throws IOException {
      HttpClient client = new HttpClient(); //new MultiThreadedHttpConnectionManager());
      client.getHttpConnectionManager().getParams().setConnectionTimeout(timeout);
      GetMethod method = null;
      int stat = 0;
      String content = "";
      byte[] colFam = "select".getBytes();
      byte[] column = "lastupdate".getBytes();
      byte[] currTime = null;
      HBaseRef hbref = new HBaseRef();
      JerlType sendjerl = null; //new JerlType();
      ArrayList jd = new ArrayList();
      InputStream is = null;
      while (values.hasNext()) {
        BytesWritable bw = (BytesWritable) values.next();
        String address = new String(bw.get());
        try {
          System.out.println(address);
          method = new GetMethod(address);
          method.setFollowRedirects(true);
        } catch (Exception e) {
          System.err.println("Invalid Address");
          e.printStackTrace();
        }
        if (method != null) {
          try {
            // Execute the method.
            stat = client.executeMethod(method);
            if (stat == 200) {
              content = "";
              is = (InputStream) (method.getResponseBodyAsStream());
              //Write to HBase new stamp
Re: Failing MR jobs!
Sorry, I didn't see the log link. On Tue, Sep 9, 2008 at 12:01 PM, Shengkai Zhu [EMAIL PROTECTED] wrote: Do you have some more detailed information? Logs are helpful. On Mon, Sep 8, 2008 at 3:26 AM, Erik Holstad [EMAIL PROTECTED] wrote: Hi! I'm trying to run a MR job, but it keeps on failing and I can't understand why. [...]
Issue in reduce phase with SortedMapWritable and custom Writables as values
Hello, I'm attempting to use a SortedMapWritable with a LongWritable as the key and a custom implementation of org.apache.hadoop.io.Writable as the value. I notice that my program works fine when I use another primitive wrapper (e.g. Text) as the value, but fails with the following exception when I use my custom Writable instance: 2008-09-08 23:25:02,072 INFO org.apache.hadoop.mapred.ReduceTask: Initiating in-memory merge with 1 segments... 2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Merging 1 sorted segments 2008-09-08 23:25:02,077 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 1 segments left of total size: 5492 bytes 2008-09-08 23:25:02,099 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200809082247_0005_r_00_0 Merge of the inmemory files threw a n exception: java.io.IOException: Intermedate merge failed at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.doInMemMerge(ReduceTask.java:2133) at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$InMemFSMergeThread.run(ReduceTask.java:2064) Caused by: java.lang.RuntimeException: java.lang.NullPointerException at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:80) at org.apache.hadoop.io.SortedMapWritable.readFields(SortedMapWritable.java:179) ... I noticed that the AbstractMapWritable class has a protected addToMap(Class clazz) method. Do I somehow need to let my SortedMapWritable instance know about my custom Writable value? I've properly implemented the custom Writable object (it just contains a few primitives, like longs and ints). Any insight is appreciated. Thanks, Ryan
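SortedMapWritable rebuilds each value reflectively from a class id written alongside the entry, so the reading side must both know the value class and be able to instantiate it through a no-arg constructor; a custom class the map's id-to-class table never learned about is one plausible way to end up with the NullPointerException above (the protected AbstractMapWritable.addToMap(Class) is the registration hook, so a small SortedMapWritable subclass whose constructor registers the custom class is one thing to try, along with double-checking the public no-arg constructor). As a dependency-free illustration of that registration idea — TinyMapCodec, Slot, and LongSlot are hypothetical names, not Hadoop APIs:

```java
import java.io.*;
import java.util.*;

// Minimal sketch (not Hadoop code) of the id->class registry behind map
// serialization: values are written as (classId, payload) and the reader
// re-instantiates them reflectively, so every value class must be
// registered with the codec and must have a no-arg constructor.
class TinyMapCodec {
    interface Slot {
        void write(DataOutput out) throws IOException;
        void read(DataInput in) throws IOException;
    }

    // Example custom value type (stand-in for the custom Writable).
    static class LongSlot implements Slot {
        long v;
        public LongSlot() {}  // no-arg constructor needed by the reader
        LongSlot(long v) { this.v = v; }
        public void write(DataOutput out) throws IOException { out.writeLong(v); }
        public void read(DataInput in) throws IOException { v = in.readLong(); }
    }

    private final List<Class<? extends Slot>> registry = new ArrayList<>();
    void register(Class<? extends Slot> c) { registry.add(c); }

    byte[] encode(Map<Long, Slot> m) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeInt(m.size());
        for (Map.Entry<Long, Slot> e : m.entrySet()) {
            out.writeLong(e.getKey());
            int id = registry.indexOf(e.getValue().getClass());
            if (id < 0)  // unregistered class: fail at write time, loudly
                throw new IOException("unregistered value class: " + e.getValue().getClass());
            out.writeByte(id);
            e.getValue().write(out);
        }
        return buf.toByteArray();
    }

    SortedMap<Long, Slot> decode(byte[] bytes) throws Exception {
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes));
        SortedMap<Long, Slot> m = new TreeMap<>();
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            long key = in.readLong();
            // Reflective instantiation: this is the step that blows up if the
            // reader cannot map the id back to an instantiable class.
            Slot v = registry.get(in.readByte()).getDeclaredConstructor().newInstance();
            v.read(in);
            m.put(key, v);
        }
        return m;
    }

    public static void main(String[] args) throws Exception {
        TinyMapCodec codec = new TinyMapCodec();
        codec.register(LongSlot.class);  // without this, encode/decode fails
        SortedMap<Long, Slot> m = new TreeMap<>();
        m.put(5L, new LongSlot(99));
        SortedMap<Long, Slot> back = codec.decode(codec.encode(m));
        System.out.println(((LongSlot) back.get(5L)).v);  // prints 99
    }
}
```

The sketch uses a shared codec object on both sides; in Hadoop the equivalent shared knowledge is the class registry AbstractMapWritable maintains, which is why the value class has to be made known to the map type rather than only to your own code.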