Getting Job counters on Hadoop 0.20.2
Hi, I am trying to retrieve job counters on Hadoop 0.20.2, either at runtime or from the job history, using the org.apache.hadoop.mapred API. My program takes a job_id as input and should return the counters for that job, either while it is running or by pulling the info from the job history. I was able to do this on the newer Hadoop (trunk) version using the org.apache.hadoop.yarn.api.records.ApplicationReport, org.apache.hadoop.mapreduce.v2.app.job.Job and org.apache.hadoop.mapreduce.v2.hs.JobHistory APIs. Can someone please tell me how to get the JobCounters in old Hadoop? In 0.20.2, JobClient returns a RunningJob instance if we pass it a JobID. If I execute the snippet below against a job that was launched within the same program, I get a valid RunningJob instance and am able to fetch the counters. But if I pass my program the job_id of a job running on the cluster and try to get its JobCounters, the returned RunningJob is simply null. :(

RunningJob runningJob = jobClient.getJob(JobID.forName(jobId));
if (runningJob == null) {
    System.err.printf("No job with ID %s found.\n", jobId);
    return;
}
if (!runningJob.isComplete()) {
    System.err.printf("Job %s is not complete.\n", jobId);
}
Counters oldCounters = runningJob.getCounters();

Any idea what I am doing wrong here? Regards, Prajakta
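A minimal, self-contained sketch of the same lookup, assuming the 0.20.x mapred API. One common cause of a null RunningJob here is that the client's JobConf carries no JobTracker address (e.g. no mapred-site.xml on the classpath), so JobClient queries the wrong, often local, tracker; the JobTracker address and class name below are placeholders, not anything from the original post.

import org.apache.hadoop.mapred.Counters;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

public class FetchJobCounters {
  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf();
    // Hypothetical address -- make sure this points at the cluster's
    // JobTracker, otherwise getJob() cannot see jobs submitted by others.
    conf.set("mapred.job.tracker", "jobtracker.example.com:8021");
    JobClient jobClient = new JobClient(conf);
    RunningJob job = jobClient.getJob(JobID.forName(args[0]));
    if (job == null) {
      System.err.printf("No job with ID %s found.%n", args[0]);
      return;
    }
    // Works for running jobs, and for completed jobs that the
    // JobTracker still retains in memory.
    Counters counters = job.getCounters();
    System.out.println(counters);
  }
}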
Counting records
Hi,

I am a complete noob with Hadoop and MapReduce and I have a question that is probably silly, but I still don't know the answer. For the purposes of discussion I'll assume that I'm using a standard TextInputFormat. (I don't think that this changes things too much.) To simplify (a fair bit): I want to count all the records that meet specific criteria. I would like to use MapReduce because I anticipate large sources and I want the performance and reliability that MapReduce offers.

So the obvious and simple approach is to have my Mapper check whether each record meets the criteria and emit a 0 or a 1, then use a combiner which accumulates (like a LongSumReducer) and use that as the reducer as well. I am sure that that would work fine, but it seems massive overkill to have all those 1s and 0s emitted and stored on disc.

It seems tempting to have the Mapper accumulate the count for all of the records that it sees and then emit the total just once at the end. This seems simple enough, except that the Mapper doesn't seem to have any easy way to know when it is presented with the last record.

Now I could have the Mapper keep a copy of the OutputCollector from each call to map and then do a single emit in the close method. Although this looks like it would work with the current implementation, there seems to be no guarantee that the collector is still valid at the time close is called. This just seems ugly.

Or I could have the Mapper record the first offset that it sees, read the split length using reporter.getInputSplit().getLength(), and monitor how far through the split it is, so that it can detect the last record. It looks like the MapRunner class creates one Mapper object and uses it to process a whole split, so it looks safe to store state in the mapper class between invocations of the map method. (But is this just an implementation artefact? Is the mapper class supposed to be completely stateless?)

Maybe I should have a custom InputFormat class and have it flag the last record by placing some extra information in the key? (Assuming that the InputFormat has enough information from the split to detect the last record, which seems reasonable enough.)

Is there some blessed way to do this? Or am I barking up the wrong tree, because I should really just generate all those 1s and 0s and accept the overhead?

Regards, Peter Marron Trillium Software UK Limited
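For what it's worth, a sketch of the "accumulate and emit once in close()" idea from the question, using the old org.apache.hadoop.mapred API. As the question itself notes, this leans on an implementation detail (the OutputCollector remaining usable at close() time), so treat it as an illustration rather than a guaranteed-safe pattern; the matches() predicate is a placeholder.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

public class CountingMapper extends MapReduceBase
    implements Mapper<LongWritable, Text, NullWritable, LongWritable> {

  private long count = 0;
  private OutputCollector<NullWritable, LongWritable> out;

  public void map(LongWritable key, Text value,
      OutputCollector<NullWritable, LongWritable> output, Reporter reporter)
      throws IOException {
    this.out = output;        // keep a reference for the final emit
    if (matches(value)) {
      count++;
    }
  }

  @Override
  public void close() throws IOException {
    // Single emit per mapper; relies on the collector still being valid here.
    if (out != null) {
      out.collect(NullWritable.get(), new LongWritable(count));
    }
  }

  private boolean matches(Text record) {
    return record.toString().contains("ERROR"); // placeholder criterion
  }
}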
RE: Datanode error
I am sorry, but I received an error when I sent the message to the list, and all responses went to my junk mail. So I tried to send it again, and only then noticed your emails.

> Please do also share if you're seeing an issue that you think is related to these log messages.

My datanodes do not have any big problem, but my regionservers are getting shut down by timeouts, and I think it is related to the datanodes. I have already tried a lot of different configurations but they keep crashing. I asked on the HBase list, but we could not find anything (the RSs seem healthy). We have 10 RSs and they get shut down 7 times per day. So I thought maybe you guys could find what is wrong with my system. Thanks again, Pablo

-Original Message- From: Raj Vishwanathan [mailto:rajv...@yahoo.com] Sent: sexta-feira, 20 de julho de 2012 14:38 To: common-user@hadoop.apache.org Subject: Re: Datanode error

Could also be due to network issues. The number of sockets could be too low, or the number of threads could be too low. Raj

From: Harsh J ha...@cloudera.com To: common-user@hadoop.apache.org Sent: Friday, July 20, 2012 9:06 AM Subject: Re: Datanode error

Pablo, these all seem to be timeouts from clients when they wish to read a block, and drops from clients when they try to write a block. I wouldn't think of them as critical errors. Aside from being worried that a DN is logging these, are you noticing any usability issue in your cluster? If not, I'd simply blame this on stuff like speculative tasks, region servers, general HDFS client misbehavior, etc. Please do also share if you're seeing an issue that you think is related to these log messages.
Re: Counting records
Hi, an additional idea is to use the counter API inside the framework. http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/ has a good example. Kai

On 23.07.2012, at 16:25, Peter Marron wrote: I am a complete noob with Hadoop and MapReduce and I have a question that is probably silly, but I still don't know the answer. To simplify (a fair bit) I want to count all the records that meet specific criteria. I would like to use MapReduce because I anticipate large sources and I want to get the performance and reliability that MapReduce offers.

-- Kai Voigt k...@123.org
RE: Counting records
You could just use a counter and never emit anything from the map(). Use getCounter("MyRecords", "RecordTypeToCount").increment(1) whenever you find the type of record you are looking for. Never call output.collect(), and configure the job with zero reduce tasks. When the job finishes, you can programmatically get the values of all counters, including the one you create in the map() method.

Dave Shine Sr. Software Engineer 321.939.5093 direct | 407.314.0122 mobile CI Boost(tm) Clients Outperform Online(tm) www.ciboost.com

-Original Message- From: Peter Marron [mailto:peter.mar...@trilliumsoftware.com] Sent: Monday, July 23, 2012 10:25 AM To: common-user@hadoop.apache.org Subject: Counting records
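A sketch of this counter-only approach under the old mapred API. The group and counter names follow the example above; the record criterion, paths, and class name are placeholders.

import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.RunningJob;

public class CounterOnlyCount {

  public static class CountMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, NullWritable, NullWritable> {
    public void map(LongWritable key, Text value,
        OutputCollector<NullWritable, NullWritable> output, Reporter reporter)
        throws IOException {
      if (value.toString().contains("ERROR")) { // placeholder criterion
        // Dynamic counter: plain strings, no enum declaration needed.
        reporter.getCounter("MyRecords", "RecordTypeToCount").increment(1);
      }
      // Never call output.collect().
    }
  }

  public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(CounterOnlyCount.class);
    conf.setJobName("counter-only-count");
    conf.setMapperClass(CountMapper.class);
    conf.setNumReduceTasks(0); // map-only job
    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));
    RunningJob job = JobClient.runJob(conf); // blocks until completion
    long count = job.getCounters()
        .getGroup("MyRecords").getCounter("RecordTypeToCount");
    System.out.println("Matching records: " + count);
  }
}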
Re: Datanode error
Pablo,

Perhaps you've forgotten about it, but you asked the same question last week and you did have some responses to it. Please see your earlier thread at http://search-hadoop.com/m/0BOOh17ugmD

On Mon, Jul 23, 2012 at 7:27 PM, Pablo Musa pa...@psafe.com wrote:

Hey guys, I have a cluster with 11 nodes (1 NN and 10 DNs) which is up and working. However my datanodes keep hitting the same errors, over and over. I googled the problems and tried different flags (e.g. -XX:MaxDirectMemorySize=2G) and different configs (xceivers=8192) but could not solve it. Does anyone know what the problem is and how I can solve it? (The stack trace is at the end.)

I am running: Java 1.7, Hadoop 0.20.2, HBase 0.90.6, Zoo 3.3.5

% top - shows a low load average (6% most of the time, up to 60%), already considering the number of CPUs
% vmstat - shows no swap at all
% sar - shows 75% idle CPU in the worst case

Hope you guys can help me. Thanks in advance, Pablo

2012-07-20 00:03:44,455 INFO org.apache.hadoop.hdfs.server.datanode.DataNode.clienttrace: src: /DN01:50010, dest: /DN01:43516, bytes: 396288, op: HDFS_READ, cliID: DFSClient_hb_rs_DN01,60020,1342734302945_1342734303427, offset: 54956544, srvID: DS-798921853-DN01-50010-1328651609047, blockid: blk_914960691839012728_14061688, duration: 480061254006
2012-07-20 00:03:44,455 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(DN01:50010, storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, ipcPort=50020):Got exception while serving blk_914960691839012728_14061688 to /DN01:
java.net.SocketTimeoutException: 48 millis timeout while waiting for channel to be ready for write. ch : java.nio.channels.SocketChannel[connected local=/DN01:50010 remote=/DN01:43516]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)
2012-07-20 00:03:44,455 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(DN01:50010, storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 48 millis timeout while waiting for channel to be ready for write.
ch : java.nio.channels.SocketChannel[connected local=/DN01:50010 remote=/DN01:43516]
at org.apache.hadoop.net.SocketIOWithTimeout.waitForIO(SocketIOWithTimeout.java:246)
at org.apache.hadoop.net.SocketOutputStream.waitForWritable(SocketOutputStream.java:159)
at org.apache.hadoop.net.SocketOutputStream.transferToFully(SocketOutputStream.java:198)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendChunks(BlockSender.java:397)
at org.apache.hadoop.hdfs.server.datanode.BlockSender.sendBlock(BlockSender.java:493)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.readBlock(DataXceiver.java:279)
at org.apache.hadoop.hdfs.server.datanode.DataXceiver.run(DataXceiver.java:175)
2012-07-20 00:12:11,949 INFO org.apache.hadoop.hdfs.server.datanode.DataBlockScanner: Verification succeeded for blk_4602445008578088178_5707787
2012-07-20 00:12:11,962 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: writeBlock blk_-8916344806514717841_14081066 received exception java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/DN01:36634 remote=/DN03:50010]
2012-07-20 00:12:11,962 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: DatanodeRegistration(DN01:50010, storageID=DS-798921853-DN01-50010-1328651609047, infoPort=50075, ipcPort=50020):DataXceiver
java.net.SocketTimeoutException: 63000 millis timeout while waiting for channel to be ready for read. ch : java.nio.channels.SocketChannel[connected local=/DN01:36634 remote=/DN03:50010]
at org.apache.hadoop.net.SocketIOWithTimeout.doIO(SocketIOWithTimeout.java:164)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:155)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:128)
at org.apache.hadoop.net.SocketInputStream.read(SocketInputStream.java:116)
at java.io.FilterInputStream.read(FilterInputStream.java:83)
at java.io.DataInputStream.readShort(DataInputStream.java:312)
at
Re: Counting records
Look at using a dynamic counter. You don't need to set up or declare an enum. The only caveat is that counters are passed back to the JT by each task and are stored in memory. On Jul 23, 2012, at 9:32 AM, Kai Voigt wrote: http://diveintodata.org/2011/03/15/an-example-of-hadoop-mapreduce-counter/
RE: Datanode error
I am sorry, but I received an error when I sent the message to the list, and all responses went to my junk mail. So I tried to send it again, and only then noticed your emails. Sorry!!

-Original Message- From: Harsh J [mailto:ha...@cloudera.com] Sent: segunda-feira, 23 de julho de 2012 11:07 To: common-user@hadoop.apache.org Subject: Re: Datanode error

Pablo, perhaps you've forgotten about it, but you asked the same question last week and you did have some responses to it. Please see your earlier thread at http://search-hadoop.com/m/0BOOh17ugmD
Reducer MapFileOutputFormat
If I set my reducer output format to MapFileOutputFormat and the job has, say, 100 reducers, will the output contain 100 different index files (one for each reducer) or one index file for all the reducers (basically one index file per job)? If it is one index file per reducer, can I rely on HDFS append to change the index write behavior and build one index file from all the reducers, basically by making all the parallel reducers append to one index file? The data files do not matter.
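For reference: MapFileOutputFormat writes one MapFile per reducer, i.e. each part-NNNNN is a directory with its own data and index file, so a 100-reducer job yields 100 index files. Rather than merging indexes with appends, the usual pattern is to open all the parts and route each lookup through the job's partitioner. A sketch assuming the old mapred API and Text keys/values; the paths and lookup key are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapred.MapFileOutputFormat;
import org.apache.hadoop.mapred.lib.HashPartitioner;

public class MapFileLookup {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path jobOutput = new Path(args[0]); // the reducers' output directory
    // One MapFile.Reader per part-NNNNN, i.e. per reducer.
    MapFile.Reader[] readers = MapFileOutputFormat.getReaders(fs, jobOutput, conf);
    Text key = new Text(args[1]);
    Text value = new Text();
    // The same partitioner the job used picks the right part for the key.
    Writable result = MapFileOutputFormat.getEntry(
        readers, new HashPartitioner<Text, Text>(), key, value);
    System.out.println(result == null ? "not found" : value.toString());
    for (MapFile.Reader reader : readers) {
      reader.close();
    }
  }
}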
AUTO: Yuan Jin is out of the office. (returning 07/25/2012)
I am out of the office until 07/25/2012. I am out of office.
For HAMSTER related things, you can contact Jason (Deng Peng Zhou/China/IBM)
For CFM related things, you can contact Daniel (Liang SH Su/China/Contr/IBM)
For TMB related things, you can contact Flora (Jun Ying Li/China/IBM)
For TWB related things, you can contact Kim (Yuan SH Jin/China/IBM)
For others, I will reply to you when I am back.
Note: This is an automated response to your message "Reducer MapFileOutputFormat" sent on 24/07/2012 4:09:51. This is the only notification you will receive while this person is away.
Re: AUTO: Yuan Jin is out of the office. (returning 07/25/2012)
Fifth offense.

Yuan Jin is out of the office. - I will be out of the office starting 06/22/2012 and will not return until 06/25/2012. I am out of Jun 21
Yuan Jin is out of the office. - I will be out of the office starting 04/13/2012 and will not return until 04/16/2012. I am out of Apr 12
Yuan Jin is out of the office. - I will be out of the office starting 04/02/2012 and will not return until 04/05/2012. I am out of Apr 2
Yuan Jin is out of the office. - I will be out of the office starting 02/17/2012 and will not return until 02/20/2012. I am out of Feb 16

On Mon, Jul 23, 2012 at 1:09 PM, Yuan Jin jiny...@cn.ibm.com wrote: I am out of the office until 07/25/2012.
RE: Counting records
Yeah, I thought about using counters, but I was worried about what happens if a Mapper task fails. Does the counter get adjusted to remove any contributions that the failed Mapper made before a replacement Mapper is started? Otherwise, in the case of any Mapper failure, I'm going to get an overcount, am I not? Or is there some way to make sure that counters have the correct semantics in the face of failures?

Peter Marron

-Original Message- From: Dave Shine [mailto:Dave.Shine@channelintelligence.com] Sent: 23 July 2012 15:35 To: common-user@hadoop.apache.org Subject: RE: Counting records

You could just use a counter and never emit anything from the map().
Re: AUTO: Yuan Jin is out of the office. (returning 07/25/2012)
Just kick this junk mail guy out of the group.

On Mon, Jul 23, 2012 at 5:22 PM, Jean-Daniel Cryans jdcry...@apache.org wrote: Fifth offense.
Re: Counting records
If the task fails, the counter for that task is not used. So if you have speculative execution turned on and the JT kills a task, it won't affect your end results. Again, the only major caveat is that the counters are held in memory, so be careful if you have a lot of counters...

On Jul 23, 2012, at 4:52 PM, Peter Marron wrote: Yeah, I thought about using counters but I was worried about what happens if a Mapper task fails. Does the counter get adjusted to remove any contributions that the failed Mapper made before a replacement Mapper is started?
Re: AUTO: Yuan Jin is out of the office. (returning 07/25/2012)
Guys, just be nice.

On Tue, Jul 24, 2012 at 5:59 AM, Chen He airb...@gmail.com wrote: Just kick this junk mail guy out of the group.

-- Regards, Hao Tian
int read(byte buf[], int off, int len) violates API-level contract when length is 0 at the end of a stream
The API contract on Java's public int read(byte[] b, int off, int len) says: "If len is zero, then no bytes are read and 0 is returned; otherwise, there is an attempt to read at least one byte. If no byte is available because the stream is at end of file, the value -1 is returned; otherwise, at least one byte is read and stored into b." DFSInputStream in hadoop 1 and 2 returns 0 when len is 0, as long as the position is not currently at the end of the stream. But when the position is at the end of the stream and len is 0, read returns -1 instead of 0, because pos == getFileLength() makes the method skip straight to the final return:

/**
 * Read the entire buffer.
 */
@Override
public synchronized int read(byte buf[], int off, int len) throws IOException {
  checkOpen();
  if (closed) {
    throw new IOException("Stream closed");
  }
  failures = 0;
  if (pos < getFileLength()) {
    int retries = 2;
    while (retries > 0) {
      try {
        if (pos > blockEnd) {
          currentNode = blockSeekTo(pos);
        }
        int realLen = Math.min(len, (int) (blockEnd - pos + 1));
        int result = readBuffer(buf, off, realLen);
        if (result >= 0) {
          pos += result;
        } else {
          // got a EOS from reader though we expect more data on it.
          throw new IOException("Unexpected EOS from the reader");
        }
        if (stats != null && result != -1) {
          stats.incrementBytesRead(result);
        }
        return result;
      } catch (ChecksumException ce) {
        throw ce;
      } catch (IOException e) {
        if (retries == 1) {
          LOG.warn("DFS Read: " + StringUtils.stringifyException(e));
        }
        blockEnd = -1;
        if (currentNode != null) {
          addToDeadNodes(currentNode);
        }
        if (--retries == 0) {
          throw e;
        }
      }
    }
  }
  return -1;
}
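A small sketch to reproduce the reported behavior: open a file on HDFS (so the stream underneath is a DFSInputStream), seek to end-of-file, and issue a zero-length read. Per the InputStream contract quoted above this should return 0, but the code above falls through to the final return -1. Assumes the default filesystem is HDFS; the path comes from the command line.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ZeroLengthReadAtEof {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    Path path = new Path(args[0]);
    long fileLen = fs.getFileStatus(path).getLen();
    FSDataInputStream in = fs.open(path);
    in.seek(fileLen);                   // now pos == getFileLength()
    int n = in.read(new byte[0], 0, 0); // contract says this is 0
    System.out.println(n);              // prints -1 on the affected versions
    in.close();
  }
}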
Re: AUTO: Yuan Jin is out of the office. (returning 07/25/2012)
Looks like that guy is your boss, Jason. It was you who asked people to forgive him last time. Tell him to remove the group mailing list from his auto-email system. It looks like this Yuan has contributed little to the mailing list apart from the spam auto-emails.

On Mon, Jul 23, 2012 at 6:12 PM, Jason tianxin...@gmail.com wrote: Guys, just be nice
Re: AUTO: Yuan Jin is out of the office. (returning 07/25/2012)
BTW, this is a Hadoop user group. You are welcome to ask questions and give solutions to help people. Please do not pollute this technical environment.

To Yuan Jin: DO NOT send your auto-email to my personal mailbox again. It is not fun, just rude. We will still respect you if you stop sending this type of auto-email to our technical mailing list and at least say "Excuse me" to everyone on the list.

On Mon, Jul 23, 2012 at 9:31 PM, Chen He airb...@gmail.com wrote: Looks like that guy is your boss, Jason.
problem configuring hadoop with s3 bucket
Hello Group, I have Hadoop set up and running locally. Now I want to use Amazon s3://mybucket as my data store, so I changed dfs.data.dir=s3://mybucket/hadoop/ in my hdfs-site.xml. Is that the correct way? I'm getting this error:

WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid directory in dfs.data.dir: can not create directory: s3://mybucket/hadoop
2012-07-23 13:15:06,260 ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: All directories in dfs.data.dir are invalid.

And when I changed it to dfs.data.dir=s3://mybucket/ I got this error:

ERROR org.apache.hadoop.hdfs.server.datanode.DataNode: java.lang.IllegalArgumentException: Wrong FS: s3://mybucket/, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:381)
at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:55)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:393)
at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:251)
at org.apache.hadoop.util.DiskChecker.mkdirsWithExistsAndPermissionCheck(DiskChecker.java:146)
at org.apache.hadoop.util.DiskChecker.checkDir(DiskChecker.java:162)
at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:1574)
at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:1521)
at org.apache.hadoop.hdfs.server.datanode.DataNode.createDataNode(DataNode.java:1539)
at org.apache.hadoop.hdfs.server.datanode.DataNode.secureMain(DataNode.java:1665)
at org.apache.hadoop.hdfs.server.datanode.DataNode.main(DataNode.java:1682)

Also, when I set fs.default.name=s3://mybucket, the Namenode does not come up, failing with: ERROR org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException (in any case I want to run the namenode locally, so I reverted it back to hdfs://localhost:9000). Your help is highly appreciated! Thanks -- Alok Kumar
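If I recall the 0.20-era S3 support correctly, dfs.data.dir must name local directories (it is where a datanode keeps its block files), so it cannot point at an s3:// URI, which is why the DataNode rejects it. To use a bucket as the store, you instead make the S3 block filesystem the default filesystem in core-site.xml and do not run the HDFS daemons at all (so the NameNode BindException never arises). A sketch, with placeholder bucket name and credentials:

<property>
  <name>fs.default.name</name>
  <value>s3://mybucket</value>
</property>
<property>
  <name>fs.s3.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>

Note that s3:// (the block filesystem) stores data in a Hadoop-specific block format, while s3n:// is the native filesystem for reading and writing ordinary S3 objects.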