Another thing that can happen: if you use spot instances, they can be taken back by AWS at any time. We had clusters in us-west-1 last week that were abruptly terminated without notice like this. (We use on-demand masters and spot slaves; only the masters remained running... several times last week...)
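
If it helps, a quick way to confirm that the slaves really dropped out from under HDFS (and not just the region server processes) is to ask the NameNode for its DataNode report, which is essentially what "hadoop dfsadmin -report" prints. A rough sketch in Java; the class name and the MB formatting are just mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

// Lists the DataNodes the NameNode still considers registered, so a
// reclaimed spot slave shows up as a missing entry right away.
public class LiveDataNodeCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      System.err.println("Not an HDFS filesystem: " + fs.getUri());
      return;
    }
    DatanodeInfo[] nodes = ((DistributedFileSystem) fs).getDataNodeStats();
    System.out.println("DataNodes currently registered: " + nodes.length);
    for (DatanodeInfo dn : nodes) {
      System.out.println("  " + dn.getHostName()
          + "  remaining=" + (dn.getRemaining() / (1024 * 1024)) + " MB");
    }
  }
}

Running that right after one of these incidents should show only the on-demand nodes left, if spot reclamation is indeed what happened.
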
Best regards,
- Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via Tom White)

----- Original Message -----
> From: Ted Yu <[email protected]>
> To: [email protected]
> Cc:
> Sent: Tuesday, November 8, 2011 7:20 PM
> Subject: Re: HBase 0.92/Hadoop 0.22 test results
>
> Maybe the following is related?
>
> 11/11/08 18:50:04 WARN hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: File
> /hbase/splitlog/domU-12-31-39-09-E8-31.compute-1.internal,60020,1320792889412_hdfs%3A%2F%2Fip-10-46-114-25.ec2.internal%3A17020%2Fhbase%2F.logs%2Fip-10-245-191-239.ec2.internal%2C60020%2C1320792860210-splitting%2Fip-10-245-191-239.ec2.internal%252C60020%252C1320792860210.1320796004063/TestLoadAndVerify_1320795370905/d76a246e81525444beeea99200b3e9a4/recovered.edits/0000000000000048149
> could only be replicated to 0 nodes, instead of 1
>     at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1646)
>     at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:829)
>     at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
>
> On Tue, Nov 8, 2011 at 4:10 PM, Roman Shaposhnik <[email protected]> wrote:
>
>> +Konstantin (there's something weird in append handling)
>>
>> Some more updates. Hope this will help. I had this hunch that
>> I was seeing those weird issues when the HDFS DN was at 80%
>> capacity (but nowhere near full!). So I quickly spun off a cluster
>> that had 5 DNs with a modest (and unbalanced!) amount of
>> storage. Here's what started happening towards the end of
>> loading 2M records into HBase:
>>
>> On the master:
>>
>> {"statustimems":-1,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=3","starttimems":1320796207862,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"RUNNING","statetimems":-1},
>> {"statustimems":1320796275317,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=1","starttimems":1320796206563,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},
>> {"statustimems":1320796275317,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=2","starttimems":1320796205304,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},
>> {"statustimems":1320796275317,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=3","starttimems":1320796203957,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317}]
>>
>> 11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in
>> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]:
>> status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=3,
>> state=RUNNING, startTime=1320796203957, completionTime=-1
>> appears to have been leaked
>> 11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in
>> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]:
>> status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=2,
>> state=RUNNING, startTime=1320796205304, completionTime=-1
>> appears to have been leaked
>> 11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in
>> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]:
>> status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=1,
>> state=RUNNING, startTime=1320796206563, completionTime=-1
>> appears to have been leaked
>>
>> And the behavior on the DNs was even weirder. I'm attaching a log
>> from one of the DNs. The last exception is a shocker to me:
>>
>> 11/11/08 18:51:07 WARN regionserver.SplitLogWorker: log splitting of
>> hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063
>> failed, returning error
>> java.io.IOException: Failed to open
>> hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063
>> for append
>>
>> But perhaps it is cascading from some of the earlier ones.
>>
>> Anyway, take a look at the attached log.
>>
>> Now, this is a tricky issue to reproduce. Just before it started failing
>> again I had a completely clean run over here:
>>
>> http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/33/testReport/
>>
>> Which makes me believe it is NOT configuration related.
>>
>> Thanks,
>> Roman.
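
On the "could only be replicated to 0 nodes, instead of 1" error and the 80% hunch: the NameNode reports that when it can't find any DataNode it considers a valid target for the block. In my experience that is commonly allocatable space (the remaining space each DN reports already has dfs.datanode.du.reserved subtracted per volume), but it can also be exhausted xceiver slots or dead/excluded nodes, so a DN that looks far from full can still be rejected. A rough way to eyeball the numbers from the client side; note it only reads the hdfs-site.xml visible to the client (the DNs use their own copies), and the class name is just mine:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FsStatus;

// Prints the aggregate capacity the NameNode reports next to the per-volume
// reserve configured locally, to see how much space is actually allocatable.
public class CapacityCheck {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    FsStatus status = fs.getStatus();           // capacity/used/remaining as seen by the NameNode
    long reservedPerVolume = conf.getLong("dfs.datanode.du.reserved", 0L);
    System.out.println("capacity=" + (status.getCapacity() >> 20) + " MB"
        + " used=" + (status.getUsed() >> 20) + " MB"
        + " remaining=" + (status.getRemaining() >> 20) + " MB");
    System.out.println("dfs.datanode.du.reserved=" + (reservedPerVolume >> 20)
        + " MB per volume");
  }
}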

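Also, the "Failed to open ... for append" out of SplitLogWorker looks like the lease-recovery step of log splitting: before a worker can read a WAL that a dead region server was still writing, the old writer's HDFS lease has to be taken over, and as far as I remember HBase does that either through recoverLease or, on older HDFS, by briefly opening the file for append. Just to illustrate the step (assuming this HDFS build exposes DistributedFileSystem#recoverLease; the retry loop and timeout below are made up):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hdfs.DistributedFileSystem;

// Polls recoverLease() until HDFS releases the dead writer's lease and
// finalizes the last block, after which the WAL can be opened safely.
public class WalLeaseRecovery {
  public static void recoverLease(Configuration conf, Path wal) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    if (!(fs instanceof DistributedFileSystem)) {
      return;                                   // nothing to recover on non-HDFS filesystems
    }
    DistributedFileSystem dfs = (DistributedFileSystem) fs;
    long deadline = System.currentTimeMillis() + 60000L;   // arbitrary 60s budget
    while (!dfs.recoverLease(wal)) {
      if (System.currentTimeMillis() > deadline) {
        throw new IOException("Gave up recovering lease on " + wal);
      }
      Thread.sleep(1000L);
    }
  }
}

If append/lease handling in this Hadoop 0.22 build is off, as Roman suspects, that is the step where it would show up, so the NameNode log around those timestamps might be worth a look.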