Maybe the following is related?

11/11/08 18:50:04 WARN hdfs.DFSClient: DataStreamer Exception: java.io.IOException: File /hbase/splitlog/domU-12-31-39-09-E8-31.compute-1.internal,60020,1320792889412_hdfs%3A%2F%2Fip-10-46-114-25.ec2.internal%3A17020%2Fhbase%2F.logs%2Fip-10-245-191-239.ec2.internal%2C60020%2C1320792860210-splitting%2Fip-10-245-191-239.ec2.internal%252C60020%252C1320792860210.1320796004063/TestLoadAndVerify_1320795370905/d76a246e81525444beeea99200b3e9a4/recovered.edits/0000000000000048149 could only be replicated to 0 nodes, instead of 1
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1646)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:829)
        at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
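The "could only be replicated to 0 nodes, instead of 1" message comes from the NameNode when it cannot find even one DataNode it considers eligible for a new block, which usually means the DNs are reporting little or no usable remaining space (or are being excluded for some other reason). Before digging further it may be worth dumping what each DN is actually reporting to the NameNode. Below is a minimal sketch against the HDFS client API, roughly equivalent to running `hadoop dfsadmin -report`; the class name is just a placeholder and it assumes the cluster's core-site.xml/hdfs-site.xml are on the classpath:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.hdfs.DistributedFileSystem;
    import org.apache.hadoop.hdfs.protocol.DatanodeInfo;

    // Hypothetical helper: prints per-DataNode capacity as the NameNode sees it.
    public class DnSpaceReport {
      public static void main(String[] args) throws Exception {
        // Picks up fs.default.name and the rest of the cluster config from the classpath.
        Configuration conf = new Configuration();
        DistributedFileSystem dfs = (DistributedFileSystem) FileSystem.get(conf);
        long mb = 1024L * 1024L;
        for (DatanodeInfo dn : dfs.getDataNodeStats()) {
          System.out.println(dn.getName()
              + " capacity=" + (dn.getCapacity() / mb) + "MB"
              + " used=" + (dn.getDfsUsed() / mb) + "MB"
              + " remaining=" + (dn.getRemaining() / mb) + "MB");
        }
      }
    }

If every DN shows remaining space near zero (or below the block size), that alone would explain the DataStreamer exception above and the aborted split tasks below.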
On Tue, Nov 8, 2011 at 4:10 PM, Roman Shaposhnik <[email protected]> wrote:
> +Konstantin (there's something weird in append handling)
>
> Some more updates. Hope this will help. I had this hunch that I was seeing
> those weird issues when HDFS DN was at 80% capacity (but nowhere near full!).
> So I quickly spun off a cluster that had 5 DNs with modest (and unbalanced!)
> amount of storage. Here's what started happening towards the end of loading
> 2M records into HBase:
>
> On the master:
>
> {"statustimems":-1,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=3","starttimems":1320796207862,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"RUNNING","statetimems":-1},
> {"statustimems":1320796275317,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=1","starttimems":1320796206563,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},
> {"statustimems":1320796275317,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=2","starttimems":1320796205304,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},
> {"statustimems":1320796275317,"status":"Waiting for distributed tasks to finish. scheduled=4 done=0 error=3","starttimems":1320796203957,"description":"Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317}]
>
> 11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=3, state=RUNNING, startTime=1320796203957, completionTime=-1 appears to have been leaked
> 11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=2, state=RUNNING, startTime=1320796205304, completionTime=-1 appears to have been leaked
> 11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish. scheduled=4 done=0 error=1, state=RUNNING, startTime=1320796206563, completionTime=-1 appears to have been leaked
>
> And the behavior on the DNs was even weirder. I'm attaching a log from one of the DNs.
> The last exception is a shocker to me:
>
> 11/11/08 18:51:07 WARN regionserver.SplitLogWorker: log splitting of hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063 failed, returning error
> java.io.IOException: Failed to open hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063 for append
>
> But perhaps it is cascading from some of the earlier ones.
>
> Anyway, take a look at the attached log.
>
> Now, this is a tricky issue to reproduce. Just before it started failing again
> I had a completely clean run over here:
>
> http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/33/testReport/
>
> Which makes me believe it is NOT configuration related.
>
> Thanks,
> Roman.
>
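On the "Failed to open ... for append" error: HBase's distributed log splitting opens the region server's WAL for append/sync during recovery, and on HDFS builds of this era that path is commonly gated by a config flag. I realize Roman believes this is not configuration related (and the clean smoke-test run supports that), but it may still be worth ruling out the two settings below, since the reserved-space one can also make a DataNode refuse new blocks well before its disks report full, which would fit the "80% capacity but nowhere near full" hunch. A hedged sketch of the hdfs-site.xml entries to double-check; the values shown are illustrative, not recommendations:

    <!-- hdfs-site.xml: settings worth verifying; values here are only examples -->
    <property>
      <name>dfs.support.append</name>
      <value>true</value>
      <!-- WAL recovery needs working append/sync; if this is disabled, the
           "Failed to open ... for append" path has no chance of succeeding. -->
    </property>
    <property>
      <name>dfs.datanode.du.reserved</name>
      <value>1073741824</value>
      <!-- Bytes per volume the DataNode keeps free for non-DFS use; on small,
           unbalanced disks a large value can trigger "replicated to 0 nodes"
           long before the volume is actually full. -->
    </property>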
