Another thing that can happen: if you use spot instances, AWS can take them 
back at any time. We had clusters in us-west-1 last week that were abruptly 
terminated this way, without notice. (We use an on-demand master and spot 
slaves; only the masters remained running... several times last 
week...)
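
A minimal sketch of how a spot slave could watch for reclamation, assuming
the EC2 metadata service in your environment exposes the spot
termination-time key (purely illustrative, not something this thread relies
on):

import time
import urllib.request

# Documented EC2 instance metadata key; it returns HTTP 200 only once AWS
# has scheduled this spot instance for termination, 404 otherwise.
METADATA_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"

def spot_termination_pending(timeout=2):
    """Return True if a spot termination notice has been posted."""
    try:
        with urllib.request.urlopen(METADATA_URL, timeout=timeout) as resp:
            return resp.status == 200
    except Exception:
        # 404 (no notice yet) or a transient metadata-service error.
        return False

if __name__ == "__main__":
    # Poll every few seconds; on notice, kick off whatever drain/decommission
    # script you use (hypothetical hook, not part of the original setup).
    while not spot_termination_pending():
        time.sleep(5)
    print("Spot termination notice received; drain this slave now.")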
 
Best regards,


    - Andy

Problems worthy of attack prove their worth by hitting back. - Piet Hein (via 
Tom White)


----- Original Message -----
> From: Ted Yu <[email protected]>
> To: [email protected]
> Cc: 
> Sent: Tuesday, November 8, 2011 7:20 PM
> Subject: Re: HBase 0.92/Hadoop 0.22 test results
> 
> Maybe the following is related?
> 
> 11/11/08 18:50:04 WARN hdfs.DFSClient: DataStreamer Exception:
> java.io.IOException: File
> /hbase/splitlog/domU-12-31-39-09-E8-31.compute-1.internal,60020,1320792889412_hdfs%3A%2F%2Fip-10-46-114-25.ec2.internal%3A17020%2Fhbase%2F.logs%2Fip-10-245-191-239.ec2.internal%2C60020%2C1320792860210-splitting%2Fip-10-245-191-239.ec2.internal%252C60020%252C1320792860210.1320796004063/TestLoadAndVerify_1320795370905/d76a246e81525444beeea99200b3e9a4/recovered.edits/0000000000000048149
> could only be replicated to 0 nodes, instead of 1
>   at
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getAdditionalBlock(FSNamesystem.java:1646)
>   at
> org.apache.hadoop.hdfs.server.namenode.NameNode.addBlock(NameNode.java:829)
>   at sun.reflect.GeneratedMethodAccessor12.invoke(Unknown Source)
> 
> On Tue, Nov 8, 2011 at 4:10 PM, Roman Shaposhnik <[email protected]> wrote:
> 
>>  +Konstantin (there's something weird in append handling)
>> 
>>  Some more updates. Hope this will help. I had this hunch that
>>  I was seeing those weird issues when an HDFS DN was at 80%
>>  capacity (but nowhere near full!). So I quickly spun up a cluster
>>  that had 5 DNs with a modest (and unbalanced!) amount of
>>  storage. Here's what started happening towards the end of
>>  loading 2M records into HBase:
>> 
>>  On the master:
>> 
>>  {"statustimems":-1,"status":"Waiting for 
> distributed tasks to finish.
>>  scheduled=4 done=0
>> 
> error=3","starttimems":1320796207862,"description":"Doing 
> distributed
>>  log split in
>> 
> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"RUNNING","statetimems":-1},{"statustimems":1320796275317,"status":"Waiting
>>  for distributed tasks to finish. scheduled=4 done=0
>> 
> error=1","starttimems":1320796206563,"description":"Doing 
> distributed
>>  log split in
>> 
> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},{"statustimems":1320796275317,"status":"Waiting
>>  for distributed tasks to finish. scheduled=4 done=0
>> 
> error=2","starttimems":1320796205304,"description":"Doing 
> distributed
>>  log split in
>> 
> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317},{"statustimems":1320796275317,"status":"Waiting
>>  for distributed tasks to finish. scheduled=4 done=0
>> 
> error=3","starttimems":1320796203957,"description":"Doing 
> distributed
>>  log split in
>> 
> [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]","state":"ABORTED","statetimems":1320796275317}]
>> 
>>  11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish.  scheduled=4 done=0 error=3, state=RUNNING, startTime=1320796203957, completionTime=-1 appears to have been leaked
>>  11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish.  scheduled=4 done=0 error=2, state=RUNNING, startTime=1320796205304, completionTime=-1 appears to have been leaked
>>  11/11/08 18:51:15 WARN monitoring.TaskMonitor: Status Doing distributed log split in [hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting]: status=Waiting for distributed tasks to finish.  scheduled=4 done=0 error=1, state=RUNNING, startTime=1320796206563, completionTime=-1 appears to have been leaked
>> 
>>  And the behavior on the DNs was even weirder. I'm attaching a log
>>  from one of the DNs. The last exception is a shocker to me:
>> 
>>  11/11/08 18:51:07 WARN regionserver.SplitLogWorker: log splitting of hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063 failed, returning error
>>  java.io.IOException: Failed to open hdfs://ip-10-46-114-25.ec2.internal:17020/hbase/.logs/ip-10-245-191-239.ec2.internal,60020,1320792860210-splitting/ip-10-245-191-239.ec2.internal%2C60020%2C1320792860210.1320796004063 for append
>> 
>>  But perhaps it is cascading from some of the earlier ones.
>> 
>>  Anyway, take a look at the attached log.
>> 
>>  Now, this is a tricky issue to reproduce. Just before it started failing
>>  again I had a completely clean run over here:
>> 
>> 
> http://bigtop01.cloudera.org:8080/view/Hadoop%200.22/job/Bigtop-trunk-smoketest-22/33/testReport/
>> 
>>  Which makes me believe it is NOT configuration related.
>> 
>>  Thanks,
>>  Roman.
>> 
>
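
For anyone poking at the "could only be replicated to 0 nodes, instead of 1"
error together with Roman's 80%-capacity hunch above, a rough sketch for
dumping per-DataNode usage, assuming the classic "hadoop dfsadmin -report"
output with "Name:" and "DFS Used%:" lines (launcher and field names may
differ between Hadoop versions):

import subprocess

# Ask the NameNode for the standard capacity report; on some installs the
# launcher is "hdfs dfsadmin -report" instead of "hadoop dfsadmin -report".
report = subprocess.run(
    ["hadoop", "dfsadmin", "-report"],
    capture_output=True, text=True, check=True,
).stdout

node = None
for line in report.splitlines():
    line = line.strip()
    if line.startswith("Name:"):
        node = line.split(":", 1)[1].strip()
    elif line.startswith("DFS Used%:") and node:
        # The "and node" guard skips the cluster-wide summary, which is
        # printed before any per-node "Name:" line.
        print(node, "DFS Used% =", line.split(":", 1)[1].strip())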
