Nice find, zhoushuaifeng :) I suggest raising an issue for 0.94.
Jieshan.
________________________________________
From: 周帅锋 [zhoushuaif...@gmail.com]
Sent: Thursday, December 04, 2014 6:01 PM
To: dev
Subject: Re: split failed caused by FileNotFoundException

I rechecked the code in 0.98. There, this problem is solved by checking the
store object in the compaction runner and cancelling the compaction if the
store has been replaced.

HRegion.compact:

    byte[] cf = Bytes.toBytes(store.getColumnFamilyName());
    if (stores.get(cf) != store) {
      LOG.warn("Store " + store.getColumnFamilyName() + " on region " + this
          + " has been re-instantiated, cancel this compaction request. "
          + " It may be caused by the roll back of split transaction");
      return false;
    }

But would it be better to replace the store object with the new one and
continue the compaction on that store, instead of cancelling?

2014-12-04 15:00 GMT+08:00 周帅锋 <zhoushuaif...@gmail.com>:

> In our HBase clusters, splits sometimes fail because a file to be split
> does not exist in the parent region. In 0.94.2, this causes the
> regionserver to shut down, because the split transaction has already
> reached the PONR (point-of-no-return) state. In 0.94.20 or 0.98, the split
> fails but can be rolled back, because the split transaction has only
> reached the offlined_parent state.
>
> In 0.94.2, the error looks like this:
> 2014-09-23 22:27:55,710 INFO org.apache.hadoop.hbase.catalog.MetaEditor: Offlined parent region xxxxx in META
> 2014-09-23 22:27:55,820 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of xxxxx
> Caused by: java.io.IOException: java.io.IOException: java.io.FileNotFoundException: File does not exist: xxxxx
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does not exist: xxxxx
> Caused by: java.io.FileNotFoundException: File does not exist: xxxxx
> 2014-09-23 22:27:55,823 FATAL org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server xxx,60020,1411383568857: Abort; we got an error after point-of-no-return
>
> The reason for the missing files is a little complex. The whole procedure
> involves two failed splits and one compaction:
> 1) There are too many files in the region and a compaction is requested,
> but it is not executed because there are many CompactionRequests
> (compactionRunners) in the compaction queue. The CompactionRequest holds
> the Store object, and also holds the list of that store's storefiles to
> compact.
>
> 2) The region is big enough and a split is requested. The region is
> offline during the split and the store is closed, but the split fails
> while splitting the files (for some reason such as busy I/O causing a
> timeout):
> 2014-09-23 18:26:02,738 INFO org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup of failed split of xxxxx; Took too long to split the files and create the references, aborting split
>
> 3) The split is rolled back successfully and the region is online again.
> During the rollback, a new Store object is created, but the old store
> referenced from the compaction queue is not removed, so there are two (or
> maybe more) Store objects in the regionserver.
>
> 4) The compaction requested earlier on the region's store now runs. Some
> storefiles are compacted and removed, and new, bigger storefiles are
> created. But the store that was re-initialized during the split rollback
> does not know about this change to the storefiles.
>
> 5) A split of the region runs again and fails again, because the
> storefiles in the parent region no longer exist (they were removed by the
> compaction). The split transaction also does not know about the new file
> created by the compaction.
> In 0.94.2, this error is not detected until the daughter regions are
> opened, but by then it is too late: the split fails in the PONR state,
> which causes the regionserver to shut down. In 0.94.20 and 0.98,
> splitStoreFiles looks into the storefiles in the parent region and can
> detect the error before PONR, so the failed split can be rolled back.
> Code in HRegionFileSystem.splitStoreFile:
> ...
> byte[] lastKey = f.createReader().getLastKey();
>
> So this situation is a fatal error in early 0.94 versions, and still a
> common bug in later 0.94 and higher versions. It is also the reason why
> the storefile reader is sometimes null (closed by the first failed split).
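
The sequence in steps 1) to 5), together with the identity check quoted from 0.98, can be illustrated with a small self-contained Java sketch. This is not HBase code: Store, CompactionRequest, Region, rollbackSplit, missingFiles, run and the file names are all hypothetical stand-ins invented for illustration, and the "disk" is just an in-memory set. Only the `stores.get(...) != store` comparison mirrors the HRegion.compact snippet quoted above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class StaleStoreDemo {
    // Hypothetical stand-in for a Store: a family name and the files it believes it owns.
    static final class Store {
        final String family;
        final List<String> files;
        Store(String family, List<String> files) {
            this.family = family;
            this.files = new ArrayList<>(files);
        }
    }

    // Hypothetical stand-in for a queued compaction: it captures the Store object
    // and its file list at request time (step 1).
    static final class CompactionRequest {
        final Store store;
        final List<String> filesToCompact;
        CompactionRequest(Store store) {
            this.store = store;
            this.filesToCompact = new ArrayList<>(store.files);
        }
    }

    // Hypothetical stand-in for a region: the live stores plus a simulated file system.
    static final class Region {
        final Map<String, Store> stores = new HashMap<>();
        final Set<String> disk = new HashSet<>();

        Region(String family, List<String> files) {
            stores.put(family, new Store(family, files));
            disk.addAll(files);
        }

        // Steps 2-3: a failed split is rolled back and a fresh Store object replaces the old one.
        void rollbackSplit(String family) {
            stores.put(family, new Store(family, stores.get(family).files));
        }

        // Step 4: run a queued compaction. With checkIdentity, a request that holds
        // a replaced Store is cancelled (returns false), like the 0.98 check.
        boolean compact(CompactionRequest req, boolean checkIdentity) {
            if (checkIdentity && stores.get(req.store.family) != req.store) {
                return false;
            }
            disk.removeAll(req.filesToCompact);
            disk.add("compacted");
            // Only the Store held by the request learns about the new file.
            req.store.files.removeAll(req.filesToCompact);
            req.store.files.add("compacted");
            return true;
        }

        // Step 5: files the live Store would hand to the next split but that are gone.
        List<String> missingFiles(String family) {
            List<String> missing = new ArrayList<>();
            for (String f : stores.get(family).files) {
                if (!disk.contains(f)) {
                    missing.add(f);
                }
            }
            return missing;
        }
    }

    // Replays steps 1-5 and returns the files the second split would fail on.
    public static List<String> run(boolean checkIdentity) {
        Region region = new Region("cf", Arrays.asList("f1", "f2"));
        CompactionRequest queued = new CompactionRequest(region.stores.get("cf"));
        region.rollbackSplit("cf");
        region.compact(queued, checkIdentity);
        return region.missingFiles("cf");
    }

    public static void main(String[] args) {
        System.out.println("without check: " + run(false));
        System.out.println("with check:    " + run(true));
    }
}
```

In this toy model, the run without the check leaves the live store listing f1 and f2 although both were removed by the stale compaction, which corresponds to the FileNotFoundException seen by the second split. The run with the check cancels the stale request and reports no missing files.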