Nice find, zhoushuaifeng:)

I suggest raising an issue for 0.94.

Jieshan.
________________________________________
From: 周帅锋 [zhoushuaif...@gmail.com]
Sent: Thursday, December 04, 2014 6:01 PM
To: dev
Subject: Re: split failed caused by FileNotFoundException

I rechecked the code in 0.98. This problem is solved there by checking the store
object in the compaction runner and cancelling the compaction if the store has
been re-instantiated.
HRegion.compact:

      byte[] cf = Bytes.toBytes(store.getColumnFamilyName());
      if (stores.get(cf) != store) {
        LOG.warn("Store " + store.getColumnFamilyName() + " on region " + this
            + " has been re-instantiated, cancel this compaction request. "
            + " It may be caused by the roll back of split transaction");
        return false;
      }


But would it be better to replace the store object with the new one and continue
the compaction on that store, instead of cancelling?
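
To make the check above concrete, here is a minimal toy sketch of it. ToyStore
and ToyRegion are hypothetical stand-ins, not HBase classes, and the map uses
String keys instead of byte[] for simplicity. It only shows how a queued request
that still holds the old store object is rejected once the store has been
re-created, as happens when a split is rolled back.

```java
import java.util.HashMap;
import java.util.Map;

public class StaleStoreCheck {
    // Hypothetical stand-in for an HBase Store; only object identity matters here.
    static final class ToyStore {
        final String family;
        ToyStore(String family) { this.family = family; }
    }

    // Hypothetical stand-in for a region mapping column family -> current store.
    static final class ToyRegion {
        private final Map<String, ToyStore> stores = new HashMap<>();

        // Creates a fresh store instance, replacing any previous one.
        void openStore(String family) {
            stores.put(family, new ToyStore(family));
        }

        ToyStore getStore(String family) {
            return stores.get(family);
        }

        // Same idea as the identity check quoted above: proceed only if the
        // store held by the request is still the region's current instance.
        boolean compact(ToyStore requested) {
            if (stores.get(requested.family) != requested) {
                return false; // store was re-instantiated; cancel the request
            }
            return true; // a real compaction would run here
        }
    }

    public static void main(String[] args) {
        ToyRegion region = new ToyRegion();
        region.openStore("cf");
        ToyStore queued = region.getStore("cf"); // held by a queued request

        System.out.println(region.compact(queued)); // true: store still current

        region.openStore("cf"); // simulates the rollback re-creating the store

        System.out.println(region.compact(queued)); // false: stale reference
    }
}
```

In this toy model, continuing with the new store instead of cancelling would
also mean re-selecting the files to compact, since the queued request holds a
storefile list chosen from the old store.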


2014-12-04 15:00 GMT+08:00 周帅锋 <zhoushuaif...@gmail.com>:

> In our HBase clusters, a split sometimes fails because the file to be
> split does not exist in the parent region. In 0.94.2, this causes a
> regionserver shutdown because the split transaction has reached the PONR
> (point of no return) state. In 0.94.20 or 0.98, the split fails but can be
> rolled back, because the split transaction has only reached the
> offlined_parent state.
>
> In 0.94.2, the error looks like this:
> 2014-09-23 22:27:55,710 INFO org.apache.hadoop.hbase.catalog.MetaEditor:
> Offlined parent region xxxxx in META
> 2014-09-23 22:27:55,820 INFO
> org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup
> of failed split of xxxxx
> Caused by: java.io.IOException: java.io.IOException:
> java.io.FileNotFoundException: File does not exist: xxxxx
> Caused by: java.io.IOException: java.io.FileNotFoundException: File does
> not exist: xxxxx
> Caused by: java.io.FileNotFoundException: File does not exist: xxxxx
> 2014-09-23 22:27:55,823 FATAL
> org.apache.hadoop.hbase.regionserver.HRegionServer: ABORTING region server
> xxx,60020,1411383568857: Abort; we got an error after point-of-no-return
>
> The reason for the missing files is a little complex. The whole procedure
> involves two failed splits and one compaction:
> 1) There are too many files in the region and a compaction is requested, but
> it is not executed because there are many CompactionRequests
> (compactionRunners) in the compaction queue. The CompactionRequest holds the
> Store object, and also holds the list of that store's storefiles to compact.
>
> 2) The region is big enough that a split is requested. The region is
> offline during the split, and the store is closed. But the split fails while
> splitting the files (for some reason, such as busy I/O causing a timeout):
> 2014-09-23 18:26:02,738 INFO
> org.apache.hadoop.hbase.regionserver.SplitRequest: Running rollback/cleanup
> of failed split of xxxxx; Took too long to split the files and create the
> references, aborting split
>
> 3) The split is rolled back successfully, and the region is online again.
> During the rollback, a new Store object is created, but the store held in the
> compaction queue is not removed, so there are two (or maybe more) Store
> objects for the same store in the regionserver.
>
> 4) The compaction requested earlier now runs on the old store of the region:
> some storefiles are compacted and removed, and new, bigger storefiles are
> created. But the store re-initialized during the split rollback doesn't know
> about the change to the storefiles.
>
> 5) A split of the region runs again and fails again, because the storefiles
> in the parent region no longer exist (removed by the compaction). Also, the
> split transaction doesn't know that there is a new file created by the
> compaction. In 0.94.2, this error isn't found until the daughter regions
> open, but by then it's too late: the split fails after PONR, and this causes
> a regionserver shutdown. In 0.94.20 and 0.98, splitStoreFiles looks into the
> storefiles in the parent region and finds the error before PONR, so the
> failed split can be rolled back.
>      code in HRegionFileSystem.splitStoreFile:
>      ...
>      byte[] lastKey = f.createReader().getLastKey();
>
> So this situation is a fatal error in early 0.94 versions, and still a bug
> common to later 0.94 and higher versions. It is also the reason why the
> storefile reader is sometimes null (closed by the first failed split).
>
