[ 
https://issues.apache.org/jira/browse/HBASE-20642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489498#comment-16489498
 ] 

Ankit Singhal commented on HBASE-20642:
---------------------------------------

I analyzed the logs provided:

The client tried to add column family "cf-0544745230" to table "ittable-0455209020".
Client logs:
{code}
2018-05-15 02:54:20,789|INFO|MainThread|machine.py:167 - 
run()||GUID=0022cef5-fb09-4e5e-bfad-5f239adfb691|2018-05-15 02:54:20,786 INFO  
[Thread-10] hbase.IntegrationTestDDLMasterFailover: Adding column family: {NAME 
=> 'cf-0544745230', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false', 
NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE', 
CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => 
'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW', 
CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE => 
'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE 
=> 'true', BLOCKSIZE => '65536'} to table: ittable-0455209020
{code}

However, the master executing the procedure was restarted after the procedure 
had already updated the tableinfo in HDFS.

Log of the master that was about to go down:
{code}
2018-05-15 02:54:21,862 INFO  [PEWorker-8] 
assignment.RegionTransitionProcedure: Dispatch pid=16618, ppid=16338, 
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure 
table=ittable-0474715061, region=65e930848fdabc3fa93fc6c2ee8e9ca9, 
target=ctr-e138-1518143905142-311755-01-000009.hwx.site,16020,1526352510710; 
rit=OPENING, 
location=ctr-e138-1518143905142-311755-01-000009.hwx.site,16020,1526352510710
2018-05-15 02:54:25,908 INFO  [main] master.HMaster: STARTING service HMaster
2018-05-15 02:54:20,790 INFO  
[RpcServer.default.FPBQ.Fifo.handler=27,queue=0,port=20000] master.HMaster: 
Client=hbase//172.27.24.220 modify ittable-0455209020
2018-05-15 02:54:21,849 INFO  [PEWorker-2] util.FSTableDescriptors: Updated 
tableinfo=hdfs://ns1/apps/hbase/data/data/default/ittable-0455209020/.tabledesc/.tableinfo.0000000003
{code}


The standby master then became active and resumed the procedure from the state 
recorded in the master procedure WALs.
Standby master log:
{code}
2018-05-15 02:54:27,465 INFO  
[master/ctr-e138-1518143905142-311755-01-000003:20000] 
master.ActiveMasterManager: Registered as active 
master=ctr-e138-1518143905142-311755-01-000003.hwx.site,20000,1526352691422
2018-05-15 02:55:14,413 INFO  [PEWorker-15] procedure2.ProcedureExecutor: 
Finished pid=16754, state=SUCCESS; ModifyTableProcedure 
table=ittable-0455209020 in 53.5830sec
{code}


So now the client's retry to add the column family fails on the check below, 
because the table descriptor has already been updated by both masters.
{code}
  @Override
  public long addColumn(
      final TableName tableName,
      final ColumnFamilyDescriptor column,
      final long nonceGroup,
      final long nonce)
      throws IOException {
    checkInitialized();
    checkTableExists(tableName);

    TableDescriptor old = getTableDescriptors().get(tableName);
    if (old.hasColumnFamily(column.getName())) {
      throw new InvalidFamilyOperationException("Column family '" + column.getNameAsString()
          + "' in table '" + tableName + "' already exists so cannot be added");
    }
{code}

Failure at the client:
{code}
org.apache.hadoop.hbase.InvalidFamilyOperationException: 
org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family 
'cf-0544745230' in table 'ittable-0455209020' already exists so cannot be added
E               at 
org.apache.hadoop.hbase.master.HMaster.addColumn(HMaster.java:2158)
{code}


So the solution would be to run every such step and check only after the nonce 
check during procedure execution, so that a retried request is not rejected by 
a check that its own first attempt has already invalidated. Attaching a 
tentative fix.
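
To illustrate the ordering problem, here is a minimal, self-contained sketch. It is a toy model and not the attached patch: the class NonceRetrySketch, its methods, and its in-memory maps are all hypothetical stand-ins for the master, the tableinfo in HDFS, and the nonce bookkeeping.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model only; none of these names are HBase APIs. */
public class NonceRetrySketch {
    // State that survives a failover (stands in for tableinfo and procedure WALs).
    private final Set<String> families = new HashSet<>();
    private final Map<Long, Long> procIdByNonce = new HashMap<>();
    private long nextProcId = 1;

    /** Existence check before the nonce check: a retry of an applied request fails. */
    public long addColumnCheckFirst(String family, long nonce) {
        if (families.contains(family)) {
            throw new IllegalStateException("Column family '" + family + "' already exists");
        }
        Long existing = procIdByNonce.get(nonce);
        return existing != null ? existing : submit(family, nonce);
    }

    /** Nonce check first: a retry with a known nonce returns the original procedure id. */
    public long addColumnNonceFirst(String family, long nonce) {
        Long existing = procIdByNonce.get(nonce);
        if (existing != null) {
            return existing; // duplicate request, skip all precondition checks
        }
        if (families.contains(family)) {
            throw new IllegalStateException("Column family '" + family + "' already exists");
        }
        return submit(family, nonce);
    }

    // Registers the nonce and applies the change, as the first attempt did before failover.
    private long submit(String family, long nonce) {
        long procId = nextProcId++;
        procIdByNonce.put(nonce, procId);
        families.add(family);
        return procId;
    }

    public static void main(String[] args) {
        NonceRetrySketch master = new NonceRetrySketch();
        long first = master.addColumnNonceFirst("cf-0544745230", 42L);
        long retry = master.addColumnNonceFirst("cf-0544745230", 42L);
        System.out.println("same procedure id on retry: " + (first == retry));
        try {
            master.addColumnCheckFirst("cf-0544745230", 42L);
        } catch (IllegalStateException e) {
            System.out.println("check-first retry rejected: " + e.getMessage());
        }
    }
}
```

In this sketch the retry carries the same nonce as the first attempt, so the nonce-first variant recognizes it as a duplicate, while the check-first variant rejects it in the same way the client saw above.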

> IntegrationTestDDLMasterFailover throws 'InvalidFamilyOperationException 
> -------------------------------------------------------------------------
>
>                 Key: HBASE-20642
>                 URL: https://issues.apache.org/jira/browse/HBASE-20642
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>            Priority: Major
>
> [~romil.choksi] reported that IntegrationTestDDLMasterFailover fails 
> when adding a column family while the master is restarting.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
