[
https://issues.apache.org/jira/browse/HBASE-20642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16489498#comment-16489498
]
Ankit Singhal commented on HBASE-20642:
---------------------------------------
I analyzed the logs provided.
The client tried to add column family "cf-0544745230" to table "ittable-0455209020".
Client logs:
{code}
2018-05-15 02:54:20,789|INFO|MainThread|machine.py:167 -
run()||GUID=0022cef5-fb09-4e5e-bfad-5f239adfb691|2018-05-15 02:54:20,786 INFO
[Thread-10] hbase.IntegrationTestDDLMasterFailover: Adding column family: {NAME
=> 'cf-0544745230', VERSIONS => '1', EVICT_BLOCKS_ON_CLOSE => 'false',
NEW_VERSION_BEHAVIOR => 'false', KEEP_DELETED_CELLS => 'FALSE',
CACHE_DATA_ON_WRITE => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL =>
'FOREVER', MIN_VERSIONS => '0', REPLICATION_SCOPE => '0', BLOOMFILTER => 'ROW',
CACHE_INDEX_ON_WRITE => 'false', IN_MEMORY => 'false', CACHE_BLOOMS_ON_WRITE =>
'false', PREFETCH_BLOCKS_ON_OPEN => 'false', COMPRESSION => 'NONE', BLOCKCACHE
=> 'true', BLOCKSIZE => '65536'} to table: ittable-0455209020
{code}
However, the master executing the procedure was restarted after the procedure
had already updated the tableinfo in HDFS.
Log of the master that was about to go down:
{code}
2018-05-15 02:54:21,862 INFO [PEWorker-8]
assignment.RegionTransitionProcedure: Dispatch pid=16618, ppid=16338,
state=RUNNABLE:REGION_TRANSITION_DISPATCH; AssignProcedure
table=ittable-0474715061, region=65e930848fdabc3fa93fc6c2ee8e9ca9,
target=ctr-e138-1518143905142-311755-01-000009.hwx.site,16020,1526352510710;
rit=OPENING,
location=ctr-e138-1518143905142-311755-01-000009.hwx.site,16020,1526352510710
2018-05-15 02:54:25,908 INFO [main] master.HMaster: STARTING service HMaster
2018-05-15 02:54:20,790 INFO
[RpcServer.default.FPBQ.Fifo.handler=27,queue=0,port=20000] master.HMaster:
Client=hbase//172.27.24.220 modify ittable-0455209020
2018-05-15 02:54:21,849 INFO [PEWorker-2] util.FSTableDescriptors: Updated
tableinfo=hdfs://ns1/apps/hbase/data/data/default/ittable-0455209020/.tabledesc/.tableinfo.0000000003
{code}
The standby master then became active and resumed the procedure from the state
recorded in the master procedure WALs.
Standby master log:
{code}
2018-05-15 02:54:27,465 INFO
[master/ctr-e138-1518143905142-311755-01-000003:20000]
master.ActiveMasterManager: Registered as active
master=ctr-e138-1518143905142-311755-01-000003.hwx.site,20000,1526352691422
2018-05-15 02:55:14,413 INFO [PEWorker-15] procedure2.ProcedureExecutor:
Finished pid=16754, state=SUCCESS; ModifyTableProcedure
table=ittable-0455209020 in 53.5830sec
{code}
So now the client's retry to add the column family fails on the check below,
because the table descriptor has already been updated by both masters.
{code}
@Override
public long addColumn(
    final TableName tableName,
    final ColumnFamilyDescriptor column,
    final long nonceGroup,
    final long nonce)
    throws IOException {
  checkInitialized();
  checkTableExists(tableName);
  TableDescriptor old = getTableDescriptors().get(tableName);
  if (old.hasColumnFamily(column.getName())) {
    throw new InvalidFamilyOperationException("Column family '" + column.getNameAsString()
        + "' in table '" + tableName + "' already exists so cannot be added");
  }
{code}
Failure at the client:
{code}
org.apache.hadoop.hbase.InvalidFamilyOperationException:
org.apache.hadoop.hbase.InvalidFamilyOperationException: Column family
'cf-0544745230' in table 'ittable-0455209020' already exists so cannot be added
E at
org.apache.hadoop.hbase.master.HMaster.addColumn(HMaster.java:2158)
{code}
So the solution would be to run every step/check after the nonce check, as part
of procedure execution, so that a retry carrying the same nonce does not fail on
a precondition that the original request already satisfied. Attaching a
tentative fix.
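To illustrate the ordering proposed above, here is a minimal, self-contained sketch. It is not the attached patch and not HBase code: the `Master` class, its in-memory nonce map, and the use of `IllegalStateException` are hypothetical stand-ins chosen only to show why checking the nonce before the "family already exists" precondition lets a retry succeed.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class NonceFirstSketch {

    /** Hypothetical stand-in for the master; NOT HBase's HMaster. */
    public static class Master {
        // Assumption: nonce state survives failover (modelled here as a plain map).
        private final Map<String, Long> nonceToProcId = new HashMap<>();
        private final Set<String> families = new HashSet<>();
        private long nextProcId = 1;

        public long addColumn(String family, long nonceGroup, long nonce) {
            String key = nonceGroup + ":" + nonce;

            // 1. Nonce check first: a retry of an already-submitted request
            //    returns the original procedure id and skips the checks below.
            Long existing = nonceToProcId.get(key);
            if (existing != null) {
                return existing;
            }

            // 2. Precondition check only for genuinely new requests.
            if (families.contains(family)) {
                throw new IllegalStateException(
                    "Column family '" + family + "' already exists so cannot be added");
            }

            // 3. Record the nonce and apply the change.
            long procId = nextProcId++;
            nonceToProcId.put(key, procId);
            families.add(family);
            return procId;
        }
    }

    public static void main(String[] args) {
        Master master = new Master();
        long first = master.addColumn("cf-0544745230", 1L, 42L);
        // Same nonce: treated as a retry, so no "already exists" failure.
        long retry = master.addColumn("cf-0544745230", 1L, 42L);
        System.out.println("retry returned same procedure id: " + (first == retry));
    }
}
```

With the order reversed (precondition first, nonce second), the second call would throw even though it is only a retry of the first, which is the failure seen in the client log above.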
> IntegrationTestDDLMasterFailover throws 'InvalidFamilyOperationException
> -------------------------------------------------------------------------
>
> Key: HBASE-20642
> URL: https://issues.apache.org/jira/browse/HBASE-20642
> Project: HBase
> Issue Type: Bug
> Reporter: Ankit Singhal
> Assignee: Ankit Singhal
> Priority: Major
>
> [~romil.choksi] reported that IntegrationTestDDLMasterFailover is failing
> while adding column family during the time master is restarting.