Quanlong Huang created IMPALA-10875:
---------------------------------------

             Summary: Transient stale catalog if catalogd is restarted more 
than once shortly
                 Key: IMPALA-10875
                 URL: https://issues.apache.org/jira/browse/IMPALA-10875
             Project: IMPALA
          Issue Type: Bug
          Components: Catalog
            Reporter: Quanlong Huang


This is a follow-up to IMPALA-5476. Though it's rare in practice, there is 
still a bug where a client can see a stale catalog in the following scenario:
 * Catalogd is restarted twice within one statestore catalog update cycle.
 * A DDL finishes its execDdl RPC on the catalogd from the second restart. The 
response carries a new catalog service id that differs from the coordinator's 
local one, so the DDL execution thread waits until the local one is updated.
 * The coordinator receives a catalog update from the catalogd of the first 
restart. The local catalog service id changes, which wakes up the DDL 
execution thread.
 * The DDL execution thread finds that the catalog service id still differs 
from the one that executed the DDL, so it ignores the DDL result and returns.

The client will see a stale catalog until the next catalog topic update arrives.
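
The scenario above can be sketched as a small single-threaded toy model. This is only an illustration under stated assumptions, not Impala code: the class ToyCoordinator, its methods, and the ids "id-0", "id-1" and "id-2" are all hypothetical, and the real wait on the catalog service id is replaced by consuming a queue of pending topic updates.

```python
class ToyCoordinator:
    """Toy model of a coordinator's catalog cache (hypothetical, not Impala code)."""

    def __init__(self, service_id, tables):
        self.local_service_id = service_id
        self.tables = set(tables)

    def apply_topic_update(self, service_id, tables):
        # A statestore topic update replaces the local service id and contents.
        self.local_service_id = service_id
        self.tables = set(tables)

    def finish_ddl(self, ddl_service_id, ddl_tables, pending_updates):
        """Return True if the DDL result was applied to the local cache."""
        old_id = self.local_service_id
        if ddl_service_id != old_id:
            # "Wait" until the local service id changes: consume queued updates.
            while self.local_service_id == old_id and pending_updates:
                self.apply_topic_update(*pending_updates.pop(0))
        if self.local_service_id != ddl_service_id:
            return False  # Ids still differ: the DDL result is ignored.
        self.tables = set(ddl_tables)
        return True


coord = ToyCoordinator("id-0", {"join_aa"})
# Catalogd restarted twice: "id-1" (first restart), then "id-2" (second restart).
# Only the first restart's update has reached the coordinator so far.
updates = [("id-1", {"join_aa"})]
# DROP TABLE join_aa ran on the second catalogd, so its result has no tables.
applied = coord.finish_ddl("id-2", set(), updates)
stale_tables = set(coord.tables)
# The next topic update, from the second catalogd, repairs the cache.
coord.apply_topic_update("id-2", set())
```

In this model the DDL thread wakes up on the update from the first restarted catalogd, sees that the ids still differ, and drops the result, so join_aa stays visible until the later update is applied.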

The following test reproduces the bug (add it to 
tests/custom_cluster/test_restart_services.py):
{code:python}
  UPDATE_FREQUENCY_S = 10

  @pytest.mark.execute_serially
  @CustomClusterTestSuite.with_args(
    statestored_args="--statestore_update_frequency_ms={frequency_ms}"
    .format(frequency_ms=(UPDATE_FREQUENCY_S * 1000)))
  def test_restart_catalogd_twice2(self):
    self.execute_query_expect_success(self.client, "drop table if exists join_aa")
    self.execute_query_expect_success(self.client, "create table join_aa(id int)")
    # Make the catalog object version grow large enough
    self.execute_query_expect_success(self.client, "invalidate metadata")
    # No need to care whether the DDLs execute successfully; they are only meant
    # to make the local catalog cache of impalad out of sync
    for i in range(0, 10):
      try:
        query = "alter table join_aa add columns (age" + str(i) + " int)"
        self.execute_query_async(query)
      except Exception as e:
        LOG.info(str(e))
    self.cluster.catalogd.restart()
    sleep(self.UPDATE_FREQUENCY_S * 2)
    self.cluster.catalogd.restart()
    self.execute_query_expect_success(self.client, "drop table join_aa")
    # Should not see stale metadata on 'join_aa'
    result = self.execute_query_expect_success(self.client, "show tables")
    assert 'join_aa' not in result.data
{code}


