[jira] [Commented] (IMPALA-7961) Concurrent catalog heavy workloads can cause queries with SYNC_DDL to fail fast

ASF subversion and git services (JIRA) Thu, 14 Feb 2019 21:00:30 -0800


    [ https://issues.apache.org/jira/browse/IMPALA-7961?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16768944#comment-16768944 ]


ASF subversion and git services commented on IMPALA-7961:
---------------------------------------------------------

Commit 5ed6c665d190dbe5303e241afbc50e0eacb0a6af in impala's branch 
refs/heads/master from Bharath Vissapragada
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=5ed6c66 ]

IMPALA-7961: Avoid adding unmodified objects to DDL response

When a DDL is processed, we typically add the affected (added/removed)
objects to the response TCatalogUpdateResult struct. This response
is processed on the coordinator and the changes are applied locally.
When SYNC_DDL is enabled, the Catalog server also includes a topic
version number that should include all the affected objects so that the
coordinator can wait for that minimum topic version to be applied on
all other coordinators before returning the control back to the user.
This covering topic version is calculated by looking at the topic
update log, which contains all the in-flight updates (and to an extent
past updates) that are periodically GC'ed.

Bug: In certain cases like CREATE TABLE IF NOT EXISTS, we could end up
adding objects to the DDL response which haven't been modified in a
while (> TOPIC_UPDATE_LOG_GC_FREQUENCY) and hence could be potentially
GC'ed from the TopicUpdateLog. This means that the Catalog server
wouldn't be able to find a covering topic update version and eventually
gives up, throwing the error described in the jira.
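
The failure mode can be illustrated with a minimal sketch. This is a toy
model and not Impala's actual TopicUpdateLog: the class TopicLogSketch, its
methods, and the version-based GC cutoff are all hypothetical names and
simplifications chosen for illustration. It only shows why a covering version
cannot be computed once one of the objects in the response has been GC'ed.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.OptionalLong;

// Toy model (hypothetical, not Impala code): maps an object key to the last
// topic version in which that object was sent.
class TopicLogSketch {
  private final Map<String, Long> lastSentVersion = new HashMap<>();

  void recordSent(String key, long topicVersion) {
    lastSentVersion.put(key, topicVersion);
  }

  // Drops entries whose last topic version is older than the cutoff.
  void gcOlderThan(long cutoffVersion) {
    lastSentVersion.values().removeIf(v -> v < cutoffVersion);
  }

  // Returns the smallest topic version that covers every key, or empty if
  // some key has no entry (for example because it was GC'ed).
  OptionalLong coveringVersion(List<String> keys) {
    long max = 0;
    for (String key : keys) {
      Long v = lastSentVersion.get(key);
      if (v == null) return OptionalLong.empty();
      max = Math.max(max, v);
    }
    return OptionalLong.of(max);
  }
}
```

In this model, a DDL response that lists a long-unmodified table next to a
freshly modified one gets no covering version after a GC pass, which
corresponds to the point where the real code retries and then gives up.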

Fix: Bump the version of any object that already exists when IF EXISTS
is used in conjunction with SYNC_DDL. This makes sure that the object is
included in the upcoming topic updates and waitForSyncDdlVersion() can find
a covering topic update that includes this object. This is a hack and could
cause false-positive invalidations, but it is definitely better than breaking
SYNC_DDL semantics.
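
A rough sketch of the version-bump idea follows. It is a simplified model and
not the code from this commit: CatalogSketch, createTableIfNotExists() and
nextTopicUpdate() are hypothetical names, and the real catalog tracks far more
state. It only shows that bumping the version of an already-existing object
makes it reappear in the next topic update.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Toy model (hypothetical, not Impala's CatalogServiceCatalog).
class CatalogSketch {
  private long catalogVersion = 0;
  private long lastSentVersion = 0;
  private final Map<String, Long> objectVersions = new HashMap<>();

  // Returns true if the table was created, false if it already existed.
  boolean createTableIfNotExists(String name, boolean syncDdl) {
    if (objectVersions.containsKey(name)) {
      // Workaround described above: bump the version of the untouched
      // object so the next topic update picks it up again.
      if (syncDdl) objectVersions.put(name, ++catalogVersion);
      return false;
    }
    objectVersions.put(name, ++catalogVersion);
    return true;
  }

  // Collects the objects changed since the previous topic update.
  List<String> nextTopicUpdate() {
    List<String> changed = new ArrayList<>();
    for (Map.Entry<String, Long> e : objectVersions.entrySet()) {
      if (e.getValue() > lastSentVersion) changed.add(e.getKey());
    }
    lastSentVersion = catalogVersion;
    return changed;
  }
}
```

Without SYNC_DDL the repeated CREATE leaves the object out of the next
update. With SYNC_DDL it is resent, which is also the source of the
false-positive invalidations mentioned above.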

Also added some additional diagnostic logging that could've simplified
debugging an issue like this.

Testing: Since this is a racy bug, I could only repro it by forcing
frequent topic update log GCs along with a specific sequence of
actions. Couldn't reproduce it with the patch.

Change-Id: If3e914b70ba796c9b224e9dea559b8c40aa25d83
Reviewed-on: http://gerrit.cloudera.org:8080/12428
Reviewed-by: Bharath Vissapragada <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>


> Concurrent catalog heavy workloads can cause queries with SYNC_DDL to fail 
> fast
> -------------------------------------------------------------------------------
>
>                 Key: IMPALA-7961
>                 URL: https://issues.apache.org/jira/browse/IMPALA-7961
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>    Affects Versions: Impala 2.12.0, Impala 3.1.0
>            Reporter: bharath v
>            Assignee: bharath v
>            Priority: Critical
>         Attachments: 0001-Repro-of-IMPALA-7961.patch
>
>
> When the catalog server is under heavy load with concurrent updates to 
> objects, queries with SYNC_DDL can fail with the following message.
> *User facing error message:*
> {noformat}
> ERROR: CatalogException: Couldn't retrieve the catalog topic version for the 
> SYNC_DDL operation after 3 attempts.The operation has been successfully 
> executed but its effects may have not been broadcast to all the coordinators.
> {noformat}
> *Exception from the catalog server log:*
> {noformat}
> I1031 00:00:49.168761 1127039 CatalogServiceCatalog.java:1903] Operation 
> using SYNC_DDL is waiting for catalog topic version: 236535. Time to identify 
> topic version (msec): 1088
> I1031 00:00:49.168824 1125528 CatalogServiceCatalog.java:1903] Operation 
> using SYNC_DDL is waiting for catalog topic version: 236535. Time to identify 
> topic version (msec): 12625
> I1031 00:00:49.168851 1131986 jni-util.cc:230] 
> org.apache.impala.catalog.CatalogException: Couldn't retrieve the catalog 
> topic version for the SYNC_DDL operation after 3 attempts.The operation has 
> been successfully executed but its effects may have not been broadcast to all 
> the coordinators.
>         at 
> org.apache.impala.catalog.CatalogServiceCatalog.waitForSyncDdlVersion(CatalogServiceCatalog.java:1891)
>         at 
> org.apache.impala.service.CatalogOpExecutor.execDdlRequest(CatalogOpExecutor.java:336)
>         at org.apache.impala.service.JniCatalog.execDdl(JniCatalog.java:146)
> ::::
> {noformat}
> *What this means*
> The Catalog operation is actually successful (the change has been committed 
> to HMS and the Catalog server cache), but the Catalog server noticed that 
> broadcasting the changes is taking longer than expected (for whatever 
> reason) and, instead of waiting, it fails fast. The coordinators are 
> expected to eventually sync up in the background.
> *Problem*
>  - This violates the contract of the SYNC_DDL query option since the query 
> returns early.
>  - This is a behavioral regression from the pre-IMPALA-5058 behavior, where 
> queries would wait indefinitely for SYNC_DDL-based changes to propagate.
> *Notes*
>  - Introduced by IMPALA-5058
>  - Based on the occurrences of this issue, we narrowed it down to specific 
> kinds of DDL statements (see Jira comments).
>  - My understanding is that this also applies to the Catalog V2 (or 
> LocalCatalog mode) since we still rely on the CatalogServer for DDL 
> orchestration and hence it takes this codepath.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
