[ 
https://issues.apache.org/jira/browse/IMPALA-12687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quanlong Huang resolved IMPALA-12687.
-------------------------------------
    Fix Version/s: Impala 4.4.0
       Resolution: Fixed

> Key conflicts in catalog inFlightOperations when catalog RPC retries
> --------------------------------------------------------------------
>
>                 Key: IMPALA-12687
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12687
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Catalog
>            Reporter: Quanlong Huang
>            Assignee: Quanlong Huang
>            Priority: Critical
>             Fix For: Impala 4.4.0
>
>
> Saw TestCatalogRPCTimeout.test_catalog_rpc_retries failed in an internal job. 
> The commit id is 624b6c646819d7a15d550169ceca4ce5a3fa794f. The failure is
> {noformat}
> custom_cluster/test_rpc_timeout.py:275: in test_catalog_rpc_retries
>     self.execute_query("refresh functional.alltypes")
> common/impala_test_suite.py:878: in wrapper
>     return function(*args, **kwargs)
> common/impala_test_suite.py:910: in execute_query
>     return self.__execute_query(self.client, query, query_options)
> common/impala_test_suite.py:1003: in __execute_query
>     return impalad_client.execute(query, user=user)
> common/impala_connection.py:214: in execute
>     return self.__beeswax_client.execute(sql_stmt, user=user)
> beeswax/impala_beeswax.py:191: in execute
>     handle = self.__execute_query(query_string.strip(), user=user)
> beeswax/impala_beeswax.py:369: in __execute_query
>     self.wait_for_finished(handle)
> beeswax/impala_beeswax.py:390: in wait_for_finished
>     raise ImpalaBeeswaxException("Query aborted:" + error_log, None)
> E   ImpalaBeeswaxException: ImpalaBeeswaxException:
> E    Query aborted:NullPointerException: null{noformat}
> catalogd logs:
> {noformat}
> I0105 08:19:22.862653 28878 debug-util.cc:439] Debug Action: 
> RESET_METADATA_DELAY:JITTER@[email protected] sleeping for 4791 ms
> I0105 08:19:27.671103 28878 JniUtil.java:166] 
> 3d452b4ae20cdf17:93549bd700000000] resetMetadata request: REFRESH TABLE 
> functional.alltypes issued by jenkins
> I0105 08:19:27.679502 28902 TableLoader.java:76] Loading metadata for: 
> functional.alltypes (Load triggered by REFRESH TABLE functional.alltypes 
> issued by jenkins)
> I0105 08:19:27.679899 28878 TableLoadingMgr.java:72] 
> 3d452b4ae20cdf17:93549bd700000000] Loading metadata for table: 
> functional.alltypes
> I0105 08:19:27.679981 28878 TableLoadingMgr.java:74] 
> 3d452b4ae20cdf17:93549bd700000000] Remaining items in queue: 0. Loads in 
> progress: 1
> I0105 08:19:27.709326 28902 Table.java:234] createEventId_ for table: 
> functional.alltypes set to: -1
> I0105 08:19:27.709725 28902 HdfsTable.java:1251] Loading metadata for table 
> definition and all partition(s) of functional.alltypes (Load triggered by 
> REFRESH TABLE functional.alltypes issued by jenkins)
> I0105 08:19:27.771229 28902 HdfsTable.java:1886] Loaded 13 columns from HMS. 
> Actual columns: 13
> I0105 08:19:27.805235 28902 HdfsTable.java:3056] Load Valid Write Id List 
> Done. Time taken: 26.465us
> I0105 08:19:27.805305 28902 HdfsTable.java:1291] Fetching partition metadata 
> from the Metastore: functional.alltypes
> I0105 08:19:27.809267 28902 MetaStoreUtil.java:190] Fetching 24 partitions 
> for: functional.alltypes using partition batch size: 1000 
> I0105 08:19:27.821784 28902 MetaStoreUtil.java:208] Fetched 24/24 partitions 
> for table functional.alltypes
> I0105 08:19:27.821811 28902 HdfsTable.java:1298] Fetched partition metadata 
> from the Metastore: functional.alltypes
> I0105 08:19:27.862032 28776 TAcceptQueueServer.cpp:355] New connection to 
> server CatalogService from client <Host: 127.0.0.1 Port: 39038>
> I0105 08:19:27.870393 28903 JniUtil.java:166] 
> 3d452b4ae20cdf17:93549bd700000000] resetMetadata request: REFRESH TABLE 
> functional.alltypes issued by jenkins
> I0105 08:19:27.870957 28903 TableLoadingMgr.java:72] 
> 3d452b4ae20cdf17:93549bd700000000] Loading metadata for table: 
> functional.alltypes
> I0105 08:19:27.871035 28903 TableLoadingMgr.java:74] 
> 3d452b4ae20cdf17:93549bd700000000] Remaining items in queue: 0. Loads in 
> progress: 1
> I0105 08:19:27.907255 28902 ParallelFileMetadataLoader.java:224] Loading file 
> and block metadata for 24 paths for table functional.alltypes using a thread 
> pool of size 5
> I0105 08:19:27.957284 28902 HdfsTable.java:821] Loaded file and block 
> metadata for functional.alltypes partitions: year=2009/month=1, 
> year=2009/month=10, year=2009/month=11, and 21 others. Time taken: 53.850ms
> I0105 08:19:27.959899 28902 Table.java:1081] last refreshed event id for 
> table: functional.alltypes set to: -1
> I0105 08:19:27.959956 28902 TableLoader.java:176] Loaded metadata for: 
> functional.alltypes (280ms)
> E0105 08:19:27.980352 28903 JniUtil.java:183] 
> 3d452b4ae20cdf17:93549bd700000000] Error in REFRESH TABLE functional.alltypes 
> issued by jenkins. Time spent: 110ms
> I0105 08:19:27.980885 28903 jni-util.cc:302] 
> 3d452b4ae20cdf17:93549bd700000000] java.lang.NullPointerException
>         at 
> org.apache.impala.catalog.monitor.CatalogOperationTracker.archiveRecord(CatalogOperationTracker.java:104)
>         at 
> org.apache.impala.catalog.monitor.CatalogOperationTracker.decrement(CatalogOperationTracker.java:160)
>         at 
> org.apache.impala.service.CatalogOpExecutor.execResetMetadata(CatalogOpExecutor.java:6614)
>         at 
> org.apache.impala.service.JniCatalog.lambda$resetMetadata$4(JniCatalog.java:321)
>         at 
> org.apache.impala.service.JniCatalogOp.lambda$execAndSerialize$1(JniCatalogOp.java:90)
>         at org.apache.impala.service.JniCatalogOp.execOp(JniCatalogOp.java:58)
>         at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:89)
>         at 
> org.apache.impala.service.JniCatalogOp.execAndSerialize(JniCatalogOp.java:100)
>         at 
> org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:237)
>         at 
> org.apache.impala.service.JniCatalog.execAndSerialize(JniCatalog.java:251)
>         at 
> org.apache.impala.service.JniCatalog.resetMetadata(JniCatalog.java:320)
> I0105 08:19:27.980897 28903 status.cc:129] 3d452b4ae20cdf17:93549bd700000000] 
> NullPointerException: null{noformat}
> The REFRESH request from the same query id was sent twice to catalogd, 
> causing conflicts in the inFlightOperations map.
> {code:java}
> 100   private void archiveRecord(TUniqueId queryId, String errorMsg) {
> 101     if (queryId != null && inFlightOperations.containsKey(queryId)) {
> 102       TCatalogOpRecord record = inFlightOperations.remove(queryId);
> 103       if (catalogOperationLogSize == 0) return;
> 104       record.setFinish_time_ms(System.currentTimeMillis()); // record 
> could be removed by another thread so become null
> 105       if (errorMsg != null) {
> 106         record.setStatus("FAILED");
> 107         record.setDetails(record.getDetails() + ", error=" + errorMsg);
> 108       } else {
> 109         record.setStatus("FINISHED");
> 110       }
> 111       synchronized (finishedOperations) {
> 112         if (finishedOperations.size() >= catalogOperationLogSize) {
> 113           finishedOperations.poll();
> 114         }
> 115         finishedOperations.add(record);
> 116       }
> 117     }
> 118   }{code}
> We should use queryId+threadId as the key in case of catalog RPC retries. 
> It's not a severe issue in production cluster since 
> catalog_client_rpc_timeout_ms is 0 by default. So catalog RPC never timeout 
> and got retried.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to