[
https://issues.apache.org/jira/browse/IMPALA-10563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17300104#comment-17300104
]
Quanlong Huang commented on IMPALA-10563:
-----------------------------------------
h1. Analysis for *test_concurrent_inserts*
The timeout issue of *test_concurrent_inserts* is similar to the above. It also
suffers from slow INSERTs caused by slow file listing. This test has more
inserts to do (2x of test_failing_inserts), so it takes more time. What's
worse, it runs some TRUNCATE statements, which require an EXCLUSIVE lock on the
table. TRUNCATE therefore blocks other INSERT statements from acquiring the
SHARED_READ lock, so there is a risk that INSERTs fail while waiting for the
transaction lock:
{code:java}
E0311 18:32:04.738550 4356 Frontend.java:2053]
484300d116492c97:97cddf3800000000] Aborting transaction: 802
I0311 18:32:04.751101 4356 jni-util.cc:288] 484300d116492c97:97cddf3800000000]
org.apache.impala.common.TransactionException: Failed to acquire lock for
transaction 802
at
org.apache.impala.compat.MetastoreShim.acquireLock(MetastoreShim.java:843)
at
org.apache.impala.service.Frontend.createLockForInsert(Frontend.java:2129)
at
org.apache.impala.service.Frontend.doCreateExecRequest(Frontend.java:1668)
at
org.apache.impala.service.Frontend.getTExecRequest(Frontend.java:1605)
at
org.apache.impala.service.Frontend.createExecRequest(Frontend.java:1575)
at
org.apache.impala.service.JniFrontend.createExecRequest(JniFrontend.java:162)
{code}
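The blocking semantics can be illustrated with a minimal shared/exclusive lock sketch in Python. This is only an illustration of why an EXCLUSIVE holder (TRUNCATE) starves SHARED_READ acquirers (INSERT) until they time out; it is not Impala's or HMS's lock implementation, and all names here are hypothetical:

```python
import threading

class SharedExclusiveLock:
    """Toy shared/exclusive lock: many SHARED_READ holders, one EXCLUSIVE."""

    def __init__(self):
        self._cond = threading.Condition()
        self._readers = 0
        self._writer = False

    def acquire_shared(self, timeout=None):
        # An INSERT-like caller: blocked while an EXCLUSIVE holder exists.
        with self._cond:
            ok = self._cond.wait_for(lambda: not self._writer, timeout)
            if not ok:
                return False  # timed out, like the TransactionException above
            self._readers += 1
            return True

    def release_shared(self):
        with self._cond:
            self._readers -= 1
            self._cond.notify_all()

    def acquire_exclusive(self, timeout=None):
        # A TRUNCATE-like caller: needs no writer and no readers.
        with self._cond:
            ok = self._cond.wait_for(
                lambda: not self._writer and self._readers == 0, timeout)
            if not ok:
                return False
            self._writer = True
            return True

    def release_exclusive(self):
        with self._cond:
            self._writer = False
            self._cond.notify_all()
```

While the EXCLUSIVE lock is held, every SHARED_READ attempt just waits until its timeout, which is exactly the shape of the failure above.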
So *test_concurrent_inserts* fails in one of two ways: it either times out
after 600s or fails with the above error.
After increasing the timeout and the lock retry limit (currently 10, raised to
1000), *test_concurrent_inserts* passes in 2380.50s. So this is just another
kind of slowdown caused by slow file listing on GCS.
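The effect of the retry-limit change can be seen from a sketch of a bounded retry loop. The 2000ms wait and the limit of 10 mirror the HMS log below, but the function and parameter names are illustrative, not Impala's actual code:

```python
import time

def acquire_with_retry(try_acquire, max_retries=10, wait_s=2.0, sleep=time.sleep):
    """Retry `try_acquire` up to `max_retries` times, waiting `wait_s`
    seconds between attempts; give up (return False) once exhausted."""
    for _ in range(max_retries):
        if try_acquire():
            return True
        sleep(wait_s)
    return False
```

With max_retries=10 and a 2s wait, a contended lock is abandoned after roughly 20s; raising the limit to 1000 lets the statement ride out minutes of slow GCS file listing instead of aborting the transaction.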
> Stress insert tests timeout on GCS
> ----------------------------------
>
> Key: IMPALA-10563
> URL: https://issues.apache.org/jira/browse/IMPALA-10563
> Project: IMPALA
> Issue Type: Bug
> Components: Infrastructure
> Reporter: Quanlong Huang
> Assignee: Quanlong Huang
> Priority: Major
> Attachments: test_concurrent_inserts_catalogd.INFO.bz2,
> test_concurrent_inserts_hive-metastore.log.bz2,
> test_concurrent_inserts_impalad.INFO.bz2,
> test_concurrent_inserts_impalad_node1.INFO.bz2,
> test_concurrent_inserts_impalad_node2.INFO.bz2,
> test_failing_inserts_catalogd.INFO.bz2,
> test_failing_inserts_hive-metastore.log.bz2,
> test_failing_inserts_impalad.INFO.bz2,
> test_failing_inserts_impalad_node1.INFO.bz2,
> test_failing_inserts_impalad_node2.INFO.bz2
>
>
> The following tests failed with TimeoutError when running on GCS:
> {code:java}
> F
> stress/test_acid_stress.py::TestConcurrentAcidInserts::()::test_concurrent_inserts[unique_database0]
> stress/test_acid_stress.py:255: in test_concurrent_inserts
> run_tasks(writers + checkers)
> stress/stress_util.py:43: in run_tasks
> pool.map_async(Task.run, tasks).get(timeout_seconds)
>
> ../toolchain/toolchain-packages-gcc7.5.0/python-2.7.16/lib/python2.7/multiprocessing/pool.py:568:
> in get
> raise TimeoutError
> E TimeoutError
> F
> stress/test_acid_stress.py::TestFailingAcidInserts::()::test_failing_inserts[unique_database0]
> stress/test_acid_stress.py:333: in test_failing_inserts
> self._run_test_failing_inserts(unique_database, is_partitioned)
> stress/test_acid_stress.py:324: in _run_test_failing_inserts
> run_tasks(writers + checkers)
> stress/stress_util.py:43: in run_tasks
> pool.map_async(Task.run, tasks).get(timeout_seconds)
>
> ../toolchain/toolchain-packages-gcc7.5.0/python-2.7.16/lib/python2.7/multiprocessing/pool.py:568:
> in get
> raise TimeoutError
> E TimeoutError
> F
> stress/test_insert_stress.py::TestInsertStress::()::test_inserts[unique_database0]
> stress/test_insert_stress.py:102: in test_inserts
> run_tasks(writers + checkers)
> stress/stress_util.py:43: in run_tasks
> pool.map_async(Task.run, tasks).get(timeout_seconds)
>
> ../toolchain/toolchain-packages-gcc7.5.0/python-2.7.16/lib/python2.7/multiprocessing/pool.py:568:
> in get
> raise TimeoutError
> E TimeoutError
> {code}
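> The TimeoutError above comes straight from `multiprocessing`: `run_tasks` in
> stress_util.py calls `pool.map_async(...).get(timeout_seconds)`, and
> `AsyncResult.get` raises `multiprocessing.TimeoutError` once the deadline
> passes. A small self-contained repro of that mechanism (the task body is a
> stand-in for the stalled writers/checkers, not the real test code):
> {code:java}
import multiprocessing
import time

def slow_task(x):
    # stands in for a writer/checker stalled on slow GCS file listing
    time.sleep(10)
    return x

def run_tasks_with_timeout(timeout_seconds):
    """Mirror of run_tasks(): AsyncResult.get(timeout) raises
    multiprocessing.TimeoutError if the workers don't finish in time."""
    pool = multiprocessing.Pool(2)
    try:
        pool.map_async(slow_task, [1, 2]).get(timeout_seconds)
        return "done"
    except multiprocessing.TimeoutError:
        return "TimeoutError"
    finally:
        pool.terminate()
> {code}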
> It seems the statements spend too much time waiting for HMS, which slows down
> the whole test. Found some error logs in HMS's log:
> {code}
> 2021-03-08T01:29:10,479 INFO [pool-6-thread-181] txn.TxnHandler: Allocated
> writeId: 1 for txnId: 3667
> 2021-03-08T01:29:10,492 INFO [pool-6-thread-177] txn.TxnHandler: Added
> entries to MIN_HISTORY_LEVEL with a single query for current txn: [3669]
> 2021-03-08T01:29:10,493 INFO [pool-6-thread-181] txn.TxnHandler: Allocated
> write ids for dbName=test_concurrent_inserts_8933345c,
> tblName=test_concurrent_inserts (txnIds: [3667])
> 2021-03-08T01:29:10,502 INFO [pool-6-thread-195] txn.TxnHandler: Added
> entries to MIN_HISTORY_LEVEL with a single query for current txn: [3670]
> 2021-03-08T01:29:10,507 ERROR [pool-6-thread-197] txn.TxnHandler: Exception
> during write ids allocation for
> request=AllocateTableWriteIdsRequest(dbName:test_concurrent_inserts_8933345c,
> tableName:test_concurrent_inserts, txnIds:[3668]). Will retry if possible.
> org.postgresql.util.PSQLException: ERROR: duplicate key value violates unique
> constraint "NEXT_WRITE_ID_IDX"
> Detail: Key ("NWI_DATABASE",
> "NWI_TABLE")=(test_concurrent_inserts_8933345c, test_concurrent_inserts)
> already exists.
> at
> org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2532)
> ~[postgresql-42.2.14.jar:42.2.14]
> at
> org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2267)
> ~[postgresql-42.2.14.jar:42.2.14]
> at
> org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:312)
> ~[postgresql-42.2.14.jar:42.2.14]
> at
> org.postgresql.jdbc.PgStatement.executeInternal(PgStatement.java:448)
> ~[postgresql-42.2.14.jar:42.2.14]
> at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:369)
> ~[postgresql-42.2.14.jar:42.2.14]
> at
> org.postgresql.jdbc.PgPreparedStatement.executeWithFlags(PgPreparedStatement.java:153)
> ~[postgresql-42.2.14.jar:42.2.14]
> at
> org.postgresql.jdbc.PgPreparedStatement.execute(PgPreparedStatement.java:142)
> ~[postgresql-42.2.14.jar:42.2.14]
> at
> com.zaxxer.hikari.pool.ProxyPreparedStatement.execute(ProxyPreparedStatement.java:44)
> ~[HikariCP-2.6.1.jar:?]
> at
> com.zaxxer.hikari.pool.HikariProxyPreparedStatement.execute(HikariProxyPreparedStatement.java)
> ~[HikariCP-2.6.1.jar:?]
> at
> org.apache.hadoop.hive.metastore.txn.TxnHandler.allocateTableWriteIds(TxnHandler.java:1820)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at
> org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.allocate_table_write_ids(HiveMetaStore.java:8370)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> ~[?:1.8.0_282]
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> ~[?:1.8.0_282]
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> ~[?:1.8.0_282]
> at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_282]
> at
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:147)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at
> org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:108)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at com.sun.proxy.$Proxy28.allocate_table_write_ids(Unknown Source)
> [?:?]
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$allocate_table_write_ids.getResult(ThriftHiveMetastore.java:20126)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at
> org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Processor$allocate_table_write_ids.getResult(ThriftHiveMetastore.java:20110)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at org.apache.thrift.ProcessFunction.process(ProcessFunction.java:39)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:111)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor$1.run(TUGIBasedProcessor.java:107)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at java.security.AccessController.doPrivileged(Native Method)
> [?:1.8.0_282]
> at javax.security.auth.Subject.doAs(Subject.java:422) [?:1.8.0_282]
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1898)
> [hadoop-common-3.1.1.7.2.7.0-44.jar:?]
> at
> org.apache.hadoop.hive.metastore.TUGIBasedProcessor.process(TUGIBasedProcessor.java:119)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at
> org.apache.thrift.server.TThreadPoolServer$WorkerProcess.run(TThreadPoolServer.java:286)
> [hive-exec-3.1.3000.7.2.7.0-44.jar:3.1.3000.7.2.7.0-44]
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> [?:1.8.0_282]
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> [?:1.8.0_282]
> at java.lang.Thread.run(Thread.java:748) [?:1.8.0_282]
> 2021-03-08T01:29:10,513 WARN [pool-6-thread-197] txn.TxnHandler: Retryable
> error detected in
> allocateTableWriteIds(AllocateTableWriteIdsRequest(dbName:test_concurrent_inserts_8933345c,
> tableName:test_concurrent_inserts, txnIds:[3668])). Will wait 2000ms and
> retry up to 10 times. Error: ERROR: duplicate key value violates unique
> constraint "NEXT_WRITE_ID_IDX"
> Detail: Key ("NWI_DATABASE",
> "NWI_TABLE")=(test_concurrent_inserts_8933345c, test_concurrent_inserts)
> already exists.
> 2021-03-08T01:29:10,554 INFO [pool-6-thread-177] txn.TxnHandler: Allocated
> writeId: 2 for txnId: 3669
> {code}
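> The NEXT_WRITE_ID_IDX violation is a classic read-then-insert race: two
> concurrent write-id allocators both see no NEXT_WRITE_ID row for the table
> and both attempt the INSERT; the unique index rejects the loser, and
> TxnHandler treats that as retryable. A minimal sqlite3 sketch of the pattern
> (the schema and names are simplified illustrations, not HMS's actual schema
> or logic):
> {code:java}
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE next_write_id ("
             "db TEXT, tbl TEXT, next_id INTEGER, UNIQUE (db, tbl))")

def allocate(conn, db, tbl):
    """Read-then-insert/update; the INSERT branch loses races to the
    unique index, raising IntegrityError (cf. the PSQLException above)."""
    row = conn.execute("SELECT next_id FROM next_write_id "
                       "WHERE db=? AND tbl=?", (db, tbl)).fetchone()
    if row is None:
        conn.execute("INSERT INTO next_write_id VALUES (?, ?, 2)", (db, tbl))
        return 1
    conn.execute("UPDATE next_write_id SET next_id=? WHERE db=? AND tbl=?",
                 (row[0] + 1, db, tbl))
    return row[0]

def allocate_with_retry(conn, db, tbl, retries=10):
    for _ in range(retries):
        try:
            return allocate(conn, db, tbl)
        except sqlite3.IntegrityError:
            continue  # another allocator inserted first; re-read and retry
    raise RuntimeError("write id allocation retries exhausted")
> {code}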
--
This message was sent by Atlassian Jira
(v8.3.4#803005)