KnightChess opened a new issue #3889:
URL: https://github.com/apache/iceberg/issues/3889
Spark 3.2, Iceberg 0.12.0, HiveCatalog
[phenomenon]
The Hive metastore table parameter `metadata_location` points to `v3.metadata.json`, but only `v1.metadata.json` and `v2.metadata.json` exist in the storage system.
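For reference, a minimal diagnostic sketch (not from the issue) that compares the `metadata_location` parameter recorded in the Hive metastore with the metadata files that actually exist in storage. The database/table names are taken from the logs below, and it assumes the default `<table location>/metadata` layout; adjust the HiveConf/S3A configuration for your environment.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class CheckMetadataLocation {
  public static void main(String[] args) throws Exception {
    HiveConf hiveConf = new HiveConf();                         // picks up hive-site.xml from the classpath
    HiveMetaStoreClient hms = new HiveMetaStoreClient(hiveConf);

    // "metadata_location" is the standard Iceberg property key on the HMS table
    Table table = hms.getTable("omega_dev", "ods_web_global_h");
    String metadataLocation = table.getParameters().get("metadata_location");
    System.out.println("HMS metadata_location = " + metadataLocation);

    // List the *.metadata.json files that really exist under <table location>/metadata
    Configuration conf = new Configuration();
    Path metadataDir = new Path(table.getSd().getLocation(), "metadata");
    FileSystem fs = metadataDir.getFileSystem(conf);
    FileStatus[] files = fs.globStatus(new Path(metadataDir, "*.metadata.json"));
    if (files != null) {
      for (FileStatus st : files) {
        System.out.println("exists in storage: " + st.getPath());
      }
    }

    System.out.println("pointed-to file exists: " + fs.exists(new Path(metadataLocation)));
    hms.close();
  }
}
```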
[log]
s3a:
```shell
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Upload complete to xxxxx/v3.metadata.json by WriteOperationHelper {bucket=didi.eu-north-1.omega}
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Closing FileBlock{index=1, destFile=/tmp/s3a/s3ablock-0001-6924320477208373257.tmp, state=Closed, dataSize=38806, limit=104857600}
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Closing org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory@30e560a4
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Statistics: OutputStreamStatistics{blocksSubmitted=1, blocksInQueue=1, blocksActive=0, blockUploadsCompleted=0, blockUploadsFailed=0, bytesPendingUpload=0, bytesUploaded=38806, blocksAllocated=1, blocksReleased=1, blocksActivelyAllocated=0, exceptionsInMultipartFinalize=0, transferDuration=0 ms, queueDuration=0 ms, averageQueueTime=0 ms, totalUploadDuration=0 ms, effectiveBandwidth=0.0 bytes/s}
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Closing OutputStreamStatistics{blocksSubmitted=1, blocksInQueue=1, blocksActive=0, blockUploadsCompleted=0, blockUploadsFailed=0, bytesPendingUpload=0, bytesUploaded=38806, blocksAllocated=1, blocksReleased=1, blocksActivelyAllocated=0, exceptionsInMultipartFinalize=0, transferDuration=0 ms, queueDuration=0 ms, averageQueueTime=0 ms, totalUploadDuration=0 ms, effectiveBandwidth=0.0 bytes/s}
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Ignoring close() as stream is already closed
2022-01-11 17:46:13,947 DEBUG org.apache.hadoop.security.UserGroupInformation [] - PrivilegedAction as:(auth:SIMPLE) from:org.apache.iceberg.hive.HiveClientPool.reconnect(HiveClientPool.java:108)
2022-01-11 17:46:14,102 DEBUG org.apache.hadoop.fs.s3a.S3AStorageStatistics [] - op_delete += 1 -> 3
2022-01-11 17:46:14,102 DEBUG org.apache.hadoop.fs.s3a.S3AStorageStatistics [] - op_get_file_status += 1 -> 122
2022-01-11 17:46:14,102 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem [] - Getting path status for s3a://xxxx/v3.metadata.json
2022-01-11 17:46:14,102 DEBUG org.apache.hadoop.fs.s3a.S3AStorageStatistics [] - object_metadata_requests += 1 -> 175
2022-01-11 17:46:14,144 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem [] - Found exact file: normal file
2022-01-11 17:46:14,144 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem [] - Delete path s3a://xxxx/v3.metadata.json - recursive false
```
iceberg:
```shell
2022-01-11 17:46:13,944 ERROR org.apache.iceberg.hive.HiveTableOperations [] - Cannot tell if commit to omega_dev.ods_web_global_h succeeded, attempting to reconnect and check.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
    at org.apache.iceberg.relocated.com.google.common.base.Throwables.propagate(Throwables.java:241) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.common.DynMethods$UnboundMethod.invoke(DynMethods.java:80) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.hive.HiveTableOperations.lambda$persistTable$3(HiveTableOperations.java:302) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:76) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.hive.HiveTableOperations.persistTable(HiveTableOperations.java:298) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:259) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:126) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:300) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    ...
2022-01-11 17:46:13,947 INFO hive.metastore [] - Closed a connection to metastore, current connections: 0
2022-01-11 17:46:13,947 INFO hive.metastore [] - Trying to connect to metastore with URI thrift://xxxxx.xxx
2022-01-11 17:46:13,949 INFO hive.metastore [] - Opened a connection to metastore, current connections: 1
2022-01-11 17:46:13,949 INFO hive.metastore [] - Request client ipAddress:
2022-01-11 17:46:13,954 INFO hive.metastore [] - Connected to metastore.
2022-01-11 17:46:14,102 INFO org.apache.iceberg.BaseMetastoreTableOperations [] - Commit status check: Commit to iceberg.omega_dev.ods_web_global_h of s3a://xxxx/v3.metadata.json failed
```
metastore log:
```shell
2022-01-11T17:46:13,416 ........ alter table: xxxxx.xxx
...
2022-01-11T17:46:14,912 log.PerfLogger: </PERFLOG method=alter_table_with_environment_context start=1641894373416 end=1641894374912 duration=1496 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=49 retryCount=0 error=false>
```
In short:
- 2022-01-11 17:46:13,291  upload of v3.metadata.json completes
- 2022-01-11 17:46:13,416  metastore begins alter_table
- 2022-01-11 17:46:13,944  HiveTableOperations throws RuntimeException, maybe due to a network problem
- 2022-01-11 17:46:14,102  BaseMetastoreTableOperations checkCommitStatus: commit status check fails
- 2022-01-11 17:46:14,144  because of the FAILURE result, S3AFileSystem deletes s3a://xxxx/v3.metadata.json
- 2022-01-11 17:46:14,912  metastore alter_table ends
While checkCommitStatus runs, the metastore is still executing alter_table, so `metadata_location` still points to `v2.metadata.json`. The check therefore treats the commit as failed and `v3.metadata.json` is deleted from the storage system. Once alter_table finishes, the metastore points to `v3.metadata.json`, which no longer exists.
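To make the interleaving concrete, here is a toy, self-contained model of the race (hypothetical names, not Iceberg code): one thread plays the metastore applying the pointer swap slowly, while the main thread plays the client that checks the pointer after the transport error and deletes the new file because the swap is not yet visible.
```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class CommitRaceDemo {
  // "metadata_location" pointer held by the metastore, and the files in storage
  static final AtomicReference<String> metadataLocation = new AtomicReference<>("v2.metadata.json");
  static final Set<String> storage = ConcurrentHashMap.newKeySet();

  public static void main(String[] args) throws InterruptedException {
    storage.add("v1.metadata.json");
    storage.add("v2.metadata.json");
    storage.add("v3.metadata.json");                  // 17:46:13,291 upload complete

    Thread alterTable = new Thread(() -> {            // 17:46:13,416 alter_table starts
      sleep(1500);                                    // duration=1496 in the PerfLogger line
      metadataLocation.set("v3.metadata.json");       // 17:46:14,912 alter_table ends
    });
    alterTable.start();

    // Client side: the thrift call failed, so check whether the commit "took".
    sleep(150);                                       // 17:46:14,102 status check
    if (!"v3.metadata.json".equals(metadataLocation.get())) {
      storage.remove("v3.metadata.json");             // 17:46:14,144 cleanup deletes the new file
    }

    alterTable.join();
    System.out.println("metadata_location = " + metadataLocation.get());                   // v3.metadata.json
    System.out.println("exists in storage = " + storage.contains(metadataLocation.get())); // false
  }

  static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
```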
I don't know how to tell when alter_table will finish once this situation occurs. In our metastore the lock timeout is 5 minutes, and I think the operation is idempotent. So what about increasing the failure retries around `persistTable`?
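Retrying `persistTable` is one option; a related knob that newer Iceberg releases expose is making the commit status check itself retry with back-off, so a slow alter_table can finish before the check concludes FAILURE and cleans up the new metadata file. A sketch, assuming the `commit.status-check.*` table property names from newer versions of `org.apache.iceberg.TableProperties` (they may not be honored by 0.12.0; verify against your version):
```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

public class RaiseStatusCheckRetries {
  public static void main(String[] args) {
    // Assumes hive-site.xml and the S3A configuration are on the classpath
    HiveCatalog catalog = new HiveCatalog();
    Map<String, String> props = new HashMap<>();
    props.put("uri", "thrift://xxxxx.xxx:9083");      // placeholder HMS URI
    catalog.initialize("iceberg", props);

    Table table = catalog.loadTable(TableIdentifier.of("omega_dev", "ods_web_global_h"));
    table.updateProperties()
        // Property names assumed from newer Iceberg releases, not guaranteed in 0.12.0
        .set("commit.status-check.num-retries", "10")         // default is 3
        .set("commit.status-check.min-wait-ms", "1000")
        .set("commit.status-check.max-wait-ms", "60000")
        .set("commit.status-check.total-timeout-ms", "300000")
        .commit();
  }
}
```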