KnightChess opened a new issue #3889:
URL: https://github.com/apache/iceberg/issues/3889
Spark 3.2, Iceberg 0.12.0, HiveCatalog
[phenomenon]
The Hive metastore table parameter `metadata_location` points to `v3.metadata.json`, but only `v1.metadata.json` and `v2.metadata.json` exist in the storage system.
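For reference, a minimal diagnostic sketch (not from the issue) that compares the `metadata_location` parameter recorded in the Hive metastore with the metadata files that actually exist in storage. The database/table names are taken from the logs below, and it assumes the default `<table location>/metadata` layout; adjust the HiveConf/S3A configuration for your environment.
```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hive.conf.HiveConf;
import org.apache.hadoop.hive.metastore.HiveMetaStoreClient;
import org.apache.hadoop.hive.metastore.api.Table;

public class CheckMetadataLocation {
  public static void main(String[] args) throws Exception {
    HiveConf hiveConf = new HiveConf();                         // picks up hive-site.xml from the classpath
    HiveMetaStoreClient hms = new HiveMetaStoreClient(hiveConf);

    // "metadata_location" is the standard Iceberg property key on the HMS table
    Table table = hms.getTable("omega_dev", "ods_web_global_h");
    String metadataLocation = table.getParameters().get("metadata_location");
    System.out.println("HMS metadata_location = " + metadataLocation);

    // List the *.metadata.json files that really exist under <table location>/metadata
    Configuration conf = new Configuration();
    Path metadataDir = new Path(table.getSd().getLocation(), "metadata");
    FileSystem fs = metadataDir.getFileSystem(conf);
    FileStatus[] files = fs.globStatus(new Path(metadataDir, "*.metadata.json"));
    if (files != null) {
      for (FileStatus st : files) {
        System.out.println("exists in storage: " + st.getPath());
      }
    }

    System.out.println("pointed-to file exists: " + fs.exists(new Path(metadataLocation)));
    hms.close();
  }
}
```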
[log]
s3a:
```shell
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Upload complete to xxxxx/v3.metadata.json by WriteOperationHelper {bucket=didi.eu-north-1.omega}
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Closing FileBlock{index=1, destFile=/tmp/s3a/s3ablock-0001-6924320477208373257.tmp, state=Closed, dataSize=38806, limit=104857600}
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Closing org.apache.hadoop.fs.s3a.S3ADataBlocks$DiskBlockFactory@30e560a4
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Statistics: OutputStreamStatistics{blocksSubmitted=1, blocksInQueue=1, blocksActive=0, blockUploadsCompleted=0, blockUploadsFailed=0, bytesPendingUpload=0, bytesUploaded=38806, blocksAllocated=1, blocksReleased=1, blocksActivelyAllocated=0, exceptionsInMultipartFinalize=0, transferDuration=0 ms, queueDuration=0 ms, averageQueueTime=0 ms, totalUploadDuration=0 ms, effectiveBandwidth=0.0 bytes/s}
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Closing OutputStreamStatistics{blocksSubmitted=1, blocksInQueue=1, blocksActive=0, blockUploadsCompleted=0, blockUploadsFailed=0, bytesPendingUpload=0, bytesUploaded=38806, blocksAllocated=1, blocksReleased=1, blocksActivelyAllocated=0, exceptionsInMultipartFinalize=0, transferDuration=0 ms, queueDuration=0 ms, averageQueueTime=0 ms, totalUploadDuration=0 ms, effectiveBandwidth=0.0 bytes/s}
2022-01-11 17:46:13,291 DEBUG org.apache.hadoop.fs.s3a.S3ABlockOutputStream [] - Ignoring close() as stream is already closed
2022-01-11 17:46:13,947 DEBUG org.apache.hadoop.security.UserGroupInformation [] - PrivilegedAction as:(auth:SIMPLE) from:org.apache.iceberg.hive.HiveClientPool.reconnect(HiveClientPool.java:108)
2022-01-11 17:46:14,102 DEBUG org.apache.hadoop.fs.s3a.S3AStorageStatistics [] - op_delete += 1 -> 3
2022-01-11 17:46:14,102 DEBUG org.apache.hadoop.fs.s3a.S3AStorageStatistics [] - op_get_file_status += 1 -> 122
2022-01-11 17:46:14,102 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem [] - Getting path status for s3a://xxxx/v3.metadata.json
2022-01-11 17:46:14,102 DEBUG org.apache.hadoop.fs.s3a.S3AStorageStatistics [] - object_metadata_requests += 1 -> 175
2022-01-11 17:46:14,144 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem [] - Found exact file: normal file
2022-01-11 17:46:14,144 DEBUG org.apache.hadoop.fs.s3a.S3AFileSystem [] - Delete path s3a://xxxx/v3.metadata.json - recursive false
```
iceberg:
```shell
2022-01-11 17:46:13,944 ERROR org.apache.iceberg.hive.HiveTableOperations [] - Cannot tell if commit to omega_dev.ods_web_global_h succeeded, attempting to reconnect and check.
java.lang.RuntimeException: org.apache.thrift.transport.TTransportException
    at org.apache.iceberg.relocated.com.google.common.base.Throwables.propagate(Throwables.java:241) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.common.DynMethods$UnboundMethod.invoke(DynMethods.java:80) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.hive.HiveTableOperations.lambda$persistTable$3(HiveTableOperations.java:302) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.ClientPoolImpl.run(ClientPoolImpl.java:51) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.hive.CachedClientPool.run(CachedClientPool.java:76) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.hive.HiveTableOperations.persistTable(HiveTableOperations.java:298) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.hive.HiveTableOperations.doCommit(HiveTableOperations.java:259) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.BaseMetastoreTableOperations.commit(BaseMetastoreTableOperations.java:126) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    at org.apache.iceberg.SnapshotProducer.lambda$commit$2(SnapshotProducer.java:300) ~[iceberg-flink-runtime-0.12.0-flink-20211230.023248-3.jar:?]
    ...
2022-01-11 17:46:13,947 INFO hive.metastore [] - Closed a connection to metastore, current connections: 0
2022-01-11 17:46:13,947 INFO hive.metastore [] - Trying to connect to metastore with URI thrift://xxxxx.xxx
2022-01-11 17:46:13,949 INFO hive.metastore [] - Opened a connection to metastore, current connections: 1
2022-01-11 17:46:13,949 INFO hive.metastore [] - Request client ipAddress:
2022-01-11 17:46:13,954 INFO hive.metastore [] - Connected to metastore.
2022-01-11 17:46:14,102 INFO org.apache.iceberg.BaseMetastoreTableOperations [] - Commit status check: Commit to iceberg.omega_dev.ods_web_global_h of s3a://xxxx/v3.metadata.json failed
```
metastore log:
```shell
2022-01-11T17:46:13,416 ........ alter table: xxxxx.xxx
...
2022-01-11T17:46:14,912 log.PerfLogger: </PERFLOG method=alter_table_with_environment_context start=1641894373416 end=1641894374912 duration=1496 from=org.apache.hadoop.hive.metastore.RetryingHMSHandler threadId=49 retryCount=0 error=false>
```
In short:
- 2022-01-11 17:46:13,291  upload of v3.metadata.json completes
- 2022-01-11 17:46:13,416  metastore begins alter_table
- 2022-01-11 17:46:13,944  HiveTableOperations throws RuntimeException, maybe due to a network problem
- 2022-01-11 17:46:14,102  BaseMetastoreTableOperations checkCommitStatus: commit status check fails
- 2022-01-11 17:46:14,144  because of the FAILURE result, S3AFileSystem deletes s3a://xxxx/v3.metadata.json
- 2022-01-11 17:46:14,912  metastore alter_table ends
While checkCommitStatus runs, the metastore is still executing alter_table, so `metadata_location` still points to `v2.metadata.json`. The check therefore treats the commit as failed and `v3.metadata.json` is deleted from the storage system. Once alter_table finishes, the metastore points to `v3.metadata.json`, which no longer exists.
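To make the interleaving concrete, here is a toy, self-contained model of the race (hypothetical names, not Iceberg code): one thread plays the metastore applying the pointer swap slowly, while the main thread plays the client that checks the pointer after the transport error and deletes the new file because the swap is not yet visible.
```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicReference;

public class CommitRaceDemo {
  // "metadata_location" pointer held by the metastore, and the files in storage
  static final AtomicReference<String> metadataLocation = new AtomicReference<>("v2.metadata.json");
  static final Set<String> storage = ConcurrentHashMap.newKeySet();

  public static void main(String[] args) throws InterruptedException {
    storage.add("v1.metadata.json");
    storage.add("v2.metadata.json");
    storage.add("v3.metadata.json");                  // 17:46:13,291 upload complete

    Thread alterTable = new Thread(() -> {            // 17:46:13,416 alter_table starts
      sleep(1500);                                    // duration=1496 in the PerfLogger line
      metadataLocation.set("v3.metadata.json");       // 17:46:14,912 alter_table ends
    });
    alterTable.start();

    // Client side: the thrift call failed, so check whether the commit "took".
    sleep(150);                                       // 17:46:14,102 status check
    if (!"v3.metadata.json".equals(metadataLocation.get())) {
      storage.remove("v3.metadata.json");             // 17:46:14,144 cleanup deletes the new file
    }

    alterTable.join();
    System.out.println("metadata_location = " + metadataLocation.get());                   // v3.metadata.json
    System.out.println("exists in storage = " + storage.contains(metadataLocation.get())); // false
  }

  static void sleep(long ms) {
    try { Thread.sleep(ms); } catch (InterruptedException e) { Thread.currentThread().interrupt(); }
  }
}
```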
I don't know how to tell when alter_table will finish once this situation occurs. In our metastore the lock timeout is 5 minutes, and I think the operation is idempotent. So what about increasing the failure retries around `persistTable`?
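Retrying `persistTable` is one option; a related knob that newer Iceberg releases expose is making the commit status check itself retry with back-off, so a slow alter_table can finish before the check concludes FAILURE and cleans up the new metadata file. A sketch, assuming the `commit.status-check.*` table property names from newer versions of `org.apache.iceberg.TableProperties` (they may not be honored by 0.12.0; verify against your version):
```java
import java.util.HashMap;
import java.util.Map;
import org.apache.iceberg.Table;
import org.apache.iceberg.catalog.TableIdentifier;
import org.apache.iceberg.hive.HiveCatalog;

public class RaiseStatusCheckRetries {
  public static void main(String[] args) {
    // Assumes hive-site.xml and the S3A configuration are on the classpath
    HiveCatalog catalog = new HiveCatalog();
    Map<String, String> props = new HashMap<>();
    props.put("uri", "thrift://xxxxx.xxx:9083");      // placeholder HMS URI
    catalog.initialize("iceberg", props);

    Table table = catalog.loadTable(TableIdentifier.of("omega_dev", "ods_web_global_h"));
    table.updateProperties()
        // Property names assumed from newer Iceberg releases, not guaranteed in 0.12.0
        .set("commit.status-check.num-retries", "10")         // default is 3
        .set("commit.status-check.min-wait-ms", "1000")
        .set("commit.status-check.max-wait-ms", "60000")
        .set("commit.status-check.total-timeout-ms", "300000")
        .commit();
  }
}
```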