[
https://issues.apache.org/jira/browse/HDDS-6261?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17488341#comment-17488341
]
Siyao Meng commented on HDDS-6261:
----------------------------------
Since this is not an issue on 1.2.0 (might need to be rechecked), and given the
addition of the upgrade test that checks overwriting keys created on a
new-version cluster, there are two ways to narrow this down:
1. We could do a [{{git bisect}}|https://git-scm.com/docs/git-bisect] with the
known good (initial starting point) at {{ozone-1.2.0}} and the known bad
(initial ending point) at the current master branch.
{code:bash|title=Using git bisect}
$ git checkout master
$ git bisect start
$ git bisect bad                # the current version is bad
$ git bisect good ozone-1.2.0   # ozone-1.2.0 is known to be good
$ git bisect run ./bisect.sh    # the script should exit with code 0 if the current source is good
{code}
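For (1), a {{bisect.sh}} along these lines could drive the run automatically. This is only a hypothetical sketch: the Maven invocation and {{run_upgrade_overwrite_test.sh}} are assumptions, not scripts that exist in the repo. {{git bisect run}} treats exit code 0 as good, 125 as "cannot test this commit, skip it", and other codes from 1 to 127 as bad.
{code:bash|title=bisect.sh (hypothetical sketch)}
#!/usr/bin/env bash
# Build the current checkout. A commit that does not even compile cannot
# be judged good or bad, so tell git bisect to skip it (exit code 125).
mvn -q -DskipTests clean install || exit 125

# Placeholder for the upgrade/downgrade acceptance test that overwrites
# a key created on the new version (assumed helper, not in the repo).
./run_upgrade_overwrite_test.sh
# 0 marks the commit good; any other code from 1 to 124 marks it bad.
exit $?
{code}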
or 2. we could do it by staring at the code (manually reviewing the changes since {{ozone-1.2.0}}).
btw, I'm busy working on a bunch of multi-tenancy branch changes (and some
other jiras) right now, so I may well not be looking at this anytime soon.
Any help from anyone who is interested / wants to help / has knowledge of this
bug is welcome. :)
> OM crashes when trying to overwrite a key during upgrade downgrade testing
> --------------------------------------------------------------------------
>
> Key: HDDS-6261
> URL: https://issues.apache.org/jira/browse/HDDS-6261
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Manager
> Affects Versions: 1.3.0
> Reporter: Siyao Meng
> Priority: Blocker
>
> While working on HDDS-6084 (related to upgrade acceptance testing), [~erose]
> and I found that if:
> 1) a key is created with a new OM version (1.3.0),
> 2) the cluster is downgraded to OM 1.1.0, and
> 3) we try to overwrite the key created in (1),
> then step (3) results in all 3 OMs crashing.
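> As a hypothetical command-line sketch of these steps (volume/bucket/key names
> are taken from the log below; {{./payload}} is a placeholder file; the version
> switch happens out of band in the upgrade test environment, and the original
> repro went through the S3 gateway, which hits the same OM CreateKey path):
> {code:bash|title=Repro sketch (hypothetical)}
> # On the 1.3.0 cluster: create the key.
> ozone sh key put /s3v/old1-bucket/key2 ./payload
>
> # ...downgrade all OMs to 1.1.0 and restart the cluster...
>
> # On the downgraded cluster: overwrite the same key. All 3 OMs crash.
> ozone sh key put /s3v/old1-bucket/key2 ./payload
> {code}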
> The issue seems to have been introduced on the unreleased 1.3.0 branch: the
> same issue cannot be reproduced in 1.1.0-to-1.2.0 upgrade/downgrade testing
> (which should exclude HDDS-5243 and HDDS-5393 as potential root causes). This
> could indicate that some unreleased change has broken key versioning after the
> downgrade. (It could well be an incompatible change. Needs further
> investigation.)
> {code}
> om2_1 | 2022-02-03 21:36:15,228 [OM StateMachine ApplyTransaction Thread - 0] ERROR ratis.OzoneManagerStateMachine: Terminating with exit status 1: Request cmdType: CreateKey
> om2_1 | traceID: ""
> om2_1 | clientId: "client-72B024AF247D"
> om2_1 | userInfo {
> om2_1 | userName: "dlfknslnfslf"
> om2_1 | remoteAddress: "10.9.0.19"
> om2_1 | hostName: "ha_s3g_1.ha_net"
> om2_1 | }
> om2_1 | version: 1
> om2_1 | createKeyRequest {
> om2_1 | keyArgs {
> om2_1 | volumeName: "s3v"
> om2_1 | bucketName: "old1-bucket"
> om2_1 | keyName: "key2"
> om2_1 | dataSize: 17539
> om2_1 | type: RATIS
> om2_1 | factor: THREE
> om2_1 | keyLocations {
> om2_1 | blockID {
> om2_1 | containerBlockID {
> om2_1 | containerID: 1
> om2_1 | localID: 107736214721200128
> om2_1 | }
> om2_1 | blockCommitSequenceId: 0
> om2_1 | }
> om2_1 | offset: 0
> om2_1 | length: 268435456
> om2_1 | createVersion: 0
> om2_1 | pipeline {
> om2_1 | members {
> om2_1 | uuid: "b92bf4c8-3b0c-40b0-bb2b-05b6d3594e13"
> om2_1 | ipAddress: "10.9.0.16"
> om2_1 | hostName: "ha_dn2_1.ha_net"
> om2_1 | ports {
> om2_1 | name: "REPLICATION"
> om2_1 | value: 9886
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "RATIS"
> om2_1 | value: 9858
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "RATIS_ADMIN"
> om2_1 | value: 9857
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "RATIS_SERVER"
> om2_1 | value: 9856
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "STANDALONE"
> om2_1 | value: 9859
> om2_1 | }
> om2_1 | networkName: "b92bf4c8-3b0c-40b0-bb2b-05b6d3594e13"
> om2_1 | networkLocation: "/default-rack"
> om2_1 | persistedOpState: IN_SERVICE
> om2_1 | persistedOpStateExpiry: 0
> om2_1 | uuid128 {
> om2_1 | mostSigBits: -5103716611873029968
> om2_1 | leastSigBits: -4959864281830437357
> om2_1 | }
> om2_1 | }
> om2_1 | members {
> om2_1 | uuid: "f0b7e615-d4ee-4ec4-a6b5-ec68b82c07e9"
> om2_1 | ipAddress: "10.9.0.15"
> om2_1 | hostName: "ha_dn1_1.ha_net"
> om2_1 | ports {
> om2_1 | name: "REPLICATION"
> om2_1 | value: 9886
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "RATIS"
> om2_1 | value: 9858
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "RATIS_ADMIN"
> om2_1 | value: 9857
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "RATIS_SERVER"
> om2_1 | value: 9856
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "STANDALONE"
> om2_1 | value: 9859
> om2_1 | }
> om2_1 | networkName: "f0b7e615-d4ee-4ec4-a6b5-ec68b82c07e9"
> om2_1 | networkLocation: "/default-rack"
> om2_1 | persistedOpState: IN_SERVICE
> om2_1 | persistedOpStateExpiry: 0
> om2_1 | uuid128 {
> om2_1 | mostSigBits: -1101158602427707708
> om2_1 | leastSigBits: -6433976558118238231
> om2_1 | }
> om2_1 | }
> om2_1 | members {
> om2_1 | uuid: "c7912312-811d-469d-8c40-c739cd2a1e62"
> om2_1 | ipAddress: "10.9.0.17"
> om2_1 | hostName: "ha_dn3_1.ha_net"
> om2_1 | ports {
> om2_1 | name: "REPLICATION"
> om2_1 | value: 9886
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "RATIS"
> om2_1 | value: 9858
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "RATIS_ADMIN"
> om2_1 | value: 9857
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "RATIS_SERVER"
> om2_1 | value: 9856
> om2_1 | }
> om2_1 | ports {
> om2_1 | name: "STANDALONE"
> om2_1 | value: 9859
> om2_1 | }
> om2_1 | networkName: "c7912312-811d-469d-8c40-c739cd2a1e62"
> om2_1 | networkLocation: "/default-rack"
> om2_1 | persistedOpState: IN_SERVICE
> om2_1 | persistedOpStateExpiry: 0
> om2_1 | uuid128 {
> om2_1 | mostSigBits: -4066430426156284259
> om2_1 | leastSigBits: -8340447458821005726
> om2_1 | }
> om2_1 | }
> om2_1 | state: PIPELINE_OPEN
> om2_1 | type: RATIS
> om2_1 | factor: THREE
> om2_1 | id {
> om2_1 | id: "c0b6f272-9a39-4dc3-ada8-c3b833cc6e17"
> om2_1 | uuid128 {
> om2_1 | mostSigBits: -4560190998638408253
> om2_1 | leastSigBits: -5933277313150194153
> om2_1 | }
> om2_1 | }
> om2_1 | leaderID: "f0b7e615-d4ee-4ec4-a6b5-ec68b82c07e9"
> om2_1 | creationTimeStamp: 1643924132125
> om2_1 | suggestedLeaderID {
> om2_1 | mostSigBits: -1101158602427707708
> om2_1 | leastSigBits: -6433976558118238231
> om2_1 | }
> om2_1 | leaderID128 {
> om2_1 | mostSigBits: -1101158602427707708
> om2_1 | leastSigBits: -6433976558118238231
> om2_1 | }
> om2_1 | }
> om2_1 | partNumber: 0
> om2_1 | }
> om2_1 | isMultipartKey: false
> om2_1 | acls {
> om2_1 | type: USER
> om2_1 | name: "dlfknslnfslf"
> om2_1 | rights: "\200"
> om2_1 | aclScope: ACCESS
> om2_1 | }
> om2_1 | modificationTime: 1643924174840
> om2_1 | }
> om2_1 | clientID: 107736214722445312
> om2_1 | }
> om2_1 | failed with exception
> om2_1 | java.lang.IllegalArgumentException
> om2_1 | at com.google.common.base.Preconditions.checkArgument(Preconditions.java:128)
> om2_1 | at org.apache.hadoop.ozone.om.helpers.OmKeyInfo.<init>(OmKeyInfo.java:81)
> om2_1 | at org.apache.hadoop.ozone.om.helpers.OmKeyInfo$Builder.build(OmKeyInfo.java:378)
> om2_1 | at org.apache.hadoop.ozone.om.helpers.OmKeyInfo.getFromProtobuf(OmKeyInfo.java:460)
> om2_1 | at org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.fromPersistedFormat(OmKeyInfoCodec.java:59)
> om2_1 | at org.apache.hadoop.ozone.om.codec.OmKeyInfoCodec.fromPersistedFormat(OmKeyInfoCodec.java:36)
> om2_1 | at org.apache.hadoop.hdds.utils.db.CodecRegistry.asObject(CodecRegistry.java:55)
> om2_1 | at org.apache.hadoop.hdds.utils.db.TypedTable.getFromTableIfExist(TypedTable.java:261)
> om2_1 | at org.apache.hadoop.hdds.utils.db.TypedTable.getIfExist(TypedTable.java:248)
> om2_1 | at org.apache.hadoop.ozone.om.request.key.OMKeyCreateRequest.validateAndUpdateCache(OMKeyCreateRequest.java:236)
> om2_1 | at org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleWriteRequest(OzoneManagerRequestHandler.java:227)
> om2_1 | at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.runCommand(OzoneManagerStateMachine.java:415)
> om2_1 | at org.apache.hadoop.ozone.om.ratis.OzoneManagerStateMachine.lambda$applyTransaction$1(OzoneManagerStateMachine.java:240)
> om2_1 | at java.base/java.util.concurrent.CompletableFuture$AsyncSupply.run(CompletableFuture.java:1700)
> om2_1 | at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
> om2_1 | at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
> om2_1 | at java.base/java.lang.Thread.run(Thread.java:834)
> om2_1 | 2022-02-03 21:36:15,253 [shutdown-hook-0] INFO om.OzoneManagerStarter: SHUTDOWN_MSG:
> om2_1 | /************************************************************
> om2_1 | SHUTDOWN_MSG: Shutting down OzoneManager at a250845831a7/10.9.0.12
> om2_1 | ************************************************************/
> {code}
> This is where the OM crashes in the Ozone 1.1.0 code; going by the linked
> lines, it appears to be the sanity check in the {{OmKeyInfo}} constructor
> asserting that the latest {{OmKeyLocationInfoGroup}}'s version number equals
> {{versions.size() - 1}}:
> https://github.com/apache/ozone/blob/ozone-1.1.0/hadoop-ozone/common/src/main/java/org/apache/hadoop/ozone/om/helpers/OmKeyInfo.java#L81-L82