[ 
https://issues.apache.org/jira/browse/NIFI-16066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18093260#comment-18093260
 ] 

ASF subversion and git services commented on NIFI-16066:
--------------------------------------------------------

Commit 306830fcd0220ae994b3c1f7f2c5e42aacfcc1fc in nifi's branch 
refs/heads/main from Alaksiej Ščarbaty
[ https://gitbox.apache.org/repos/asf?p=nifi.git;h=306830fcd02 ]

NIFI-16066 Release lingering rename in ConsumeKinesis (#11385)

* NIFI-16066 Release lingering rename lock when error occurs in ConsumeKinesis

When an error (e.g. a failed checkpoint copy) occurred while a node held
the rename lock in LegacyCheckpointMigrator, the renameOwner marker was
left on the migration table, blocking all future rename attempts until
the stale-lock timeout elapsed. Release the rename lock on error, on all
three lock-holding paths, so the rename can be retried. The release is
owner-guarded (renameOwner = :owner) so a slow-failing node cannot wipe a
lock another node has already force-acquired via the stale-lock takeover,
and it is best-effort so a transient DynamoDB error while releasing does
not mask the original failure.
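
The release behaviour described above can be sketched as follows. This is a
minimal illustration that uses an in-memory stand-in for the migration table
rather than DynamoDB. Only renameOwner is taken from the commit message; the
class and method names (RenameLockSketch, acquire, forceAcquire, release,
releaseQuietly, renameWithLock) are hypothetical, and releasing the lock on
the success path is an assumption made for the sketch.

```java
import java.util.Objects;

final class RenameLockSketch {
    // In-memory stand-in for the renameOwner marker on the migration table.
    private static String renameOwner;

    static synchronized String currentOwner() {
        return renameOwner;
    }

    // Conditional acquire: succeeds only when no owner is recorded.
    static synchronized boolean acquire(String owner) {
        if (renameOwner != null) {
            return false;
        }
        renameOwner = owner;
        return true;
    }

    // Stale-lock takeover: overwrites whatever owner is recorded.
    static synchronized void forceAcquire(String owner) {
        renameOwner = owner;
    }

    // Owner-guarded release, analogous to a "renameOwner = :owner" condition:
    // the marker is cleared only if the caller still owns it.
    static synchronized boolean release(String owner) {
        if (!Objects.equals(renameOwner, owner)) {
            return false;
        }
        renameOwner = null;
        return true;
    }

    // Best-effort release: a failure while releasing is reported, never thrown,
    // so it cannot mask the failure that triggered the release.
    static void releaseQuietly(String owner) {
        try {
            release(owner);
        } catch (RuntimeException e) {
            System.err.println("Failed to release rename lock: " + e.getMessage());
        }
    }

    // Runs the rename steps under the lock and releases the lock on error.
    static void renameWithLock(String owner, Runnable renameSteps) {
        if (!acquire(owner)) {
            throw new IllegalStateException("rename lock held by " + currentOwner());
        }
        try {
            renameSteps.run();
        } catch (RuntimeException e) {
            releaseQuietly(owner);
            throw e; // the original failure is what the caller sees
        }
        release(owner); // assumption: the sketch also releases on success
    }

    public static void main(String[] args) {
        try {
            renameWithLock("node-a", () -> {
                throw new IllegalStateException("checkpoint copy failed");
            });
        } catch (IllegalStateException e) {
            System.out.println("original failure: " + e.getMessage());
        }
        System.out.println("owner after failure: " + currentOwner());

        // A slow-failing node must not wipe a lock that was force-acquired.
        acquire("node-a");
        forceAcquire("node-b");
        releaseQuietly("node-a");
        System.out.println("owner after guarded release: " + currentOwner());
    }
}
```

In this sketch the failed copy leaves no owner behind, so a retry can acquire
the lock immediately, while the late release from node-a leaves the lock
force-acquired by node-b untouched.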

Also make waitForTableRenamed treat a rename as complete only once the
migration table has been dropped, not merely when the checkpoint table
has the new schema. The new checkpoint table is created empty before the
copy runs, so schema presence alone does not indicate the migration
finished.
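
The stricter completion check can be sketched as a predicate over two
observations. This is an illustration only: the class, method, and parameter
names are hypothetical, and the table lookups that a real implementation
would make against DynamoDB are replaced by boolean suppliers.

```java
import java.util.function.BooleanSupplier;

final class RenameCompletionSketch {
    // A rename counts as complete only when the checkpoint table has the new
    // schema AND the migration table has been dropped.
    static boolean isRenameComplete(boolean checkpointTableHasNewSchema,
                                    boolean migrationTableExists) {
        return checkpointTableHasNewSchema && !migrationTableExists;
    }

    // Polls both observations until the rename is complete or the attempts
    // run out. Returns false on timeout instead of assuming success.
    static boolean waitForRename(BooleanSupplier newSchemaPresent,
                                 BooleanSupplier migrationTableExists,
                                 int maxAttempts,
                                 long pauseMillis) throws InterruptedException {
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (isRenameComplete(newSchemaPresent.getAsBoolean(),
                                 migrationTableExists.getAsBoolean())) {
                return true;
            }
            Thread.sleep(pauseMillis);
        }
        return false;
    }

    public static void main(String[] args) throws InterruptedException {
        // New table created empty, copy not finished: not complete.
        System.out.println(isRenameComplete(true, true));
        // New schema present and migration table dropped: complete.
        System.out.println(isRenameComplete(true, false));
        // A lingering migration table never satisfies the wait.
        System.out.println(waitForRename(() -> true, () -> true, 3, 1L));
    }
}
```

With a schema-only check the first case would be reported as complete even
though the new checkpoint table is still empty.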

> Release lingering rename lock when error occurs in ConsumeKinesis
> -----------------------------------------------------------------
>
>                 Key: NIFI-16066
>                 URL: https://issues.apache.org/jira/browse/NIFI-16066
>             Project: Apache NiFi
>          Issue Type: Improvement
>          Components: Extensions
>            Reporter: Alaksiej Ščarbaty
>            Assignee: Alaksiej Ščarbaty
>            Priority: Major
>             Fix For: 2.10.0
>
>
> When an error occurs in _LegacyCheckpointMigrator_ during the checkpoint table 
> rename, the rename lock isn't released.
> In addition, the migration is considered complete as soon as a table with the 
> new schema is available, without checking whether the migration actually 
> completed.
> *Improvements*
> Release the rename lock if an error occurs during the table rename.
> In _waitForTableRenamed_ check not only for the table schema, but also ensure 
> the migration table has been dropped. That's a sign of the completed 
> migration.
> *Failure scenario*
>  # LegacyCheckpointMigrator creates a migration table and copies the 
> checkpoints there.
>  # The migrator acquires the rename lock.
>  # The migrator deletes the original table and creates a new one.
>  # The migrator copies items from the migration table into the new "original" 
> one.
>  # A DynamoDB exception is thrown and the copying is interrupted, but the 
> {*}lock is not released{*}.
>  # On restart the migrator sees the lingering migration table and tries to 
> continue the migration.
>  # The lock is already held, so the migrator waits for the table to be renamed.
>  # {*}The condition checks the table schema only{*}; it ignores the fact that 
> the checkpoints haven't been migrated yet.
>  # The migrator treats the migration as done.
>  # The processor sees no checkpoints in the table and starts from the latest 
> position: {*}potential data loss{*}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)