[jira] [Commented] (NIFI-9477) data loss doing Incremental load based on a timestamp from SQL server using QueryDatabaseTableRecord

Peter Turcsanyi (Jira) Tue, 14 Dec 2021 07:54:06 -0800


    [ 
https://issues.apache.org/jira/browse/NIFI-9477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17459267#comment-17459267
 ]


Peter Turcsanyi commented on NIFI-9477:
---------------------------------------

[~abduq] Timestamp accuracy and having multiple records with the same 
timestamps is one aspect of this problem but there is another one: due to 
concurrent execution of transactions in RDBMS, it is not guaranteed that 
timestamps arrive in monotonically increasing way (that is a record with an 
older timestamp can appear).

Example scenario:
 # tx1 starts
 # tx1 creates a record with ts1
 # tx2 starts
 # tx2 creates a record with ts2 (ts2 > ts1)
 # tx2 commits and the record with ts2 becomes visible for others
 # QueryDatabaseTableRecord executes its query and stores ts2 as the max 
timestamp
 # tx1 commits, the record with ts1 becomes visible for others but 
QueryDatabaseTableRecord will never load it because ts1 < max ts (== ts2)

The recommended solution is to configure "Additional WHERE Clause" property 
with a condition that skips the most recent records. In Oracle it would look 
like {{"my_ts_column < systimestamp - interval 'N' second"}} (I believe MS SQL 
Server has something similar). It loads only those records that are older than 
'N' seconds (the interval should be longer than the longest data loading 
transaction).

It is similar to your solution but it does not generate duplicates and there is 
no need for a custom processor.

The other option would be to configure your data loading transactions to use 
stricter transaction isolation level and/or locks to achieve serial and ordered 
execution of transactions but in most cases it is not possible  or desirable 
(affects the loading side, performance reasons, etc.).

I hope this helps.

> data loss doing Incremental load based on a timestamp from SQL server using 
> QueryDatabaseTableRecord 
> -----------------------------------------------------------------------------------------------------
>
>                 Key: NIFI-9477
>                 URL: https://issues.apache.org/jira/browse/NIFI-9477
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Core Framework
>    Affects Versions: 1.13.2
>         Environment: ubuntu 20 
> 2-node cluster using an embedded zk
>            Reporter: Abdullah Aqeeli
>            Priority: Critical
>
> hello all,
>  
> We are using Nifi to do an incremental load from SQL Server using 
> QueryDatabaseTableRecord where max-value column is updated_at column with a 
> [Datetime2 
> |https://docs.microsoft.com/en-us/sql/t-sql/data-types/datetime2-transact-sql]data
>  type. The source tables get updated in milliseconds. We noticed that we are 
> losing few events doing with each QueryDatabaseTableRecord run. As a 
> work-around we created a custom processor that subtracts few seconds from the 
> max-value in the state fetching overlapping events and then we handle any 
> duplicates later on in the pipeline and it seems to solve the issue. However, 
> this is not optimal. We'd like to know what's causing the skips. Our 
> assumption so far is that timestamp datatype accuracy is causing the skips in 
> the incremental load however we're not sure.  
>  
> Thanks!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

[jira] [Commented] (NIFI-9477) data loss doing Incremental load based on a timestamp from SQL server using QueryDatabaseTableRecord

Reply via email to