dstandish commented on PR #50791:
URL: https://github.com/apache/airflow/pull/50791#issuecomment-2895054198

   > The only viable option, I think, would be to record in the database when 
was the last event for a given trigger associated to a trigger (table 
`asset_trigger`). We could pass that value down to the trigger which then can 
use it however it wants in its implementation. I say "viable option" because 
there can be multiple triggerers in an environment, so the only central place 
to record such things is the database. Any attempt to save a state in the 
triggerer would fail as soon as 2 (or more) are used.
   > 
   > This is definitely doable and not that hard I think. I do not have the 
bandwidth to do it now but happy to review :)
   
   We have to bear in mind that there isn't necessarily a perfect relationship 
between Airflow's event timestamp and the external one.
   
   An obvious example: an S3 file lands with timestamp X.  Airflow records the 
event timestamp 10 seconds after X.  But between X and X + 10, more files land 
in the bucket.  So if you take Airflow's event timestamp as the high watermark, 
you'll miss files.
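
   To make the gap concrete, here is a minimal sketch of that scenario. The file names, the timestamps, and the "list everything newer than the cutoff" query are all hypothetical and only illustrate the comparison; this is not how Airflow or any provider actually lists objects.

```python
from datetime import datetime, timedelta

# Hypothetical timestamps as reported by the external system (e.g. S3 LastModified).
x = datetime(2025, 5, 20, 12, 0, 0)
landed = {
    "a.csv": x,                          # the file that fired the event
    "b.csv": x + timedelta(seconds=4),   # lands before Airflow records the event
    "c.csv": x + timedelta(seconds=15),  # lands after Airflow records the event
}

# Assumption: Airflow records its own event timestamp 10 seconds after X.
airflow_event_ts = x + timedelta(seconds=10)

# Next poll, using Airflow's event timestamp as the cutoff: b.csv is skipped.
seen_with_airflow_ts = sorted(k for k, ts in landed.items() if ts > airflow_event_ts)

# Next poll, using the external timestamp of the file that fired the event.
seen_with_external_ts = sorted(k for k, ts in landed.items() if ts > x)

print(seen_with_airflow_ts)   # ['c.csv']
print(seen_with_external_ts)  # ['b.csv', 'c.csv']
```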
   
   So what we're really talking about here is _watermarking_.  In general, you 
need to track the actual timestamps from the external system when doing this 
kind of thing.
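
   A rough sketch of what tracking the external timestamps could look like. `poll_new` and its `(key, timestamp)` listing format are made up for illustration and are not an existing Airflow API; where the watermark gets persisted (for example, the database) is left out.

```python
from datetime import datetime, timedelta
from typing import Iterable, List, Optional, Tuple


def poll_new(
    objects: Iterable[Tuple[str, datetime]],
    watermark: Optional[datetime],
) -> Tuple[List[str], Optional[datetime]]:
    """Return keys newer than the watermark, plus the advanced watermark.

    Hypothetical helper: the watermark is derived only from timestamps
    reported by the external system, never from the local clock.
    """
    new = [(k, ts) for k, ts in objects if watermark is None or ts > watermark]
    if not new:
        # Nothing new: keep the old watermark unchanged.
        return [], watermark
    return sorted(k for k, _ in new), max(ts for _, ts in new)


t0 = datetime(2025, 5, 20, 12, 0, 0)
listing = [("a.csv", t0), ("b.csv", t0 + timedelta(seconds=5))]

first_keys, wm = poll_new(listing, None)

# A later file shows up; only it is returned on the next poll.
listing.append(("c.csv", t0 + timedelta(seconds=7)))
second_keys, wm = poll_new(listing, wm)
```

   One caveat with this sketch: the strict `>` comparison can still miss an object that shares the exact timestamp of the current watermark but becomes visible later, so a real implementation would probably also need to remember which keys it has seen at the watermark, or use an overlap window with de-duplication.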


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

