jroachgolf84 commented on issue #49900:
URL: https://github.com/apache/airflow/issues/49900#issuecomment-2835642538

   This is definitely that I've been thinking about a bit. We're going to have 
to some sort of a "time-based" approach to this. I've drafted up some thoughts 
here; this is WIP.
   
   ### Proposed Solution
   
   - An `ObjectStore` (S3, WASB, etc.) AssetWatcher is written, and leverages 
some Trigger. Really, this Trigger will be doing most (all) of the work.
   - This Trigger pulls from the `asset` endpoint to determine the 
`last_asset_event_timestamp` for that asset (see "Changes Needed" below, I am 
close to closing this issue out).
   - The trigger retrieves a list of all objects in the object store with a 
“last updated” timestamp `>= timestamp` of the “last Asset Event”.
   - For each of these resulting objects, an Asset Event is written to the 
`asset_event` endpoint (using timestamp that the object itself landed/was 
updated in the object store), and yielded to trigger a DAG Run. *Note that the 
implementation of #47164 will include a write to the `/assets` endpoint to 
update the `last_asset_event_id` and `last_asset_event_timestamp`.
   
   ### Changes Needed
   
   Outside of the general development of this `ObjectStore` AssetWatcher, a few 
other changes are needed in order to implement this pattern.
   
   - The addition of a `last_asset_event` field to the `asset` endpoint. See 
#47164 . I chatted with Pierre; he implied that this would be the desired way 
to go about this, so I’m going to go ahead and add to the issue.
   
   ## Assumptions/Constraints
   
   I’m making a number of assumptions/constraints in order to come to the 
proposed solution above. These include the following:
   
   1. A single object store is considered an Asset.
   2. Each time that a object lands in an object store or is updated, it should 
be considered an Asset Event. Each file object is NOT an Asset itself in this 
model (see below).
   3. There is a fundamental difference between watching an object store, and 
watching an object. These should be two different `AssetWatcher`'s.
   4. When an `AssetWatcher` runs, it should not alter the state of the object 
store.
   5. It is untenable to “scan” the entire object store each time the 
`AssetWatcher` runs. This would be too computationally expensive.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to