jroachgolf84 commented on issue #49900: URL: https://github.com/apache/airflow/issues/49900#issuecomment-2835642538
This is definitely that I've been thinking about a bit. We're going to have to some sort of a "time-based" approach to this. I've drafted up some thoughts here; this is WIP. ### Proposed Solution - An `ObjectStore` (S3, WASB, etc.) AssetWatcher is written, and leverages some Trigger. Really, this Trigger will be doing most (all) of the work. - This Trigger pulls from the `asset` endpoint to determine the `last_asset_event_timestamp` for that asset (see "Changes Needed" below, I am close to closing this issue out). - The trigger retrieves a list of all objects in the object store with a “last updated” timestamp `>= timestamp` of the “last Asset Event”. - For each of these resulting objects, an Asset Event is written to the `asset_event` endpoint (using timestamp that the object itself landed/was updated in the object store), and yielded to trigger a DAG Run. *Note that the implementation of #47164 will include a write to the `/assets` endpoint to update the `last_asset_event_id` and `last_asset_event_timestamp`. ### Changes Needed Outside of the general development of this `ObjectStore` AssetWatcher, a few other changes are needed in order to implement this pattern. - The addition of a `last_asset_event` field to the `asset` endpoint. See #47164 . I chatted with Pierre; he implied that this would be the desired way to go about this, so I’m going to go ahead and add to the issue. ## Assumptions/Constraints I’m making a number of assumptions/constraints in order to come to the proposed solution above. These include the following: 1. A single object store is considered an Asset. 2. Each time that a object lands in an object store or is updated, it should be considered an Asset Event. Each file object is NOT an Asset itself in this model (see below). 3. There is a fundamental difference between watching an object store, and watching an object. These should be two different `AssetWatcher`'s. 4. When an `AssetWatcher` runs, it should not alter the state of the object store. 5. It is untenable to “scan” the entire object store each time the `AssetWatcher` runs. This would be too computationally expensive. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
