dstandish commented on PR #24743: URL: https://github.com/apache/airflow/pull/24743#issuecomment-1175763546
> I think "any" is a more general and simpler first pass. If you go "all", and you have a really infrequently produced dataset, we are basically forced to provide more functionality initially

Not sure if I understand you correctly re providing more functionality, but it sounds like you mean that if we go with "all" we would have to provide a way to have the "infrequent" dataset as a dependency but ignored under certain circumstances. I don't think we do. I think that if it's updated very infrequently, you simply don't add it as a dependency. In practice, I think in this kind of situation you know the other datasets are triggered more frequently, so likely it doesn't matter when the "infrequent" dataset is updated -- it's gonna get processed soon anyway.

I think "all" is likely the more needed pattern in the wild. E.g. in a data warehousing scenario, if your fact table depends on 4 dim tables, you don't want to run until they've all been processed.

If you don't care if 9 out of 10 of your upstream datasets have not yet been updated, then there's a good chance a schedule would work just fine. But conversely, if you want to wait for N datasets to be updated, and then run immediately, in that case a schedule really does _not_ work well.
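To make the "any" vs "all" distinction concrete, here is a minimal sketch in plain Python. It is not Airflow code and does not reflect what the PR implements: `DatasetTrigger`, `record_update`, and the dataset names are all hypothetical, chosen only to illustrate the fact-table example above.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DatasetTrigger:
    """Hypothetical helper: tracks upstream dataset updates since the last run."""

    upstream: frozenset[str]
    policy: str = "all"  # "all" waits for every upstream; "any" fires on each update
    _updated: set[str] = field(default_factory=set)

    def __post_init__(self) -> None:
        if self.policy not in ("all", "any"):
            raise ValueError(f"unknown policy: {self.policy!r}")

    def record_update(self, dataset: str) -> bool:
        """Record an update and return True if a run should start now."""
        if dataset not in self.upstream:
            return False  # not a dependency, so it never triggers a run
        self._updated.add(dataset)
        if self.policy == "any":
            ready = True
        else:
            ready = self._updated == set(self.upstream)
        if ready:
            # Start counting from scratch for the next run.
            self._updated.clear()
        return ready


# Assumed example: a fact table that depends on 4 dim tables.
dims = frozenset({"dim_customer", "dim_date", "dim_product", "dim_store"})

fact_all = DatasetTrigger(dims, policy="all")
runs_all = [fact_all.record_update(d) for d in sorted(dims)]

fact_any = DatasetTrigger(dims, policy="any")
runs_any = [fact_any.record_update(d) for d in sorted(dims)]
```

Under "all", only the fourth update starts a run, which is the behaviour the fact-table case wants. Under "any", every update starts a run, so the fact table would be rebuilt against partially refreshed dims.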
