[GitHub] [airflow] dstandish commented on pull request #24743: Get dataset-driven scheduling working

GitBox Tue, 05 Jul 2022 21:19:57 -0700


dstandish commented on PR #24743:
URL: https://github.com/apache/airflow/pull/24743#issuecomment-1175763546


   > I think "any" is a more general and simpler first pass. If you go "all", 
and you have a really infrequently produced dataset, we are basically forced to 
provide more functionality initially
   
   I'm not sure I understand what you mean by providing more functionality, but 
it sounds like you mean that if we go with "all" we would have to provide a way 
to declare the "infrequent" dataset as a dependency yet ignore it under certain 
circumstances.  I don't think we would.
   
   I think that if it's updated very infrequently, you simply don't add it as a 
dependency.  In practice, in this kind of situation you know the other datasets 
are updated more frequently, so it likely doesn't matter when the "infrequent" 
dataset is updated -- it's going to get processed soon anyway.
   
   I think "all" is likely the more needed pattern in the wild.  E.g. in a data 
warehousing scenario, if your fact table depends on 4 dim tables, you don't 
want to run until they've all been processed.  
   
   If you don't care that 9 out of 10 of your upstream datasets have not yet 
been updated, then there's a good chance a time-based schedule would work just 
fine.  But conversely, if you want to wait for N datasets to be updated, and 
then run immediately, a schedule really does _not_ work well.
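   
   To make the difference concrete, here is a minimal sketch of the two 
trigger semantics in plain Python.  It is not Airflow code: `DatasetTrigger`, 
`record_update`, and the `dim_*` names are hypothetical and only illustrate the 
fact-table example above, assuming the waiting window resets after each run.

```python
from dataclasses import dataclass, field
from typing import FrozenSet, Set


@dataclass
class DatasetTrigger:
    """Hypothetical helper (not an Airflow API) deciding when a downstream DAG runs."""

    upstream: FrozenSet[str]
    mode: str = "all"  # "all": wait for every upstream; "any": run on each update
    updated: Set[str] = field(default_factory=set)

    def record_update(self, dataset: str) -> bool:
        """Record an update event and return True if the DAG should run now."""
        if dataset not in self.upstream:
            return False  # not a declared dependency, so it never triggers a run
        self.updated.add(dataset)
        if self.mode == "any":
            ready = True
        else:
            ready = self.updated == set(self.upstream)
        if ready:
            self.updated.clear()  # assumed behaviour: start a fresh window per run
        return ready


dims = frozenset({"dim_a", "dim_b", "dim_c", "dim_d"})
events = ["dim_a", "dim_b", "dim_a", "dim_c", "dim_d"]

wait_for_all = DatasetTrigger(upstream=dims, mode="all")
all_results = [wait_for_all.record_update(name) for name in events]

run_on_any = DatasetTrigger(upstream=dims, mode="any")
any_results = [run_on_any.record_update(name) for name in events]
```

   With "all", the fact table runs once, right after the fourth dim table 
lands; with "any", it runs five times, four of them against partially updated 
inputs.  Leaving an infrequently updated dataset out of `upstream` is all it 
takes to stop it from blocking the "all" case.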
   
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
