zilto commented on issue #1370:
URL: https://github.com/apache/hamilton/issues/1370#issuecomment-3285205955
## How caching works
Before execution, you know the `data_version` of your inputs and the
`code_version` of **all nodes**. What you don't know is the `data_version` of
any **non-input node**. To learn those `data_version`s, you need to apply the
caching logic while traversing the DAG.
In the DAG `A -> B -> C -> D`, to know whether `D` needs to be executed, you
need to resolve `C.data_version` (you already know `C.code_version`). To know
`C.data_version`, you need to resolve `B.data_version`. The key point is that
versions must be resolved in order.
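The ordered resolution above can be sketched in plain Python. This is a toy illustration, not Hamilton's actual implementation; `data_version`, `resolve`, and the cache-key shape are all hypothetical stand-ins:

```python
# Toy sketch (NOT Hamilton's code): in A -> B -> C -> D, a node's data_version
# can only be computed after its upstream data_version is known, so resolution
# must walk the DAG in topological order.
import hashlib


def data_version(value) -> str:
    """Hash a concrete value (stand-in for Hamilton's data versioning)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()[:8]


def resolve(dag, code_versions, cache, inputs):
    """Walk nodes in topological order; execute a node only on a cache miss."""
    versions = {name: data_version(v) for name, v in inputs.items()}
    results = dict(inputs)
    for node, (upstream, fn) in dag.items():  # dict order = declaration order
        # The cache key combines the node's code_version with its upstream
        # data_version, which we only know because we resolved upstream first.
        key = (code_versions[node], versions[upstream])
        if key in cache:
            results[node] = cache[key]             # cache hit: skip execution
        else:
            results[node] = fn(results[upstream])  # cache miss: execute
            cache[key] = results[node]
        # Only now is this node's data_version known, unblocking downstream.
        versions[node] = data_version(results[node])
    return results
```

For a linear chain like `{"B": ("A", f), "C": ("B", g), "D": ("C", h)}`, a second call to `resolve` with the same inputs and code versions hits the cache for every node and executes nothing.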
(The best summary I can provide is the picture in the previous reply)
## In short
> Put differently, couldn't you provide a dry-run feature that works under
the assumption that all data versions will be the same if all upstream code
versions have not changed?
As I understand, the goal of the visualization is to tell you "how caching
will behave before running the DAG". As explained in a previous reply, it
**is** possible, but almost all your nodes will be "maybe execute".
If this visualization makes different assumptions than the caching
algorithm, it will display something that diverges from the actual caching
behavior. The output will probably be more confusing than useful.
## Assumptions
### dry-run
> under the assumption that all data versions will be the same if all
upstream code versions have not changed?
If you already know the `code_version` and `data_version` of A, B, C, and D,
there's no need to execute the DAG. A `data_version` is a hash of a specific
Python object. If you know `D.data_version = XYZ123`, you can simply call
`cache.get(D.data_version)` to retrieve the value.
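To make the point concrete, here is a toy illustration (hypothetical cache, not Hamilton's internals): a fully known `data_version` is just a lookup key, so retrieval replaces execution entirely.

```python
# Toy illustration: if D.data_version is known up front, getting D's value
# is a dictionary lookup, not a DAG execution.
cache = {"XYZ123": 42}        # pretend a previous run stored D under "XYZ123"

d_data_version = "XYZ123"     # known ahead of time
value = cache.get(d_data_version)
assert value == 42            # no execution needed
```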
If you do know that `B = 4`, then you should use `Driver.execute(["D"],
overrides={"B": 4})`; caching will handle this knowledge properly. If you were
to implement the visualization I described in a previous reply, you would see
fewer "maybe execute" nodes, but still potentially a lot of them.
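Here is a small sketch of why a known override reduces "maybe execute". This is not Hamilton's API, just the idea: pinning `B`'s value fixes `B.data_version`, which makes `C`'s cache key computable without running anything upstream.

```python
# Hypothetical illustration of overrides and caching: with B pinned to 4,
# C's cache key (code_version, upstream data_version) is fully determined,
# so a prior cached result for C can be reused without executing B or C.
import hashlib


def data_version(value) -> str:
    """Stand-in for Hamilton's data versioning (hash of a Python object)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()[:8]


code_versions = {"C": "v1"}
# Pretend an earlier run with B = 4 cached C = 8 under this key:
cache = {("v1", data_version(4)): 8}

override_b = 4  # equivalent in spirit to overrides={"B": 4}
c_key = (code_versions["C"], data_version(override_b))
assert c_key in cache       # C is resolvable without executing B or C
assert cache[c_key] == 8
```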
## Idempotency
> What is assumed to be capable of changing data if not code though?
The same code can produce different results (non-idempotency) when:
- it has a stochastic component (`random.choice`, ML model training);
- it does I/O and reads values from an external file;
- you use a different version of an external Python library (hypothetically,
`numpy.float` defaulting to 32-bit precision in `numpy==2.0.0` vs. 64-bit
precision in `numpy==3.0.0`).
By default, Hamilton assumes nodes are idempotent; it's the user's
responsibility to tag non-idempotent nodes.
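A minimal demonstration of why non-idempotency breaks prediction from code alone (toy hashing helper, not Hamilton's): the stochastic function below keeps the same `code_version` across calls, yet its output hash changes every run.

```python
# Same code, different data_version: a stochastic node defeats any attempt
# to predict data_version from code_version alone.
import hashlib
import random


def data_version(value) -> str:
    """Stand-in for Hamilton's data versioning."""
    return hashlib.sha256(repr(value).encode()).hexdigest()[:8]


def deterministic() -> int:
    return 2 + 2            # same output every call


def stochastic() -> float:
    return random.random()  # same code, different output each call


# Idempotent node: data_version is stable across runs.
assert data_version(deterministic()) == data_version(deterministic())

# Non-idempotent node: data_version differs (almost surely) between runs.
assert data_version(stochastic()) != data_version(stochastic())
```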