zilto commented on issue #1370:
URL: https://github.com/apache/hamilton/issues/1370#issuecomment-3285205955
## How caching works
Before execution, you know the `data_version` of your inputs and the
`code_version` of **all nodes**. What you don't know is the `data_version` of
any **non-input node**. To learn those `data_version`s, you need to apply the
caching logic while traversing the DAG.
In the DAG `A -> B -> C -> D`, to know whether `D` needs to be executed, you
need to resolve `C.data_version` (you already know `C.code_version`). To know
`C.data_version`, you need to resolve `B.data_version`. The key point is that
versions must be resolved in order.
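The ordered resolution above can be sketched in plain Python. This is a toy illustration, not Hamilton's actual implementation; `data_version`, `resolve`, and the cache-key shape are all hypothetical stand-ins:

```python
# Toy sketch (NOT Hamilton's code): in A -> B -> C -> D, a node's data_version
# can only be computed after its upstream data_version is known, so resolution
# must walk the DAG in topological order.
import hashlib


def data_version(value) -> str:
    """Hash a concrete value (stand-in for Hamilton's data versioning)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()[:8]


def resolve(dag, code_versions, cache, inputs):
    """Walk nodes in topological order; execute a node only on a cache miss."""
    versions = {name: data_version(v) for name, v in inputs.items()}
    results = dict(inputs)
    for node, (upstream, fn) in dag.items():  # dict order = declaration order
        # The cache key combines the node's code_version with its upstream
        # data_version, which we only know because we resolved upstream first.
        key = (code_versions[node], versions[upstream])
        if key in cache:
            results[node] = cache[key]             # cache hit: skip execution
        else:
            results[node] = fn(results[upstream])  # cache miss: execute
            cache[key] = results[node]
        # Only now is this node's data_version known, unblocking downstream.
        versions[node] = data_version(results[node])
    return results
```

For a linear chain like `{"B": ("A", f), "C": ("B", g), "D": ("C", h)}`, a second call to `resolve` with the same inputs and code versions hits the cache for every node and executes nothing.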
(The best summary I can provide is the picture in the previous reply)
## In short
> Put differently, couldn't you provide a dry-run feature that works under
the assumption that all data versions will be the same if all upstream code
versions have not changed?
As I understand, the goal of the visualization is to tell you "how caching
will behave before running the DAG". As explained in a previous reply, it
**is** possible, but almost all your nodes will be "maybe execute".
If this visualization makes different assumptions than the caching
algorithm, it will display something that diverges from the actual caching
behavior. The output will probably be more confusing than useful.
## Assumptions
### dry-run
> under the assumption that all data versions will be the same if all
upstream code versions have not changed?
If you already know the `code_version` and `data_version` of A, B, C, and D,
there's no need to execute the DAG. A `data_version` is a hash of a specific
Python object. If you know `D.data_version = XYZ123`, you can simply call
`cache.get(D.data_version)` to retrieve the value.
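To make the point concrete, here is a toy illustration (hypothetical cache, not Hamilton's internals): a fully known `data_version` is just a lookup key, so retrieval replaces execution entirely.

```python
# Toy illustration: if D.data_version is known up front, getting D's value
# is a dictionary lookup, not a DAG execution.
cache = {"XYZ123": 42}        # pretend a previous run stored D under "XYZ123"

d_data_version = "XYZ123"     # known ahead of time
value = cache.get(d_data_version)
assert value == 42            # no execution needed
```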
If you do know that `B = 4`, then you should use `Driver.execute(["D"],
overrides={"B": 4})`; caching will handle this knowledge properly. If you were
to implement the visualization I described in a previous reply, you would see
fewer "maybe execute" nodes, but still potentially a lot of them.
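Here is a small sketch of why a known override reduces "maybe execute". This is not Hamilton's API, just the idea: pinning `B`'s value fixes `B.data_version`, which makes `C`'s cache key computable without running anything upstream.

```python
# Hypothetical illustration of overrides and caching: with B pinned to 4,
# C's cache key (code_version, upstream data_version) is fully determined,
# so a prior cached result for C can be reused without executing B or C.
import hashlib


def data_version(value) -> str:
    """Stand-in for Hamilton's data versioning (hash of a Python object)."""
    return hashlib.sha256(repr(value).encode()).hexdigest()[:8]


code_versions = {"C": "v1"}
# Pretend an earlier run with B = 4 cached C = 8 under this key:
cache = {("v1", data_version(4)): 8}

override_b = 4  # equivalent in spirit to overrides={"B": 4}
c_key = (code_versions["C"], data_version(override_b))
assert c_key in cache       # C is resolvable without executing B or C
assert cache[c_key] == 8
```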
## Idempotency
> What is assumed to be capable of changing data if not code though?
The same code can produce different results (non-idempotency) when:
- it has a stochastic component (`random.choice`, ML model training);
- it does I/O and reads values from an external file;
- you use a different version of an external Python library (hypothetically,
`numpy.float` defaulting to 32-bit precision in `numpy==2.0.0` vs. 64-bit
precision in `numpy==3.0.0`).
By default, Hamilton assumes nodes are idempotent; it's the user's
responsibility to tag non-idempotent nodes.
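A minimal demonstration of why non-idempotency breaks prediction from code alone (toy hashing helper, not Hamilton's): the stochastic function below keeps the same `code_version` across calls, yet its output hash changes every run.

```python
# Same code, different data_version: a stochastic node defeats any attempt
# to predict data_version from code_version alone.
import hashlib
import random


def data_version(value) -> str:
    """Stand-in for Hamilton's data versioning."""
    return hashlib.sha256(repr(value).encode()).hexdigest()[:8]


def deterministic() -> int:
    return 2 + 2            # same output every call


def stochastic() -> float:
    return random.random()  # same code, different output each call


# Idempotent node: data_version is stable across runs.
assert data_version(deterministic()) == data_version(deterministic())

# Non-idempotent node: data_version differs (almost surely) between runs.
assert data_version(stochastic()) != data_version(stochastic())
```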