GitHub user gidhubuser255 edited a comment on the discussion: Parallelization Enhancement Ideas

(Reposting comment with updated content)

1) Not necessarily a single node; it could be an entire "branch" (what I think you are referring to as a sub-DAG) where some nodes in the branch are sensitive to the parameter being parallelized (these would be represented as separate nodes in the graph) and some are not (these would be single nodes shared by the parallelized branches).

Contrived example below:

> basket_price <- price(product) for product in basket
> price(product) <- production_cost(product), markup(product)
> production_cost(product) <- material_cost(product), **labor_cost**, 
> **overhead**
> markup(product) <- competitor_price(product), **brand_premium**, 
> demand(product)
> ...
> ...

(basket_price itself can be a dependency of another node, and the nodes insensitive to the parallelized variable (product) are bolded.)
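For concreteness, here is how that graph might look as plain Python (not Hamilton code; the functions and numbers are made up purely for illustration):

```python
# Sketch of the pricing graph above in plain Python (illustrative values only).
# Functions taking `product` are sensitive to the parallelized parameter;
# the argument-free functions are the shared, product-insensitive nodes.

MATERIAL = {"mug": 2.0, "shirt": 4.0}
COMPETITOR = {"mug": 8.0, "shirt": 15.0}
DEMAND = {"mug": 1.1, "shirt": 0.9}

def labor_cost() -> float:      # product-insensitive, shared
    return 1.5

def overhead() -> float:        # product-insensitive, shared
    return 0.5

def brand_premium() -> float:   # product-insensitive, shared
    return 2.0

def production_cost(product: str) -> float:
    return MATERIAL[product] + labor_cost() + overhead()

def markup(product: str) -> float:
    return (COMPETITOR[product] + brand_premium()) * DEMAND[product]

def price(product: str) -> float:
    return production_cost(product) + markup(product)

def basket_price(basket: list[str]) -> float:
    # "price(product) for product in basket" fans out once per product
    return sum(price(p) for p in basket)
```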

2) OK, so here is how I would foresee this working:

Normally a Hamilton function declares its dependencies through its arguments: during the graph build, the function's parameters are inspected and each one is mapped to another function. With the proposed "driver aware" functions, the graph build would instead invoke the function with a pseudo-driver object that records a dependency each time `driver.call('dep')` is invoked, and exits the function when it reaches the `Collect()`. Then, during execution, the function would be invoked with a different pseudo-driver object that pulls each result from a pre-populated dict. Whether a function is driver aware is easy to check, so the executor can switch between the standard argument-inspection method and the new proposed method.
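The two pseudo-drivers could be sketched roughly like this (all names here are hypothetical, not actual Hamilton API, and the module-level build flag is a simplification):

```python
# Hypothetical sketch: a recording pseudo-driver for the graph-build stage
# and an executing pseudo-driver for the execution stage.

class _CollectSignal(Exception):
    """Raised by Collect() during graph build to exit the function early."""

class RecordingDriver:
    """Graph-build stage: records dependency names, returns dummy values."""
    def __init__(self):
        self.dependencies = []

    def call(self, name, type=None):
        self.dependencies.append(name)
        return None  # dummy; the body past Collect() is never reached

class ExecutingDriver:
    """Execution stage: pulls real results from a pre-populated dict."""
    def __init__(self, results):
        self._results = results

    def call(self, name, type=None):
        return self._results[name]

_build_mode = False  # simplification: toggled while tracing dependencies

def Collect():
    if _build_mode:
        raise _CollectSignal()

def trace_dependencies(fn):
    """Invoke fn with a recording driver; Collect() aborts the body."""
    global _build_mode
    driver = RecordingDriver()
    _build_mode = True
    try:
        fn(driver)
    except _CollectSignal:
        pass
    finally:
        _build_mode = False
    return driver.dependencies

def A(driver):
    b = driver.call('B', type=int)
    c = driver.call('C', type=int)
    Collect()
    return b + c
```

With this, `trace_dependencies(A)` yields `['B', 'C']` at build time, while `A(ExecutingDriver({'B': 1, 'C': 2}))` runs the body for real.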

From an end-user POV it would look something like this. Take the existing Hamilton function:

```
def A(B: int, C: int) -> int:
    return B + C
```

Could be rewritten as:

```
def A(driver: Driver):
    # Driver would need to be a Protocol or similar, since graph build and
    # execution get different pseudo-drivers. It doesn't really matter,
    # though, since the type here wouldn't be used for anything.
    b = driver.call('B', type=int)  # not sure how best to do the types;
                                    # some other ideas listed below
    c = driver.call('C', type=int)
    Collect()  # exit here in the graph build stage, ignore in the execution stage
    return b + c
```

In this case there would be no point in doing so, but you could then add arguments to A's signature that are passed through as regular parameters rather than resolved as dependencies. This means the Node object definition also needs to contain these parameter values. You can also see that the concept doesn't need to be restricted to parallelization: it could be used in non-parameterized cases where you want some control flow to determine which dependencies to call, e.g.:

```
def A(driver):
    b = driver.call('B', type=int)
    if <some condition>:
        c = driver.call('C1', type=int)
    else:
        c = driver.call('C2', type=int)
    Collect()
    return b + c
```

That said, most such cases can already be handled by config.when.

To enumerate some of the ideas for how type could be specified:

```
d = driver.call('dep', type=int)
d = driver.type(int).call('dep')
d = driver.call[int]('dep')
```
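All three spellings could even coexist on one hypothetical Driver. Here is a rough sketch (the class and its behavior are illustrative only, not a proposed implementation):

```python
# Hypothetical Driver supporting all three type-specification spellings:
#   driver.call('dep', type=int)
#   driver.type(int).call('dep')
#   driver.call[int]('dep')
from types import SimpleNamespace

class _Call:
    """Callable that also supports subscription: driver.call[int]('dep')."""
    def __init__(self, driver, typ=None):
        self._driver = driver
        self._typ = typ

    def __call__(self, name, type=None):
        return self._driver._resolve(name, type or self._typ)

    def __getitem__(self, typ):
        return _Call(self._driver, typ)

class Driver:
    def __init__(self, results):
        self._results = results

    def _resolve(self, name, typ):
        value = self._results[name]
        if typ is not None and not isinstance(value, typ):
            raise TypeError(f"{name} is not a {typ.__name__}")
        return value

    @property
    def call(self):
        return _Call(self)

    def type(self, typ):
        # driver.type(int).call('dep') binds the type first
        return SimpleNamespace(call=_Call(self, typ))
```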

GitHub link: 
https://github.com/apache/hamilton/discussions/1412#discussioncomment-14821859
