santosh-d3vpl3x commented on code in PR #39902:
URL: https://github.com/apache/spark/pull/39902#discussion_r1103146476
##########
python/pyspark/worker.py:
##########
@@ -208,6 +208,41 @@ def wrapped(left_key_series, left_value_series, right_key_series, right_value_se
     return lambda kl, vl, kr, vr: [(wrapped(kl, vl, kr, vr), to_arrow_type(return_type))]
+def wrap_multi_cogrouped_map_pandas_udf(f, return_type, runner_conf, argspec):
+ def wrapped(key_series, value_series):
+ import pandas as pd
+
+ dfs = [pd.concat(series, axis=1) for series in value_series]
+
+ if runner_conf.get("pass_key") == "true":
Review Comment:
```python
def func(key, *pdfs: pd.DataFrame) -> pd.DataFrame
```
I would really like to make this the default; then we do not need any
**implicit assumption** about when the first arg is the key and when it is not.
If replacing the existing way is not an option, then I would like to keep
`pass_key` as an **explicit** way for the user to instruct Spark to treat the
first arg as the key. Hopefully that clears up the meaning of explicitness of
the API that I have in mind.
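For concreteness, a minimal sketch of the two signatures under discussion (the function names here are illustrative, not Spark's actual API):

```python
import pandas as pd

# Hypothetical user functions illustrating the two shapes being debated;
# this is a sketch, not pyspark's actual cogrouped-map API surface.

def func(*pdfs: pd.DataFrame) -> pd.DataFrame:
    # Default: every positional arg is a cogrouped DataFrame, no key.
    return pd.concat(pdfs, ignore_index=True)

def func_with_key(key, *pdfs: pd.DataFrame) -> pd.DataFrame:
    # Explicit opt-in (e.g. via a pass_key-style flag): the first
    # positional arg is the grouping key, the rest are DataFrames.
    return pd.concat(pdfs, ignore_index=True)

left = pd.DataFrame({"v": [1, 2]})
right = pd.DataFrame({"v": [3]})
print(len(func(left, right)))                   # 3 rows
print(len(func_with_key(("k",), left, right)))  # 3 rows
```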
> Then, len(argspec.args) will be 0 for func and 1 for func_with_key and
argspec.args will not be None in both cases. So restricting var-args cases to
above signatures (not allowing def func(pdf: pd.DataFrame, *pdfs: pd.DataFrame)
-> pd.DataFrame) should make pass_key redundant.
This seems quite complicated and not very user-friendly [higher cognitive
load].
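The argspec-based detection described in the quote can be sketched with `inspect.getfullargspec`, assuming the two restricted signatures above (a sketch, not the proposed implementation):

```python
import inspect
import pandas as pd

# With only the two restricted signatures allowed, len(argspec.args)
# distinguishes them (0 vs 1), which is what would make a separate
# pass_key flag redundant. Names are illustrative.

def func(*pdfs: pd.DataFrame) -> pd.DataFrame:
    return pd.concat(pdfs)

def func_with_key(key, *pdfs: pd.DataFrame) -> pd.DataFrame:
    return pd.concat(pdfs)

for f in (func, func_with_key):
    spec = inspect.getfullargspec(f)
    # *pdfs lands in spec.varargs, not spec.args, so spec.args holds
    # only the leading positional parameter (the key), if any.
    pass_key = len(spec.args) == 1
    print(f.__name__, pass_key)  # func False / func_with_key True
```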
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]