Github user mstewart141 commented on the issue:
https://github.com/apache/spark/pull/20900
Many of the limitations of `pandas_udf` relative to UDF in this domain
(though not all; I don't think `callable`s are impacted) are due to the fact
that `pandas_udf` doesn't allow keyword arguments at the call site. This
obviously impacts plain old function-based `pandas_udf`s, but also partial fns,
where one would typically need to specify the argument being partially
applied by name.
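To make the partial case concrete, here is a minimal plain-Python sketch with no Spark involved. The function `add_scaled` and its parameters are made up for illustration; it only shows why partially applying a non-leading parameter means naming it, and what a keyword call to the result looks like.

```python
import functools


def add_scaled(a, b, scale):
    """Toy stand-in for a UDF body; operates on plain numbers here."""
    return (a + b) * scale


# Fixing a parameter that is not the first one requires naming it.
scaled_by_two = functools.partial(add_scaled, scale=2)

# The remaining parameters can be supplied positionally...
positional_result = scaled_by_two(1, 3)

# ...or by keyword, which is the call style discussed above.
keyword_result = scaled_by_two(b=3, a=1)
```

In ordinary Python both calls are equivalent; the limitation described above is about what the `pandas_udf` call site accepts, not about `functools.partial` itself.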
In the incremental commits of this PR as of:
https://github.com/apache/spark/pull/20900/commits/9ea2595f0cecb0cd05e0e6b99baf538679332e8b
you can see the kind of things I was investigating to try to fix that
case. Indeed, my original PR was (ambitiously) titled something about enabling
kw args for all `pandas_udf`s. This is actually very easy to do for *functions*
on python3 (and maybe for plain functions in py2 also, but I suspect that is
rather tricky, as `getargspec` is pretty unhelpful when it comes to some of
the kw-arg metadata one would need). However, it is rather harder for the
partial function case, as one quickly gets into stacktraces from places like
`python/pyspark/worker.py`, where the current strategy does not
realize that a column from the arguments list may already be "accounted for",
and as a result one runs into duplicate arguments being passed for, e.g., `a`.
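The duplicate-argument failure can be reproduced in plain Python, independent of Spark. In the sketch below, `combine` and `column_values` are hypothetical stand-ins: the tuple plays the role of a caller that forwards every remaining column positionally, without knowing that `a` was already fixed by name. It is not the actual code path in `python/pyspark/worker.py`.

```python
import functools


def combine(a, b):
    return a + b


# `a` is fixed by name, so only `b` is left to supply.
with_a_fixed = functools.partial(combine, a=10)

# Hypothetical stand-in for a caller that passes the one remaining
# column positionally, intending it to be `b`.
column_values = (1,)

try:
    # The positional value lands on `a`, which the partial also supplies
    # by keyword, so Python rejects the call.
    with_a_fixed(*column_values)
    error_message = None
except TypeError as exc:
    error_message = str(exc)
```

The positional value is bound to the first parameter, `a`, so the call fails with a `TypeError` about multiple values for `a` rather than filling in `b`.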
My summary is that the change to allow kw for functions is simple (at least
in py3 -- indeed my incremental commit referenced above does this), but for
partial fns maybe not so much. I'm pretty confident I'm most of the way to
accomplishing the former, but not the latter.
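As a rough illustration of why the plain-function case is tractable on python3, the sketch below uses `inspect.signature` to turn a keyword call into positional order. The helper `to_positional` and the example function `weighted_sum` are hypothetical, and this is not claimed to be what the referenced commit does; it also assumes a function without `*args` or `**kwargs`.

```python
import inspect


def to_positional(func, *args, **kwargs):
    """Hypothetical helper: map a mixed positional/keyword call onto
    the function's declared parameter order."""
    sig = inspect.signature(func)
    bound = sig.bind(*args, **kwargs)  # raises TypeError on a bad call
    bound.apply_defaults()             # fill in any omitted defaults
    return [bound.arguments[name] for name in sig.parameters]


def weighted_sum(a, b, weight=1):
    return (a + b) * weight


# Strings stand in for column references here.
ordered = to_positional(weighted_sum, b="col_b", a="col_a")
```

Once the arguments are in declaration order, they could in principle be handed to code that only understands positional columns.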
However, I have no substantial knowledge of the pyspark codebase so you
will likely have better luck there, should you go down that route :)
**TL;DR**: I could work on a PR to allow keyword arguments for python3
functions (not partials, and not py2), but that is likely too narrow a goal
given the broader context.
One general question: how do we tend to think about the py2/3 split for api
quirks/features? Must everything that is added for py3 also be functional in
py2?