zero323 commented on a change in pull request #27406:
[SPARK-30681][PYSPARK][SQL] Add higher order functions API to PySpark
URL: https://github.com/apache/spark/pull/27406#discussion_r376355854
##########
File path: python/pyspark/sql/functions.py
##########
@@ -2840,6 +2840,463 @@ def from_csv(col, schema, options={}):
return Column(jc)
+def _unresolved_named_lambda_variable(*name_parts):
+ """
+ Create `o.a.s.sql.expressions.UnresolvedNamedLambdaVariable`,
+ convert it to o.s.sql.Column and wrap in Python `Column`
+
+ :param name_parts: str
+ """
+ sc = SparkContext._active_spark_context
+ name_parts_seq = _to_seq(sc, name_parts)
+ expressions = sc._jvm.org.apache.spark.sql.catalyst.expressions
+ return Column(
+ sc._jvm.Column(
+ expressions.UnresolvedNamedLambdaVariable(name_parts_seq)
+ )
+ )
+
+
+def _get_lambda_parameters(f):
Review comment:
I get you point but right now I consider this core logic. We can shuffle
things around and shave a line or two, but I don't see it improving overall
"reviewability".
Somewhere here we have to decide out how many placeholders should be
provided (if it wasn't for polymorphic behavior of some variants, we could just
assume ‒ if we for example had `transform` and `transform_with_index`. But now
have to do it at least for a few of functions, and there might be more to
come). So we can only push things around and
```python
import inspect
signature = inspect.signature(f)
parameters = signature.parameters.values()
```
as well as the legacy part (I hope it is going to be dropped before 3.1
release anyway) are here to stay (could be inlined, but personally I think it
would makes things worse not better).
If we wanted to condense things as much as possible, it could be condensed
into something like
```python
args = [
_unresolved_named_lambda_variable(f"x{i}") for i, _ in
enumerate(
inspect.signature(f).parameters if sys.version_info >= (3, 3)
else inspect.getargspec(f)
)
]
```
but I don't think that it makes things any clearer, do you? Also leaves the
below.
If we keep just this part and leave variadic argument validation out, I'd be
concerned that it will to all kinds of confusing or potentially incorrect
behaviors. Users are extremely imaginative and delayed resolution doesn't
help*.
And since it is neither heavy in logic nor duplicates (for some definition
of duplication) JVM logic, it shouldn't be controversial.
----------------
\* We can be pretty sure that attempts to apply
some-general-purpose-library in func will happen, if this is merged.
And let's be honest ‒ higher order functions are already pretty hard to work
with. There are sensitive to minor type mismatches (numeric precision and such)
and corresponding expressions are pretty large.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
With regards,
Apache Git Services
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]