[GitHub] [spark] zero323 commented on a change in pull request #27406: [SPARK-30681][PYSPARK][SQL] Add higher order functions API to PySpark

GitBox Fri, 07 Feb 2020 04:00:28 -0800

zero323 commented on a change in pull request #27406: 
[SPARK-30681][PYSPARK][SQL] Add higher order functions API to PySpark
URL: https://github.com/apache/spark/pull/27406#discussion_r376355854


 ##########
 File path: python/pyspark/sql/functions.py
 ##########
 @@ -2840,6 +2840,463 @@ def from_csv(col, schema, options={}):
     return Column(jc)
 
 
+def _unresolved_named_lambda_variable(*name_parts):
+    """
+    Create `o.a.s.sql.expressions.UnresolvedNamedLambdaVariable`,
+    convert it to o.s.sql.Column and wrap in Python `Column`
+
+    :param name_parts: str
+    """
+    sc = SparkContext._active_spark_context
+    name_parts_seq = _to_seq(sc, name_parts)
+    expressions = sc._jvm.org.apache.spark.sql.catalyst.expressions
+    return Column(
+        sc._jvm.Column(
+            expressions.UnresolvedNamedLambdaVariable(name_parts_seq)
+        )
+    )
+
+
+def _get_lambda_parameters(f):
 
 Review comment:
   I get you point but right now I consider this core logic. We can shuffle 
things around and shave a line or two, but I don't see it improving overall 
"reviewability".
   
   Somewhere here we have to decide out how many placeholders  should be 
provided (if it wasn't for polymorphic behavior of some variants, we could just 
assume ‒ if we for example had `transform` and `transform_with_index`. But now 
have to do it at least for a few of functions, and there might be more to 
come). So we can only push things around and
   
   ```python
       import inspect
   
       signature = inspect.signature(f)
       parameters = signature.parameters.values()
   ```
   
   as well as the legacy part (I hope it is going to be dropped before 3.1 
release anyway) are here to stay (could be inlined, but personally I think it 
would makes things worse not better).
   
   If we wanted to condense things as much as possible, it could be condensed 
into something like
   
   ```python
   
       args = [
           _unresolved_named_lambda_variable(f"x{i}") for i, _ in 
           enumerate(
             inspect.signature(f).parameters if sys.version_info >= (3, 3) 
             else inspect.getargspec(f) 
           )
       ]
   
   ```
   
   but I don't think that it makes things any clearer, do you?  Also leaves the 
below.
   
   
   If we keep just this part and leave variadic argument validation out, I'd be 
concerned that it will to all kinds of confusing or potentially incorrect 
behaviors. Users are extremely imaginative and delayed resolution doesn't 
help*. 
   
   And since it is neither heavy in logic nor duplicates (for some definition 
of duplication) JVM logic, it shouldn't be controversial. 
   
   
   
   
   ----------------
   \*  We can be pretty sure that  attempts to apply 
some-general-purpose-library in func will happen, if this is merged. 
   
   And let's be honest ‒ higher order functions are already pretty hard to work 
with. There are sensitive to minor type mismatches (numeric precision and such) 
and corresponding expressions are pretty large.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] zero323 commented on a change in pull request #27406: [SPARK-30681][PYSPARK][SQL] Add higher order functions API to PySpark

Reply via email to