[
https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445988#comment-17445988
]
Tim Schwab commented on SPARK-37348:
------------------------------------
Fair enough. The reasoning for adding a function given in the comment you
linked is exactly my reasoning: it turns a runtime check into a compile-time
check. That, and a native function would integrate cleanly with the rest of
PySpark. (For example, F.expr() does not compose well in the middle of a chain
of other functions; I would have to either rewrite the whole expression inside
F.expr() or break it into several intermediate columns. Not a big deal,
obviously, but still not ideal. See the sketch below.)
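A rough sketch of those two workarounds, using a hypothetical column idx and a
placeholder modulus of 7; the last, commented-out line is the proposed native
form, which does not exist in PySpark today:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0,), (2,), (5,)], ["idx"])  # placeholder data

# Workaround 1: push the whole expression into SQL text so pmod() is available.
wrapped = df.withColumn("wrapped", F.expr("pmod(idx - 3, 7)"))

# Workaround 2: keep the Column API, but park the intermediate result in its
# own column and refer to it by name inside F.expr().
wrapped = (df.withColumn("shifted", F.col("idx") - 3)
             .withColumn("wrapped", F.expr("pmod(shifted, 7)"))
             .drop("shifted"))

# What this ticket asks for: a function that composes like any other, e.g.
# df.withColumn("wrapped", F.pmod(F.col("idx") - 3, F.lit(7)))  # proposed only
{code}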
As for whether it is commonly used, I am not sure how to validate that one way
or the other. However, the majority of uses of the % operator in computing are
after the modulus rather than the remainder: they expect a result in [0, n),
not in (-n, n). Most of those uses also have a domain of non-negative numbers
anyway, so the two definitions coincide. But when the domain does include
negative numbers, the modulus is usually what is wanted, because % is most
often used to map a larger number onto a smaller range [0, n). The
counterexample is cryptographic code, which works fine with remainders, but I
would not expect manual implementations of cryptographic functions on RDDs or
DataFrames to be common.
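To make the range difference concrete, here is a small plain-Python sketch
(the JVM-style remainder is imitated with math.fmod, which also truncates
toward zero):

{code:python}
import math

n = 5
for x in [7, 3, 0, -3, -7]:
    remainder = int(math.fmod(x, n))  # truncated: sign follows x, like the JVM's %
    modulus = x % n                   # floored: Python's %, and Spark SQL's pmod()
    print(f"x={x:3d}  remainder={remainder:3d}  modulus={modulus}")

# For negative x the remainder is negative (-3 and -2 here), which is useless
# as an index into [0, n); the modulus keeps every result in [0, n).
{code}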
So, as far as I can see, when the domain includes negative numbers, the
modulus is usually what is desired rather than the remainder. And Spark
happens to include a very commonly used function whose range includes negative
numbers and whose output is often fed into %: hash(). That is in fact the
exact use case that brought me here; I want to map hash() outputs onto [0, n)
rather than (-n, n). For that use case alone I think pmod is worth adding to
PySpark.
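For reference, this is roughly what that looks like today via the SQL function
and F.expr(); the DataFrame, the user_id column, and the bucket count of 32
are all placeholders:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["user_id"])  # placeholder data

# hash() returns a signed 32-bit integer, so plain % can yield negative bucket
# ids; Spark SQL's pmod() keeps every bucket id in [0, 32).
with_buckets = df.withColumn("bucket", F.expr("pmod(hash(user_id), 32)"))
with_buckets.show()

# The same expression with % alone can land anywhere in (-32, 32):
# df.withColumn("bucket", F.hash("user_id") % F.lit(32))
{code}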
In addition, Python's % operator is a modulus rather than a remainder, unlike
the JVM's, so I would expect Python users to feel the need for pmod() more
often than, say, Scala users.
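A minimal sketch of that mismatch, assuming nothing beyond a local
SparkSession:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

print(-1 % 2)  # 1: Python's % is a (floored) modulus

# The same expression on Columns follows the JVM's truncated remainder:
spark.range(1).select((F.lit(-1) % F.lit(2)).alias("rem")).show()
# +---+
# |rem|
# +---+
# | -1|
# +---+
{code}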
> PySpark pmod function
> ---------------------
>
> Key: SPARK-37348
> URL: https://issues.apache.org/jira/browse/SPARK-37348
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Tim Schwab
> Priority: Minor
>
> Because Spark is built on the JVM, F.lit(-1) % F.lit(2) returns -1 in
> PySpark. However, the modulus is often desired instead of the remainder.
>
> There is a [PMOD() function in Spark
> SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in
> PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions].
> So at the moment, the two options for getting the modulus are to use
> F.expr("pmod(A, B)") or to create a helper function such as:
>
> {code:python}
> from pyspark.sql import functions as F
>
> def pmod(dividend, divisor):
>     # Shift negative remainders up by the divisor to land in [0, divisor).
>     remainder = dividend % divisor
>     return F.when(remainder < 0, remainder + divisor).otherwise(remainder)
> {code}
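>
> A quick sanity check of the helper (assuming an active SparkSession named
> spark):
>
> {code:python}
> spark.range(1).select(pmod(F.lit(-1), F.lit(2)).alias("m")).show()
> # +---+
> # |  m|
> # +---+
> # |  1|
> # +---+
> {code}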
>
>
> Neither is ideal; pmod should be native to PySpark, as it is in Spark SQL.