[
https://issues.apache.org/jira/browse/SPARK-37348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17445988#comment-17445988
]
Tim Schwab commented on SPARK-37348:
------------------------------------
Fair enough. The reasoning for adding a function given in the comment you
linked is exactly my reasoning: it turns a runtime check into a compile-time
check. That, and a native function would integrate cleanly with the rest of
PySpark. (For example, F.expr() does not compose well in the middle of a chain
of other functions; I would have to either rewrite the whole expression inside
F.expr() or break it into several intermediate columns. Not a big deal,
obviously, but still not ideal. See the sketch below.)
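A rough sketch of those two workarounds, using a hypothetical column idx and a
placeholder modulus of 7; the last, commented-out line is the proposed native
form, which does not exist in PySpark today:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(0,), (2,), (5,)], ["idx"])  # placeholder data

# Workaround 1: push the whole expression into SQL text so pmod() is available.
wrapped = df.withColumn("wrapped", F.expr("pmod(idx - 3, 7)"))

# Workaround 2: keep the Column API, but park the intermediate result in its
# own column and refer to it by name inside F.expr().
wrapped = (df.withColumn("shifted", F.col("idx") - 3)
             .withColumn("wrapped", F.expr("pmod(shifted, 7)"))
             .drop("shifted"))

# What this ticket asks for: a function that composes like any other, e.g.
# df.withColumn("wrapped", F.pmod(F.col("idx") - 3, F.lit(7)))  # proposed only
{code}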
As for whether it is commonly used, I am not sure how to validate that one way
or the other. However, the majority of uses of the % operator in computing are
after the modulus rather than the remainder: they expect a result in [0, n),
not in (-n, n). Most of those uses also have a domain of non-negative numbers
anyway, so the two definitions coincide. But when the domain does include
negative numbers, the modulus is usually what is wanted, because % is most
often used to map a larger number onto a smaller range [0, n). The
counterexample is cryptographic code, which works fine with remainders, but I
would not expect manual implementations of cryptographic functions on RDDs or
DataFrames to be common.
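To make the range difference concrete, here is a small plain-Python sketch
(the JVM-style remainder is imitated with math.fmod, which also truncates
toward zero):

{code:python}
import math

n = 5
for x in [7, 3, 0, -3, -7]:
    remainder = int(math.fmod(x, n))  # truncated: sign follows x, like the JVM's %
    modulus = x % n                   # floored: Python's %, and Spark SQL's pmod()
    print(f"x={x:3d}  remainder={remainder:3d}  modulus={modulus}")

# For negative x the remainder is negative (-3 and -2 here), which is useless
# as an index into [0, n); the modulus keeps every result in [0, n).
{code}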
So, as far as I can see, when the domain includes negative numbers, the
modulus is usually what is desired rather than the remainder. And Spark
happens to include a very commonly used function whose range includes negative
numbers and whose output is often fed into %: hash(). That is in fact the
exact use case that brought me here; I want to map hash() outputs onto [0, n)
rather than (-n, n). For that use case alone I think pmod is worth adding to
PySpark.
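For reference, this is roughly what that looks like today via the SQL function
and F.expr(); the DataFrame, the user_id column, and the bucket count of 32
are all placeholders:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("alice",), ("bob",)], ["user_id"])  # placeholder data

# hash() returns a signed 32-bit integer, so plain % can yield negative bucket
# ids; Spark SQL's pmod() keeps every bucket id in [0, 32).
with_buckets = df.withColumn("bucket", F.expr("pmod(hash(user_id), 32)"))
with_buckets.show()

# The same expression with % alone can land anywhere in (-32, 32):
# df.withColumn("bucket", F.hash("user_id") % F.lit(32))
{code}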
In addition, Python's % operator is a modulus rather than a remainder, unlike
the JVM's, so I would expect Python users to feel the need for pmod() more
often than, say, Scala users.
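A minimal sketch of that mismatch, assuming nothing beyond a local
SparkSession:

{code:python}
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

print(-1 % 2)  # 1: Python's % is a (floored) modulus

# The same expression on Columns follows the JVM's truncated remainder:
spark.range(1).select((F.lit(-1) % F.lit(2)).alias("rem")).show()
# +---+
# |rem|
# +---+
# | -1|
# +---+
{code}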
> PySpark pmod function
> ---------------------
>
> Key: SPARK-37348
> URL: https://issues.apache.org/jira/browse/SPARK-37348
> Project: Spark
> Issue Type: Improvement
> Components: PySpark
> Affects Versions: 3.2.0
> Reporter: Tim Schwab
> Priority: Minor
>
> Because Spark is built on the JVM, F.lit(-1) % F.lit(2) returns -1 in
> PySpark. However, the modulus is often desired instead of the remainder.
>
> There is a [PMOD() function in Spark
> SQL|https://spark.apache.org/docs/latest/api/sql/#pmod], but [not in
> PySpark|https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql.html#functions].
> So at the moment, the two options for getting the modulus are to use
> F.expr("pmod(A, B)") or to create a helper function such as:
>
> {code:python}
> from pyspark.sql import functions as F
>
> def pmod(dividend, divisor):
>     # Shift negative remainders up by the divisor to land in [0, divisor).
>     remainder = dividend % divisor
>     return F.when(remainder < 0, remainder + divisor).otherwise(remainder)
> {code}
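>
> A quick sanity check of the helper (assuming an active SparkSession named
> spark):
>
> {code:python}
> spark.range(1).select(pmod(F.lit(-1), F.lit(2)).alias("m")).show()
> # +---+
> # |  m|
> # +---+
> # |  1|
> # +---+
> {code}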
>
>
> Neither is ideal; pmod should be native to PySpark, as it is in Spark SQL.