Shivaram Venkataraman created SPARK-6822:
--------------------------------------------
Summary: lapplyPartition passes empty list to function
Key: SPARK-6822
URL: https://issues.apache.org/jira/browse/SPARK-6822
Project: Spark
Issue Type: Bug
Components: SparkR
Affects Versions: 1.4.0
Reporter: Shivaram Venkataraman
I have an RDD containing two elements, as confirmed by a collect. When I call
lapplyPartition on it with a function that prints its arguments to stderr, the
function gets called three times: the first two calls receive the expected
arguments, and the third receives an empty list. I was wondering whether that's
a bug or whether there are conditions under which that's possible. I apologize
that I don't have a simple test case ready yet; I ran into this potential bug
while developing a separate package, plyrmr. If you are willing to install it,
the test case is very simple. The RDD that triggers this problem is the result
of a join, but I couldn't replicate the problem with a plain vanilla join.
Example from Antonio on the SparkR JIRA: I don't have time to try any harder to
reproduce this without plyrmr. For the record, this is the example:
{code}
library(plyrmr)
plyrmr.options(backend = "spark")
df1 = mtcars[1:4,]
df2 = mtcars[3:6,]
w = as.data.frame(gapply(merge(input(df1), input(df2)), identity))
{code}
Here gapply is implemented with lapplyPartition in most cases, merge with a
join, and as.data.frame with a collect. The join uses an arbitrary argument of
4 partitions; if I turn that down to 2L, the problem disappears. I will check
in a version with a workaround in place, but a debugging statement will leave a
record in stderr whenever the workaround kicks in, so that we can track it.
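
One plausible explanation (an assumption on my part, not confirmed in the report): with 2 elements spread over 4 partitions, some partitions are necessarily empty, and lapplyPartition still invokes the user function once per partition, so the function sees an empty list for each empty partition. A minimal sketch of that behavior, assuming the SparkR 1.x RDD API (sparkR.init, parallelize, lapplyPartition, collect):

{code}
library(SparkR)
sc <- sparkR.init(master = "local")

# Two elements spread across four partitions: at least two partitions are empty.
rdd <- parallelize(sc, list(1, 2), 4L)

# The function runs once per partition; for an empty partition its
# argument is an empty list, which a caller-side workaround must guard against.
parts <- lapplyPartition(rdd, function(part) {
  if (length(part) == 0) {
    # hypothetical guard: skip empty partitions instead of processing them
    return(list())
  }
  part
})
collect(parts)
{code}

If this is indeed the cause, the length-zero guard above is the kind of workaround the reporter describes checking in, with the debugging statement emitted whenever the guard fires.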
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]