[
https://issues.apache.org/jira/browse/SPARK-31500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17091833#comment-17091833
]
Pablo Langa Blanco commented on SPARK-31500:
--------------------------------------------
Hi [~ewasserman],
This is an underlying Scala issue: equality between arrays does not behave as
expected, because == on an Array compares references rather than contents.
[https://blog.bruchez.name/2013/05/scala-array-comparison-without-phd.html]
I'm going to work on finding a solution. In the meantime, here is a workaround:
change the definition of the case class to use Seq instead of Array, and it
will work as expected.
{code:java}
case class R(id: String, value: String, bytes: Seq[Byte])
{code}
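To illustrate the comparison behaviour described above, here is a minimal sketch in plain Scala with no Spark dependency. The names ArrayEqualityDemo, WithArray and WithSeq are hypothetical and are not part of the issue. It only shows how Array and Seq differ under ==. It does not claim to show where inside Spark the duplicates arise.

```scala
object ArrayEqualityDemo {
  // Hypothetical case classes, used only to contrast the two field types.
  case class WithArray(bytes: Array[Byte])
  case class WithSeq(bytes: Seq[Byte])

  def main(args: Array[String]): Unit = {
    // Two distinct arrays holding the same bytes.
    val a = "cat".getBytes("UTF-8")
    val b = "cat".getBytes("UTF-8")

    // == on arrays is reference equality, so equal contents do not match.
    println(a == b)                                // false
    // sameElements compares element by element.
    println(a.sameElements(b))                     // true
    // Seq defines structural equality.
    println(a.toSeq == b.toSeq)                    // true

    // Case class equality delegates to == on each field.
    println(WithArray(a) == WithArray(b))          // false
    println(WithSeq(a.toSeq) == WithSeq(b.toSeq))  // true

    // A Set therefore keeps both arrays but merges the two Seqs.
    println(Set(a, b).size)                        // 2
    println(Set(a.toSeq, b.toSeq).size)            // 1
  }
}
```

The explicit .toSeq calls avoid relying on an implicit Array-to-Seq conversion, which newer Scala versions flag as deprecated.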
> collect_set() of BinaryType returns duplicate elements
> ------------------------------------------------------
>
> Key: SPARK-31500
> URL: https://issues.apache.org/jira/browse/SPARK-31500
> Project: Spark
> Issue Type: Bug
> Components: SQL
> Affects Versions: 2.4.4, 2.4.5
> Reporter: Eric Wasserman
> Priority: Major
>
> The collect_set() aggregate function should produce a set of distinct
> elements. When the column argument's type is BinaryType, this is not the case.
>
> Example:
> {{import org.apache.spark.sql.functions._}}
> {{import org.apache.spark.sql.expressions.Window}}
> {{case class R(id: String, value: String, bytes: Array[Byte])}}
> {{def makeR(id: String, value: String) = R(id, value, value.getBytes)}}
> {{val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"),
> makeR("b", "fish")).toDF()}}
>
> {{// In the example below "bytesSet" erroneously has duplicates but
> "stringSet" does not (as expected).}}
> {{df.agg(collect_set('value) as "stringSet", collect_set('bytes) as
> "bytesSet").show(truncate=false)}}
>
> {{// The same problem is displayed when using window functions.}}
> {{val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding,
> Window.unboundedFollowing)}}
> {{val result = df.select(}}
> {{collect_set('value).over(win) as "stringSet",}}
> {{collect_set('bytes).over(win) as "bytesSet"}}
> {{)}}
> {{.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize",
> size('bytesSet) as "bytesSetSize")}}
> {{.show()}}