[jira] [Commented] (SPARK-29419) Seq.toDS / spark.createDataset(Seq) is not thread-safe

Dongjoon Hyun (Jira) Sun, 12 Jan 2020 21:34:28 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-29419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17014032#comment-17014032
 ]


Dongjoon Hyun commented on SPARK-29419:
---------------------------------------

Hi, [~joshrosen]. If this is `correctness`, is this `Bug` instead of 
`Improvement`?

> Seq.toDS / spark.createDataset(Seq) is not thread-safe
> ------------------------------------------------------
>
>                 Key: SPARK-29419
>                 URL: https://issues.apache.org/jira/browse/SPARK-29419
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 2.4.0, 3.0.0
>            Reporter: Josh Rosen
>            Assignee: Josh Rosen
>            Priority: Major
>              Labels: correctness
>
> The {{Seq.toDS}} and {{spark.createDataset(Seq)}} code is not thread-safe: if 
> the caller-supplied {{Encoder}} is used in multiple threads then 
> {{createDataset}}'s usage of the encoder may lead to incorrect answers 
> because the Encoder's internal mutable state will be updated by from multiple 
> threads.
> Here is an example demonstrating the problem:
> {code:java}
> import org.apache.spark.sql._
> val enc = implicitly[Encoder[(Int, Int)]]
> val datasets = (1 to 100).par.map { _ =>
>   val pairs = (1 to 100).map(x => (x, x))
>   spark.createDataset(pairs)(enc)
> }
> datasets.reduce(_ union _).collect().foreach {
>   pair => require(pair._1 == pair._2, s"Pair elements are mismatched: $pair")
> }{code}
> Due to the thread-safety issue, the above example results in the creation of 
> corrupted records where different input records' fields are co-mingled.
> This bug is similar to SPARK-22355, a related problem in 
> {{Dataset.collect()}} (fixed in Spark 2.2.1+).
> Fortunately, this has a simple one-line fix (copy the encoder); I'll submit a 
> patch for this shortly.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Commented] (SPARK-29419) Seq.toDS / spark.createDataset(Seq) is not thread-safe

Reply via email to