JoshRosen edited a comment on issue #26076: [SPARK-29419][SQL] Fix Encoder 
thread-safety bug in createDataset(Seq)
URL: https://github.com/apache/spark/pull/26076#issuecomment-540211354
 
 
   > But if we would like to fix all these problems, all public APIs accepting 
`Encoder` will need the copy.
   
   I think that most existing uses of Encoders are de-facto thread-safe because 
either (a) the use occurs inside of a Spark task and task gets its own fresh 
copy of the Encoder when the `Task` is deserialized or (b) the use occurs on 
the driver but the code calls call `resolveAndBind` (which internally performs 
a `copy`) prior to using the Encoder.
   
   Given this, I suspect that this might be the only non-thread-safe Encoder 
usage in Spark (excluding code which is only used in Spark's unit tests). I 
don't think that we need to introduce similar copying in other public APIs.
   
   > I did some research about this and found some noticeable performance 
regression in our internal benchmark.
   
   What do you think about improving the performance / reducing the cost of 
`.copy()` by refactoring the `ExpressionEncoder` class such that (a) all of the 
immutable `vals` become fields of the case class, (b) the current constructor 
becomes a `.apply()` on the companion object and the case class constructor 
becomes `private`, and (c) `resolveAndBind` calls the companion object 
constructor instead of `copy()`? Given this, I think `copy()` could be _really_ 
cheap, effectively giving us a fresh copy of the internal mutable state but 
copying all other immutable attributes without performing any re-resolution, 
analysis, attribute binding, etc.
   
   If we do that, we'd be able to defensively copy at _very_ low cost (e.g. one 
object allocation) and then could copy-by-default and free users from having to 
worry about thread-safety.
   
   I think that's a potentially huge win from a developer productivity 
point-of-view: the cost / toil of having to worry about thread-unsafe code is a 
tax placed on end users and creates a developer education / training burden, so 
I think it's worth attempting to eliminate this entire class of pitfall.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org

Reply via email to