hvanhovell opened a new pull request, #42476:
URL: https://github.com/apache/spark/pull/42476

   ### What changes were proposed in this pull request?
   When you try to run a streaming query from the REPL, for example:
   ```scala
   val add1 = udf((i: Long) => i + 1)
   val query = spark.readStream
       .format("rate")
       .option("rowsPerSecond", "10")
       .option("numPartitions", "1")
       .load()
       .withColumn("value", add1($"value"))
       .writeStream
       .format("memory")
       .queryName("my_sink")
       .start()
   ```
   You are currently greeted by a hard-to-understand deserialization error, 
where a serialization proxy cannot be assigned to a field. The underlying cause 
is a `ClassNotFoundException` (yes, Java serialization is weird). This 
`ClassNotFoundException` occurs because the `JobArtifactState` is not properly 
propagated to the streaming query execution thread. The `JobArtifactState` 
indirectly contains information about the location of REPL-generated classes 
and session-local libraries.
   
   This PR fixes this by propagating the `JobArtifactState` into the stream 
execution thread.
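
To illustrate the general pattern (not the actual change in this PR), here is a minimal sketch of capturing thread-local state on the calling thread and re-installing it inside a newly started thread. All names here (`ArtifactState`, `ArtifactContext`, `PropagationSketch`) are hypothetical stand-ins and are not Spark APIs; the sketch assumes the state lives in a plain, non-inheritable `ThreadLocal`.

```scala
// Hypothetical stand-in for the session state; the fields are illustrative only.
final case class ArtifactState(sessionId: String, replClassDirUri: Option[String])

object ArtifactContext {
  // A plain (non-inheritable) ThreadLocal: every new thread starts with None.
  private val current = new ThreadLocal[Option[ArtifactState]] {
    override def initialValue(): Option[ArtifactState] = None
  }

  def get: Option[ArtifactState] = current.get()

  // Install `state` for the duration of `body`, then restore the previous value.
  def withState[T](state: Option[ArtifactState])(body: => T): T = {
    val previous = current.get()
    current.set(state)
    try body finally current.set(previous)
  }
}

object PropagationSketch {
  // Runs `body` on a new thread and returns its result.
  // When `propagate` is true, the caller's state is captured on the calling
  // thread and re-installed inside the new thread before `body` runs.
  def runOnNewThread[T](propagate: Boolean)(body: => T): T = {
    val captured = ArtifactContext.get // evaluated on the caller's thread
    var result: Option[T] = None
    val thread = new Thread(() => {
      result = Some(
        if (propagate) ArtifactContext.withState(captured)(body)
        else body
      )
    })
    thread.start()
    thread.join() // join makes the write to `result` visible to the caller
    result.get
  }
}
```

With state installed on the calling thread, `PropagationSketch.runOnNewThread(propagate = false)(ArtifactContext.get)` sees no state in the new thread, while `propagate = true` sees the caller's state. The missing-propagation case is analogous to the situation described above, where the execution thread cannot locate the REPL-generated classes.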
   
   
   ### Why are the changes needed?
   It is a bug. We want streaming to work with Spark Connect's isolated dependencies.
   
   ### Does this PR introduce _any_ user-facing change?
   No.
   
   ### How was this patch tested?
   I added a test to `ReplE2ESuite`.

