hvanhovell opened a new pull request, #39186:
URL: https://github.com/apache/spark/pull/39186

   ### What changes were proposed in this pull request?
   This PR introduces AgnosticEncoders. AgnosticEncoders describe how an 
external type maps to a Spark data type. They are agnostic in the sense that 
they do not prescribe which internal format is to be used.
   
   For example, the following class:
   ```
   case class Person(id: Long, name: String, hobbies: Seq[String])
   ```
   
   translates into the following agnostic encoder:
   ```
   ProductEncoder(Person,List(
     (id, PrimitiveLongEncoder),
     (name, StringEncoder),
     (hobbies, IterableEncoder(scala.collection.Seq,StringEncoder))))
   ```
   
   This PR integrates `AgnosticEncoders` in `ScalaReflection`, so they are 
used for all `Dataset` operations. In a follow-up we will address `RowEncoder` 
and `JavaReflection`. In the old situation we would traverse the type hierarchy 
once for each operation (`serializerForType`, `deserializerForType` & 
`schemaFor`). In the new situation we create an AgnosticEncoder first and then 
generate a serializer, deserializer, and/or `schema` from it. This saves 
significant time, especially for `ExpressionEncoder`, where we only need one 
pass through the type hierarchy instead of 2 or 3.
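
The "build one encoder, then derive everything from it" idea can be sketched as follows. This is a simplified illustration, not the code in this PR: `Enc`, `schemaOf`, and `serialize` are hypothetical names, the encoder tree is written by hand where the real code would get it from the single reflection pass, and the schema is rendered as a plain string.

```scala
object AgnosticEncoderSketch {
  // Hypothetical, simplified stand-ins for the encoders in the example above.
  sealed trait Enc
  case object PrimitiveLongEnc extends Enc
  case object StringEnc extends Enc
  final case class IterableEnc(element: Enc) extends Enc
  final case class ProductEnc(name: String, fields: List[(String, Enc)]) extends Enc

  final case class Person(id: Long, name: String, hobbies: Seq[String])

  // Stand-in for the single pass over the type hierarchy (built by hand here).
  val personEnc: Enc = ProductEnc("Person", List(
    "id" -> PrimitiveLongEnc,
    "name" -> StringEnc,
    "hobbies" -> IterableEnc(StringEnc)))

  // Derivation 1: a schema, rendered as a DDL-like string.
  def schemaOf(enc: Enc): String = enc match {
    case PrimitiveLongEnc => "BIGINT"
    case StringEnc        => "STRING"
    case IterableEnc(e)   => s"ARRAY<${schemaOf(e)}>"
    case ProductEnc(_, fields) =>
      fields
        .map { case (n, e) => s"$n: ${schemaOf(e)}" }
        .mkString("STRUCT<", ", ", ">")
  }

  // Derivation 2: a serializer into a generic nested-Vector format.
  // The encoder does not prescribe this format; another target (for
  // example Arrow) would be a different function over the same tree.
  def serialize(enc: Enc, value: Any): Any = (enc, value) match {
    case (PrimitiveLongEnc, v: Long)       => v
    case (StringEnc, v: String)            => v
    case (IterableEnc(e), vs: Iterable[_]) => vs.map(serialize(e, _)).toVector
    case (ProductEnc(_, fields), p: Product) =>
      fields.zip(p.productIterator.toList)
        .map { case ((_, e), v) => serialize(e, v) }
        .toVector
    case _ =>
      throw new IllegalArgumentException(s"$value does not match $enc")
  }
}
```

Both functions walk the same `personEnc` value, so the type itself is inspected only once no matter how many artifacts are derived from it.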
   
   ### Why are the changes needed?
   For the Spark Connect Scala Client we need encoders. We want to stay as 
close as possible to the current Dataset APIs, and encoders are part of this. 
Additionally, we would like to retain the rich type support.
   
   Encoders are currently tied to ExpressionEncoders, which we cannot use for 
a couple of reasons:
   1. Mid-term we don't want to have a dependency on Catalyst. Splitting of the 
public API that will be shared between Catalyst and the client is tracked in 
SPARK-41400.
   2. ExpressionEncoders only support the internal row format. The client will 
use Arrow instead.
   3. We are not particularly keen on sending the expressions needed by 
ExpressionEncoders over the wire. They are overpowered.
   
   So we need an alternative to the current ExpressionEncoders. This class 
needs to be serializable.
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   ### How was this patch tested?
   Existing tests.

