ahshahid opened a new pull request, #48252:
URL: https://github.com/apache/spark/pull/48252

   ### What changes were proposed in this pull request?
   If a bean exposes getters/setters whose types are generic with bounds (e.g. `T <: SomeClass`), an appropriate encoder is now created depending on the nature of the bound: java.io.Serializable, KryoSerializable, or a UDT type. For the first two, the data is represented as BinaryType; if the bound is a UDT type, the schema and behaviour follow that UDT.
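
   As a minimal sketch of the kind of bean this change targets (class and field names here are hypothetical, not taken from the PR), a generic type parameter bounded by `Serializable` is used as a bean property:

   ```java
   import java.io.Serializable;

   // Hypothetical bean: the property type is a bounded type variable, not a
   // concrete class. With this change, such a property is encoded as
   // BinaryType via Java serialization instead of failing with
   // EncoderNotFoundException.
   public class GenericBean<T extends Serializable> {
       private T value;

       public T getValue() { return value; }
       public void setValue(T value) { this.value = value; }

       public static void main(String[] args) {
           GenericBean<Integer> b = new GenericBean<>();
           b.setValue(42);
           System.out.println(b.getValue()); // prints 42
       }
   }
   ```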
   
   The following points are considered in the fix:

   - Since the concrete class of the generic parameter is not available, it is not possible to create an instance of the class during deserialization if the bound represents any type other than the three mentioned above.
   - Because a UDT class can appear anywhere in the bound's hierarchy, all super classes of the bound (including the bound itself) are checked. When creating the encoder, a UDT match is preferred, followed by the Java or Kryo serializer, whichever appears first.
   - The majority of the code change in JavaTypeInference is a boolean check that ignores any data-type match when considering a bound, except for UDTs and type variables (type variables are included because of chains like T <: S where, say, S <: Serializable).
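
   The hierarchy walk and preference order described above can be sketched roughly as follows (all names are illustrative, the real logic lives in JavaTypeInference; Kryo is omitted for brevity):

   ```java
   import java.io.Serializable;
   import java.util.function.Predicate;

   public class BoundResolution {
       enum BoundKind { UDT, JAVA_SERIALIZABLE, UNSUPPORTED }

       // Walk the bound's class hierarchy looking for a UDT match first,
       // then fall back to Java serialization if the bound is Serializable.
       static BoundKind resolveBound(Class<?> bound, Predicate<Class<?>> isUdt) {
           for (Class<?> c = bound; c != null; c = c.getSuperclass()) {
               // A UDT match only helps if the instance the UDT produces
               // (of type c) is assignable to the bound; for a strict
               // superclass it is not, so such matches fall through to
               // Java serialization (the boundary case described below).
               if (isUdt.test(c) && bound.isAssignableFrom(c)) {
                   return BoundKind.UDT;
               }
           }
           if (Serializable.class.isAssignableFrom(bound)) {
               return BoundKind.JAVA_SERIALIZABLE;
           }
           return BoundKind.UNSUPPORTED;
       }

       public static void main(String[] args) {
           Predicate<Class<?>> isUdt = c -> false; // no UDTs in this sketch
           System.out.println(resolveBound(Integer.class, isUdt)); // JAVA_SERIALIZABLE
           System.out.println(resolveBound(Thread.class, isUdt));  // UNSUPPORTED
       }
   }
   ```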
   The following boundary cases are handled:
   
   If the generic bean is of the form

   ```scala
   class Bean[T <: UDTClass] {
     @BeanProperty var udt: T = _
   }
   ```

   then a UDT encoder is created for the field.

   But if the bean is of the form

   ```scala
   class Bean[T <: UDTDerivedClass] {
     @BeanProperty var udt: T = _
   }
   ```

   where `UDTDerivedClass <: UDTClass`, then a Java-serialization encoder is created, even though the class hierarchy of `UDTDerivedClass` contains `UDTClass`. The reason is that the concrete instance created by the UDT would be of `UDTClass`, which is not assignable to `UDTDerivedClass`.
   
   Similarly, a non-generic bean class having `UDTDerivedClass` as a bean property will also use the Java-serialization encoder (a test is added for this). The reason is the same as in the generic case.
   
   
   ### Why are the changes needed?
   To fix a regression present since Spark 3.3, where a bean having a generic type as a return value throws `EncoderNotFoundException`.
   
   
   ### Does this PR introduce _any_ user-facing change?
   No
   
   
   ### How was this patch tested?
   Added tests for the bug.
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

