Re: [PR] [SPARK-49789][SQL] Handling of generic parameter with bounds while creating encoders [spark]

via GitHub Fri, 15 Nov 2024 16:16:32 -0800


ahshahid commented on code in PR #48252:
URL: https://github.com/apache/spark/pull/48252#discussion_r1844719865



##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala:
##########
@@ -148,34 +163,180 @@ object JavaTypeInference {
       // TODO: we should only collect properties that have getter and setter. However, some tests
       //   pass in scala case class as java bean class which doesn't have getter and setter.
       val properties = getJavaBeanReadableProperties(c)
-      // add type variables from inheritance hierarchy of the class
-      val classTV = JavaTypeUtils.getTypeArguments(c, classOf[Object]).asScala.toMap ++
-        typeVariables
-      // Note that the fields are ordered by name.
-      val fields = properties.map { property =>
-        val readMethod = property.getReadMethod
-        val encoder = encoderFor(readMethod.getGenericReturnType, seenTypeSet + c, classTV)
-        // The existence of `javax.annotation.Nonnull`, means this field is not nullable.
-        val hasNonNull = readMethod.isAnnotationPresent(classOf[Nonnull])
-        EncoderField(
-          property.getName,
-          encoder,
-          encoder.nullable && !hasNonNull,
-          Metadata.empty,
-          Option(readMethod.getName),
-          Option(property.getWriteMethod).map(_.getName))
+
+      // if the properties is empty and this is not a top level enclosing class, then we
+      // should not consider class as bean, as otherwise it will be treated as empty schema
+      // and loose the data on deser.

Review Comment:
   Say the top-level class for which the encoder is being created has a field x
   that is a POJO with no bean-style getters. The schema for field x is then
   empty, so when the Dataset of the top-level class is converted to a
   DataFrame, x has no representation in the Row object.
   When that DataFrame is converted back to a Dataset, field x is set to null
   and the data is lost: x was NOT NULL when we started, and it became null
   only because its schema was empty.
   To handle that case, a POJO without getters should be represented as
   BinaryType, so that when the DataFrame is converted back, field x is
   deserialized to the original POJO.
   This is not done for the top-level class because existing tests assert that
   a top-level class with no getters yields an empty schema, implying 0 rows
   and no schema.
   Whether that is desirable, or whether it should also be represented as a
   binary type, is debatable, since no meaningful SQL operation can be done on
   binary data in any case.
   So the boolean makes the distinction: a top-level class with no getters
   needs to be treated differently from a field whose type has no getters.
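To make the case above concrete, here is a minimal standalone sketch. It is not the code in this PR: it calls `java.beans.Introspector` directly rather than Spark's `getJavaBeanReadableProperties`, and the names `EmptyBeanSketch`, `Opaque`, `Outer`, `readableProperties` and `representationFor` are all hypothetical. It also assumes the getter-less POJO is `Serializable` and uses plain Java serialization as a stand-in for whatever serializer would back the BinaryType column.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

final class EmptyBeanSketch {

    /** Hypothetical POJO: carries state but exposes no bean-style getter. */
    public static class Opaque implements Serializable {
        private static final long serialVersionUID = 1L;
        private final int value;
        public Opaque(int value) { this.value = value; }
        public int peek() { return value; } // no "get" prefix, so not a bean property
    }

    /** Hypothetical top-level bean whose property x is an Opaque. */
    public static class Outer {
        private Opaque x;
        public Opaque getX() { return x; }
        public void setX(Opaque x) { this.x = x; }
    }

    /** Names of the readable bean properties of c, ignoring those inherited from Object. */
    static List<String> readableProperties(Class<?> c) {
        try {
            List<String> names = new ArrayList<>();
            for (PropertyDescriptor p
                    : Introspector.getBeanInfo(c, Object.class).getPropertyDescriptors()) {
                if (p.getReadMethod() != null) {
                    names.add(p.getName());
                }
            }
            return names;
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Hypothetical rule: only a non-top-level type with no getters falls back to binary. */
    static String representationFor(Class<?> c, boolean topLevel) {
        boolean noGetters = readableProperties(c).isEmpty();
        return (noGetters && !topLevel) ? "binary" : "struct";
    }

    /** Serializes the value to bytes (assumption: plain Java serialization). */
    static byte[] toBytes(Serializable value) {
        try (ByteArrayOutputStream buf = new ByteArrayOutputStream();
             ObjectOutputStream out = new ObjectOutputStream(buf)) {
            out.writeObject(value);
            out.flush();
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Restores a value previously written by toBytes. */
    static Object fromBytes(byte[] bytes) {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return in.readObject();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Opaque properties: " + readableProperties(Opaque.class));
        System.out.println("Outer properties:  " + readableProperties(Outer.class));
        System.out.println("Opaque as field:   " + representationFor(Opaque.class, false));
        System.out.println("Opaque as top:     " + representationFor(Opaque.class, true));
        Opaque restored = (Opaque) fromBytes(toBytes(new Opaque(7)));
        System.out.println("Round-tripped value: " + restored.peek());
    }
}
```

In this sketch `Opaque` exposes no readable properties, so a struct built from it would have nothing to carry, whereas the byte-array round trip preserves its state. The `topLevel` flag mirrors the boolean described above: the same getter-less class keeps the empty-struct behaviour when it is the top-level class and falls back to binary only when it appears as a field.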
   
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

