ahshahid commented on code in PR #48252:
URL: https://github.com/apache/spark/pull/48252#discussion_r1844719865
##########
sql/api/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala:
##########
@@ -148,34 +163,180 @@ object JavaTypeInference {
// TODO: we should only collect properties that have getter and setter. However, some tests
// pass in scala case class as java bean class which doesn't have getter and setter.
val properties = getJavaBeanReadableProperties(c)
- // add type variables from inheritance hierarchy of the class
- val classTV = JavaTypeUtils.getTypeArguments(c, classOf[Object]).asScala.toMap ++
- typeVariables
- // Note that the fields are ordered by name.
- val fields = properties.map { property =>
- val readMethod = property.getReadMethod
- val encoder = encoderFor(readMethod.getGenericReturnType, seenTypeSet + c, classTV)
- // The existence of `javax.annotation.Nonnull`, means this field is not nullable.
- val hasNonNull = readMethod.isAnnotationPresent(classOf[Nonnull])
- EncoderField(
- property.getName,
- encoder,
- encoder.nullable && !hasNonNull,
- Metadata.empty,
- Option(readMethod.getName),
- Option(property.getWriteMethod).map(_.getName))
+
+ // if the properties is empty and this is not a top level enclosing class, then we
+ // should not consider class as bean, as otherwise it will be treated as empty schema
+ // and loose the data on deser.
Review Comment:
Let's say the top-level class for which the encoder is being created has a
field x that is a POJO with no bean-style getters.
The schema corresponding to field x is then empty. So when the Dataset for the
top-level class is converted to a DataFrame, x has no representation in the
Row object.
When that DataFrame is converted back to a Dataset, field x : POJO is set to
null and the data is lost.
When we started, x was NOT NULL. It became null only because its schema was
empty.
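
To make the empty-schema case concrete, here is a minimal sketch using plain `java.beans` introspection. It is not code from this PR: `Holder`, `Point`, and `readableProperties` are hypothetical names, and the helper only approximates what a bean-based encoder would see as readable properties.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class EmptyBeanSchemaDemo {
    // Hypothetical POJO that carries state but exposes no bean-style getters.
    public static class Point {
        private final int px;
        private final int py;
        public Point(int px, int py) { this.px = px; this.py = py; }
    }

    // Hypothetical top-level bean whose field x is the getter-less POJO.
    public static class Holder {
        private Point x;
        public Point getX() { return x; }
        public void setX(Point x) { this.x = x; }
    }

    // Names of properties with a read method, ignoring Object's getClass().
    public static List<String> readableProperties(Class<?> c) {
        List<String> names = new ArrayList<>();
        try {
            PropertyDescriptor[] pds =
                Introspector.getBeanInfo(c, Object.class).getPropertyDescriptors();
            for (PropertyDescriptor pd : pds) {
                if (pd.getReadMethod() != null) {
                    names.add(pd.getName());
                }
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(readableProperties(Holder.class)); // [x]
        System.out.println(readableProperties(Point.class));  // []
    }
}
```

`Holder` exposes `x`, but `Point` contributes no properties at all, which is the situation where a struct derived from `Point` would have no fields to carry its data.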
To handle that case, a POJO without getters should be represented as
BinaryType, so that when the DataFrame is converted back, field x gets the
deserialized POJO.
The reason this is not done for the top-level class is that existing tests
assert that if the top-level class has no getters, the schema should be
empty, implying 0 rows and no schema.
Whether that is desirable, or whether it should also be represented as a
binary type, is debatable, since in either case no meaningful SQL operation
can be done on binary data.
So the distinction is made using the boolean: a top-level class with no
getters needs to be treated differently from a field with no getters.
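
The rule described above can be sketched roughly as follows. This is an illustration only, not the PR's implementation: `Kind`, `classify`, and the `isTopLevel` flag are hypothetical stand-ins, and the labels are not Spark types.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

public class BeanKindSketch {
    // Hypothetical labels for how a class would be encoded.
    public enum Kind { STRUCT, EMPTY_STRUCT, BINARY }

    // Hypothetical class with state but no getters.
    public static class NoGetters {
        int hidden = 42;
    }

    // Hypothetical class with one bean-style property.
    public static class WithGetter {
        private int id;
        public int getId() { return id; }
        public void setId(int id) { this.id = id; }
    }

    // True if the class has at least one readable bean property.
    public static boolean hasReadableProperty(Class<?> c) {
        try {
            PropertyDescriptor[] pds =
                Introspector.getBeanInfo(c, Object.class).getPropertyDescriptors();
            for (PropertyDescriptor pd : pds) {
                if (pd.getReadMethod() != null) {
                    return true;
                }
            }
            return false;
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
    }

    // A getter-less top-level class keeps the existing empty-schema behaviour;
    // a getter-less nested field falls back to an opaque binary representation.
    public static Kind classify(Class<?> c, boolean isTopLevel) {
        if (hasReadableProperty(c)) {
            return Kind.STRUCT;
        }
        return isTopLevel ? Kind.EMPTY_STRUCT : Kind.BINARY;
    }

    public static void main(String[] args) {
        System.out.println(classify(WithGetter.class, true));  // STRUCT
        System.out.println(classify(NoGetters.class, true));   // EMPTY_STRUCT
        System.out.println(classify(NoGetters.class, false));  // BINARY
    }
}
```

Only the getter-less case depends on the flag, which is why a single boolean is enough to separate the two behaviours.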