Hi devs,

A couple of issues reported on the user@ mailing list turn out to be caused
by an inconsistent schema in Encoders.bean.

1. Typed datataset from Avro generated classes? [1]
2. spark structured streaming GroupState returns weird values from sate [2]

Below is a part of JavaTypeInference.inferDataType() which handles beans:

https://github.com/apache/spark/blob/f72220b8ab256e8e6532205a4ce51d50b69c26e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L139-L157

It collects properties based on the availability of a getter.

(The same logic is applied in `SQLContext.beansToRows`.)

JavaTypeInference.serializerFor() and JavaTypeInference.deserializerFor()
do not. They collect properties based on the availability of both a getter
and a setter.
(Each of them also calls JavaTypeInference.inferDataType() internally, so the
result is inconsistent even when only these methods are called.)

This inconsistency produces runtime issues when a Java bean has only a
getter for some property, even when there is no field behind that getter,
because getter/setter methods are identified by naming convention.
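
To illustrate the mismatch outside of Spark, here is a minimal sketch using
plain java.beans.Introspector. It is not the JavaTypeInference code; the
Person bean and the helper names (readable, readableAndWritable) are made up
for illustration, and the only assumption is that properties are discovered
by the usual getX/setX naming convention.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class BeanPropertyMismatch {

    // Hypothetical bean: "name" has a field, a getter and a setter;
    // "displayName" has only a getter and no backing field.
    public static class Person {
        private String name;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }

        public String getDisplayName() { return "Person(" + name + ")"; }
    }

    // Properties selected when only a getter is required.
    public static List<String> readable(Class<?> beanClass) {
        return collect(beanClass, false);
    }

    // Properties selected when both a getter and a setter are required.
    public static List<String> readableAndWritable(Class<?> beanClass) {
        return collect(beanClass, true);
    }

    private static List<String> collect(Class<?> beanClass, boolean requireSetter) {
        List<String> names = new ArrayList<>();
        try {
            for (PropertyDescriptor pd
                    : Introspector.getBeanInfo(beanClass).getPropertyDescriptors()) {
                // Skip the "class" pseudo-property that comes from Object.getClass().
                if ("class".equals(pd.getName()) || pd.getReadMethod() == null) {
                    continue;
                }
                if (requireSetter && pd.getWriteMethod() == null) {
                    continue;
                }
                names.add(pd.getName());
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("getter only:       " + readable(Person.class));
        System.out.println("getter and setter: " + readableAndWritable(Person.class));
    }
}
```

For this hypothetical bean, displayName appears in the first list but not in
the second. A schema built from the first list and a (de)serializer built
from the second would disagree on the set of columns.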

I feel this is something we should fix, but I would like to gather opinions
on how to fix it. If a user query uses such problematic beans but hasn't hit
the issue, fixing it would drop some columns, which would be backward
incompatible. I think this is still the way to go, but if we care more about
not breaking existing queries, we may want to at least document the ideal
form of the bean Spark expects.

Would like to hear opinions on this.

Thanks,
Jungtaek Lim (HeartSaVioR)

[1]
https://lists.apache.org/thread.html/r8f8e680e02955cdf05b4dd34c60a9868288fd10a03f1b1b8627f3d84%40%3Cuser.spark.apache.org%3E
[2]
http://mail-archives.apache.org/mod_mbox/spark-user/202003.mbox/%3ccafx8l21dzbyv5m1qozs3y+pcmycwbtjko6ytwvkydztq7u4...@mail.gmail.com%3e
