Hi devs,

A couple of issues reported on the user@ mailing list turn out to be caused by an inconsistent schema in Encoders.bean:
1. "Typed datataset from Avro generated classes?" [1]
2. "spark structured streaming GroupState returns weird values from sate" [2]

Below is the part of JavaTypeInference.inferDataType() which handles beans:

https://github.com/apache/spark/blob/f72220b8ab256e8e6532205a4ce51d50b69c26e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L139-L157

It collects properties based on the availability of a getter. (The same rule applies to `SQLContext.beansToRows`.)

JavaTypeInference.serializerFor() and JavaTypeInference.deserializerFor() do not follow that rule. They collect properties based on the availability of both a getter and a setter. (They also call JavaTypeInference.inferDataType() internally, so the result is inconsistent even when only these methods are called.)

This inconsistency produces runtime issues when a Java bean has only a getter for some properties. It happens even when there is no field behind the getter method, because getter/setter methods are identified purely by naming convention.

I feel this is something we should fix, but I would like to see opinions on how to fix it. If a user query has the problematic beans but hasn't hit the issue so far, fixing it would drop some columns, which would be backward incompatible. I think this is still the way to go, but if we care more about not breaking existing queries, we may want to at least document the ideal form of the bean Spark expects.

I would like to hear opinions on this.

Thanks,
Jungtaek Lim (HeartSaVioR)

1. https://lists.apache.org/thread.html/r8f8e680e02955cdf05b4dd34c60a9868288fd10a03f1b1b8627f3d84%40%3Cuser.spark.apache.org%3E
2. http://mail-archives.apache.org/mod_mbox/spark-user/202003.mbox/%3ccafx8l21dzbyv5m1qozs3y+pcmycwbtjko6ytwvkydztq7u4...@mail.gmail.com%3e
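
P.S. To make the mismatch concrete, here is a minimal sketch in plain Java. It does not call Spark at all. It only uses java.beans.Introspector to compare the two selection rules described above: "has a getter" versus "has both a getter and a setter". The Person bean, the class name and the helper method names are hypothetical, and the filters only approximate what JavaTypeInference does.

```java
import java.beans.IntrospectionException;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;
import java.util.ArrayList;
import java.util.List;

public class BeanPropertyMismatch {

    // Hypothetical bean: "name" has a getter and a setter, while "upperName"
    // has only a getter and no backing field at all.
    public static class Person {
        private String name;

        public String getName() { return name; }
        public void setName(String name) { this.name = name; }

        // Picked up as a bean property purely because of its name.
        public String getUpperName() {
            return name == null ? null : name.toUpperCase();
        }
    }

    // Rule A (assumed to resemble inferDataType()): a getter is enough.
    public static List<String> readable(Class<?> cls) {
        return select(cls, false);
    }

    // Rule B (assumed to resemble serializerFor()/deserializerFor()):
    // both a getter and a setter are required.
    public static List<String> readableAndWritable(Class<?> cls) {
        return select(cls, true);
    }

    private static List<String> select(Class<?> cls, boolean requireSetter) {
        List<String> names = new ArrayList<>();
        try {
            for (PropertyDescriptor pd : Introspector.getBeanInfo(cls).getPropertyDescriptors()) {
                // Skip the "class" property contributed by Object.getClass().
                if ("class".equals(pd.getName())) continue;
                if (pd.getReadMethod() == null) continue;
                if (requireSetter && pd.getWriteMethod() == null) continue;
                names.add(pd.getName());
            }
        } catch (IntrospectionException e) {
            throw new IllegalStateException(e);
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println("getter only:       " + readable(Person.class));
        System.out.println("getter and setter: " + readableAndWritable(Person.class));
    }
}
```

With this bean, "upperName" is selected by the getter-only rule but not by the getter-and-setter rule, so a schema built with one rule and a serializer or deserializer built with the other disagree on the set of columns.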