The first case of the user reports is obvious - according to the report, the
Avro-generated code contains a getter that refers back to the class itself,
which Spark disallows (it throws an exception). But the class doesn't have a
matching setter method (if I understand correctly), so technically it
shouldn't matter.
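To illustrate the mechanism with a minimal sketch (a hypothetical bean, not the actual Avro-generated code): java.beans.Introspector reports a getter-only, self-referential property as readable with no writer, so property collection based on getters alone picks it up - and can then trip over the cycle - while collection requiring both getter and setter would skip it.

```java
import java.beans.BeanInfo;
import java.beans.Introspector;
import java.beans.PropertyDescriptor;

// Hypothetical stand-in for an Avro-generated class: it exposes a getter
// whose return type refers back to the class itself, with no matching setter.
class AvroLikeBean {
    private int id;
    public int getId() { return id; }
    public void setId(int id) { this.id = id; }
    // Getter-only, self-referential property: no corresponding setter exists.
    public AvroLikeBean getSelf() { return this; }
}

public class GetterOnlyDemo {
    public static void main(String[] args) throws Exception {
        BeanInfo info = Introspector.getBeanInfo(AvroLikeBean.class, Object.class);
        for (PropertyDescriptor pd : info.getPropertyDescriptors()) {
            // "self" shows up with a read method but a null write method, so
            // getter-based property collection sees it while collection that
            // requires both getter and setter would not.
            System.out.println(pd.getName()
                + " readable=" + (pd.getReadMethod() != null)
                + " writable=" + (pd.getWriteMethod() != null));
        }
    }
}
```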

For the second case of the user reports, I've reproduced the issue with my
own code. Please refer to the gist:
https://gist.github.com/HeartSaVioR/fab85734b5be85198c48f45004c8e0ca

This code aggregates the max of the values per key, where the key is in the
range of 0 ~ 9.
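For reference, here is a plain-Java sketch of the aggregation semantics (not the gist itself; it assumes the input values run from 0 to 10009 and are keyed by value % 10):

```java
import java.util.Map;
import java.util.TreeMap;

// Plain-Java model of the expected aggregation: track the max value per key.
public class MaxPerKeyDemo {
    public static void main(String[] args) {
        Map<Integer, Long> maxByKey = new TreeMap<>();
        for (long value = 0; value <= 10009; value++) {
            int key = (int) (value % 10);            // assumed keying scheme
            maxByKey.merge(key, value, Math::max);
        }
        System.out.println(maxByKey);
        // {0=10000, 1=10001, 2=10002, ..., 9=10009}
    }
}
```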

We're expecting the result of the execution to be (0, 10000), (1, 10001), ...,
(9, 10009), but the actual result is incorrect, as shown below:

-------------------------------------------
Batch: 0
-------------------------------------------
+---+--------+
|key|maxValue|
+---+--------+
+---+--------+

-------------------------------------------
Batch: 1
-------------------------------------------
+---+--------+
|key|maxValue|
+---+--------+
|  0|   18990|
|  7|   18997|
|  6|   18996|
|  9|   18999|
|  5|   18995|
|  1|   18991|
|  3|   18993|
|  8|   18998|
|  2|   18992|
|  4|   18994|
+---+--------+

-------------------------------------------
Batch: 2
-------------------------------------------
+-----+------------+
|  key|    maxValue|
+-----+------------+
|18990|       30990|
|18997|540502118145|
|18996|249574852617|
|18999|146327314953|
|18995|243603134985|
|18991|476309451025|
|18993|287916490001|
|18998|324427845137|
|18992|412640801297|
|18994|302012976401|
+-----+------------+
...

This can happen with such inconsistent schemas because State in Structured
Streaming doesn't check the schema (neither names nor types are checked) and
simply applies the raw values in column order.
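As a toy illustration of this failure mode (hypothetical column layouts, not Spark's actual state encoding): if the writer's bean-derived schema ordered the columns as [maxValue, key] while the reader expects [key, maxValue], position-based restoration silently leaks the old max into the key column, which is consistent with the Batch 2 output above.

```java
// Toy model of unchecked, position-based state restoration.
// Hypothetical layouts; Spark's real state format differs.
public class SchemaMismatchDemo {
    public static void main(String[] args) {
        // Values persisted under the writer's column order: [maxValue, key]
        Object[] stored = {18990L, 0};

        // The reader assumes [key, maxValue] but applies the values purely
        // by position, with no name or type check.
        Object key = stored[0];
        Object maxValue = stored[1];
        System.out.println("key=" + key + " maxValue=" + maxValue);
        // key=18990 maxValue=0
    }
}
```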

On Fri, May 8, 2020 at 5:50 PM Wenchen Fan <cloud0...@gmail.com> wrote:

> Can you give some simple examples to demonstrate the problem? I think the
> inconsistency would bring problems but don't know how.
>
> On Fri, May 8, 2020 at 3:49 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
> wrote:
>
>> (bump to expose the discussion to more readers)
>>
>> On Mon, May 4, 2020 at 4:57 PM Jungtaek Lim <kabhwan.opensou...@gmail.com>
>> wrote:
>>
>>> Hi devs,
>>>
>>> There are a couple of issues reported on the user@ mailing list which
>>> result from the inconsistent schema handling in Encoders.bean.
>>>
>>> 1. Typed datataset from Avro generated classes? [1]
>>> 2. spark structured streaming GroupState returns weird values from sate
>>> [2]
>>>
>>> Below is a part of JavaTypeInference.inferDataType() which handles beans:
>>>
>>>
>>> https://github.com/apache/spark/blob/f72220b8ab256e8e6532205a4ce51d50b69c26e9/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/JavaTypeInference.scala#L139-L157
>>>
>>> it collects properties based on the availability of a getter.
>>>
>>> (The same logic is applied in `SQLContext.beansToRows`.)
>>>
>>> JavaTypeInference.serializerFor() and
>>> JavaTypeInference.deserializerFor(), however, collect properties based on
>>> the availability of both a getter and a setter.
>>> (They call JavaTypeInference.inferDataType() internally, so the
>>> inconsistency arises even when only these methods are called.)
>>>
>>> This inconsistency produces runtime issues when a Java bean only has a
>>> getter for some fields - even when there is no actual field backing the
>>> getter method, since getter/setter methods are determined by naming
>>> convention.
>>>
>>> I feel this is something we should fix, but I would like to hear opinions
>>> on how to fix it. If a user query contains the problematic beans but
>>> hasn't encountered such an issue yet, fixing this would drop some columns,
>>> which would be backward incompatible. I think this is still the way to go,
>>> but if we are more concerned about not breaking existing queries, we may
>>> want to at least document the ideal form of bean Spark expects.
>>>
>>> Would like to hear opinions on this.
>>>
>>> Thanks,
>>> Jungtaek Lim (HeartSaVioR)
>>>
>>> 1.
>>> https://lists.apache.org/thread.html/r8f8e680e02955cdf05b4dd34c60a9868288fd10a03f1b1b8627f3d84%40%3Cuser.spark.apache.org%3E
>>> 2.
>>> http://mail-archives.apache.org/mod_mbox/spark-user/202003.mbox/%3ccafx8l21dzbyv5m1qozs3y+pcmycwbtjko6ytwvkydztq7u4...@mail.gmail.com%3e
>>>
>>
