WweiL commented on PR #38503:
URL: https://github.com/apache/spark/pull/38503#issuecomment-1309381602

   [From this 
comment](https://github.com/apache/spark/pull/38503#discussion_r1013665318), 
Jungtaek suggested we check not only aggregates but also all stateful 
operators, with a kind intent to reduce engineering work of reasoning all 
tricky combinations of allowed / disallowed cases.
   
   I followed his idea and disallowed all stateful ops running on Complete and 
Update mode. Then there were some test cases failing, because the error message 
is changed. To name a few:
   ```
   test: streaming plan - flatMapGroupsWithState - 
flatMapGroupsWithState(Update) on streaming relation with aggregation in Update 
mode: not supported
   exception message: flatMapGroupsWithState in update mode is not supported 
with aggregation on a streaming DataFrame/Dataset
   
   test: streaming plan - flatMapGroupsWithState - 
flatMapGroupsWithState(Update) on streaming relation with aggregation in 
Complete mode: not supported
   message: flatMapGroupsWithState in update mode is not supported with 
aggregation on a streaming DataFrame/Dataset
   
   test: streaming plan - mapGroupsWithState - mapGroupsWithState on streaming 
relation with aggregation in Update mode: not supported
   message: mapGroupsWithState is not supported with aggregation on a streaming 
DataFrame/Dataset;
   ```
   
   I sense that, from here we could say, the tricky cases Jungtaek suggested 
might have already been handled by the checker. Or put it in another way, __if 
we only disable multiple aggregates (in Update and Complete mode) but not 
multiple stateful ops in line 186 of 
[UnsupportedOperationChecker.scala](https://github.com/apache/spark/pull/38503/files/c0381591fa955042c1cf64221a6f4c909143405f..3637b807d57ff5c534386132fb9d47c7cce72705#diff-7c879c08d2f379c139d5229a88857229ae69bb48f0a138a3d64e1b2dde3502fe),
 things that failed before will still fail (in Update and Complete mode), but 
we still loose the restriction by allowing some multiple stateful ops in append 
mode__. 
   
   I raised this to Alex and Jungtaek, but we finally agree to proceed this 
direction. But then we found that it will also change existing allowed cases, 
for example 
[here](https://github.com/apache/spark/pull/38503#discussion_r1017413562) and 
[here](https://github.com/apache/spark/pull/38503#discussion_r1017447511). If 
this is pushed, currently running pipelines will fail.
   
   We could definitely then change the code to specifically allow these cases, 
but I sense that, for least engineering effort, it's enough to just revert the 
change from disallowing multiple stateful operators to just multiple aggregates 
(in Update and Complete mode). Given that the tricky cases will be handled 
somewhere else in the checker.
   
   I believe only disabling multiple aggregates (in Update and Complete mode) 
looses the restriction in a proper way. As we will still loose the restriction 
by allow cases that were disallowed before. (Before all multiple agg is 
disallowed, but now with append mode it is allowed), and all tests that were 
passing before still passes. 


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to