HeartSaVioR edited a comment on pull request #31700: URL: https://github.com/apache/spark/pull/31700#issuecomment-791149996
Actually, that's one of the few advantages micro-batch has over record-to-record processing, and we already leverage it in some public APIs. For example, flatMapGroupsWithState "sorts" the inputs within each micro-batch so that values from the same group can be served to the user function sequentially, wrapped in an iterator. Imagine how that could be done without sorting.

That said, I'm supportive of the concept of ordering, but only for micro-batch. Dealing with sort in continuous mode is quite tricky. Due to the nature of record-to-record processing, a sort has to buffer inputs in state or somewhere in memory until the epoch has finished (we could keep the state or buffer sorted as records arrive, though), and downstream operations can only continue their work after that, which contradicts the fact that the epoch has finished.

My 2 cents on continuous mode is that we'd be better off admitting the architectural differences between batch-oriented and streaming-oriented execution, and trying to find some safe approach to isolate the two. Naturally integrating the two sounds very hard to achieve, and it has even been acting as a roadblock to improving functionality in micro-batch mode.
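
To make the contrast concrete, here is a minimal plain-Python sketch (not Spark code, and not how Spark implements either mode). The function names, the `epoch_marker` sentinel, and the `(key, value)` record shape are all hypothetical, chosen only for illustration. It shows that a bounded micro-batch can simply be sorted by key and handed out group by group, while a record-at-a-time loop has to buffer every input until the epoch ends before any group can be served.

```python
from itertools import groupby
from operator import itemgetter


def serve_groups_in_micro_batch(batch):
    """Sort a bounded batch by key, then yield (key, values) one group at a time."""
    # sorted() is stable, so arrival order is preserved within each key.
    ordered = sorted(batch, key=itemgetter(0))
    for key, rows in groupby(ordered, key=itemgetter(0)):
        yield key, [value for _, value in rows]


def serve_groups_record_at_a_time(records, epoch_marker):
    """Hypothetical continuous-style loop: nothing is emitted until the epoch ends."""
    buffer = []  # assumption: an in-memory buffer stands in for state
    for record in records:
        if record is epoch_marker:
            # Only now is the input for this epoch complete enough to sort.
            yield from serve_groups_in_micro_batch(buffer)
            buffer = []
        else:
            buffer.append(record)


batch = [("b", 1), ("a", 2), ("b", 3), ("a", 4)]
micro_batch_result = list(serve_groups_in_micro_batch(batch))

EPOCH_END = object()  # hypothetical end-of-epoch sentinel
continuous_result = list(serve_groups_record_at_a_time(batch + [EPOCH_END], EPOCH_END))
```

Both calls produce the same groups, but in the second one downstream consumers see no output until the sentinel arrives, which is the buffering cost described above.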
