aokolnychyi commented on pull request #31700: URL: https://github.com/apache/spark/pull/31700#issuecomment-790935772
@cloud-fan, I am not sure about the continuous mode, but I think there is a valid use case for micro-batch streaming. The required distribution and ordering apply to individual writes, so they do not imply that the underlying sink is globally ordered. For example, suppose we are writing to a partitioned file sink. Even if we group incoming data by partition, a single output task may still receive records for multiple partitions. A naive sink implementation may close the current file and open a new one each time it sees a record for a different partition, producing a large number of files. An alternative implementation can keep multiple files open, but that is not ideal either because it increases memory consumption. That's why ordering data within a task by partition seems like a good default for micro-batch streaming.
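To make the file-count argument concrete, here is a minimal sketch (plain Python with hypothetical names, not Spark code) of a naive single-file writer that switches files whenever the partition value changes. It shows why sorting records by partition within a task bounds the number of file open events to one per partition:

```python
def files_opened(records):
    """Count file open events for a naive writer that keeps only one
    file open at a time and switches whenever the partition changes."""
    opens = 0
    current = None
    for partition, _value in records:
        if partition != current:
            opens += 1          # close the current file, open a new one
            current = partition
    return opens

# Unsorted task input: partition values interleave, so the naive
# writer reopens a file on almost every record.
unsorted = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Same records ordered by partition within the task: exactly one
# file open event per distinct partition.
ordered = sorted(unsorted, key=lambda r: r[0])

print(files_opened(unsorted))  # 5 open events
print(files_opened(ordered))   # 2 open events, one per partition
```

With interleaved input the writer performs one open per partition switch (five here), while the sorted input needs only two, one per distinct partition, without holding multiple files open at once.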
