aokolnychyi commented on pull request #31700: URL: https://github.com/apache/spark/pull/31700#issuecomment-790935772
@cloud-fan, I am not sure about the continuous mode, but I think there is a valid use case for micro-batch streaming. The required distribution and ordering apply to individual writes, so they do not imply that the underlying sink is globally ordered. For example, suppose we are writing to a partitioned file sink. Even if we group incoming data by partition, a single output task may still receive records for multiple partitions. A naive sink implementation may close the current file and open a new one each time it sees a record for a different partition, producing a large number of files. An alternative implementation can keep multiple files open, but that is not ideal either because it increases memory consumption. That's why ordering data within a task by partition seems like a good default for micro-batch streaming.
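To make the file-count argument concrete, here is a minimal sketch (plain Python with hypothetical names, not Spark code) of a naive single-file writer that switches files whenever the partition value changes. It shows why sorting records by partition within a task bounds the number of file open events to one per partition:

```python
def files_opened(records):
    """Count file open events for a naive writer that keeps only one
    file open at a time and switches whenever the partition changes."""
    opens = 0
    current = None
    for partition, _value in records:
        if partition != current:
            opens += 1          # close the current file, open a new one
            current = partition
    return opens

# Unsorted task input: partition values interleave, so the naive
# writer reopens a file on almost every record.
unsorted = [("a", 1), ("b", 2), ("a", 3), ("b", 4), ("a", 5)]

# Same records ordered by partition within the task: exactly one
# file open event per distinct partition.
ordered = sorted(unsorted, key=lambda r: r[0])

print(files_opened(unsorted))  # 5 open events
print(files_opened(ordered))   # 2 open events, one per partition
```

With interleaved input the writer performs one open per partition switch (five here), while the sorted input needs only two, one per distinct partition, without holding multiple files open at once.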
