[GitHub] [spark] jdesjean commented on pull request #42772: [SPARK-45051][CONNECT] Use UUIDv7 by default for operation IDs to make operations chronologically sortable

via GitHub Fri, 01 Sep 2023 09:09:57 -0700


jdesjean commented on PR #42772:
URL: https://github.com/apache/spark/pull/42772#issuecomment-1702991483


   > I am not sure I understand the use case here. Why do we exactly need them 
to be sortable? And is this a must-have?
   > 
   > One of the problems I see here is that you rely on the client to generate 
a proper v7 UUID. We do not control the client; it is an open protocol, so a new 
implementation can just provide a v4 UUID or generate an improper v7. There is 
also the matter of time drift between client and server: how will this affect 
the generated UUIDs?
   
   When the operation ID is used as a primary key (PK), UUIDv7 gives us the nice 
property that the key order roughly matches the start-time order of the queries. 
While no one should rely on this property exclusively, having the records 
roughly ordered improves sorting performance.
   Additionally, most lookups sort by start time, and sorting by operation ID as 
a secondary key is useful to obtain a consistent ordering when start times are 
duplicated. Roughly ordered records again help performance here.
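
   To illustrate the ordering property, here is a minimal Python sketch. It is 
not the code in this PR: `uuid7_sketch` is a hypothetical helper that follows 
the UUIDv7 bit layout (48-bit Unix timestamp in milliseconds, then version, 
variant, and random bits), and the timestamps are made-up values.

```python
import os
import random
import uuid


def uuid7_sketch(unix_ts_ms: int) -> uuid.UUID:
    """Hypothetical helper: build a UUIDv7-style value from a millisecond timestamp."""
    rand = int.from_bytes(os.urandom(10), "big")  # 80 random bits
    rand_a = rand >> 68                           # top 12 bits
    rand_b = rand & ((1 << 62) - 1)               # low 62 bits

    value = (unix_ts_ms & ((1 << 48) - 1)) << 80  # 48-bit timestamp, most significant
    value |= 0x7 << 76                            # version 7
    value |= rand_a << 64
    value |= 0b10 << 62                           # RFC variant bits
    value |= rand_b
    return uuid.UUID(int=value)


# Made-up start times (ms since epoch), one operation per millisecond value.
start_times = [1_693_584_000_000, 1_693_584_000_250, 1_693_584_001_000]
ids_in_start_order = [uuid7_sketch(ts) for ts in start_times]

# Shuffle, then sort by the ID alone: the start-time order comes back,
# because the timestamp occupies the most significant bits.
shuffled = ids_in_start_order[:]
random.shuffle(shuffled)
sorted_by_id = sorted(shuffled, key=str)

# Tie-breaker case: two operations share a start time, so the operation ID
# is used as a secondary sort key to get a deterministic order.
records = [(1_693_584_000_000, uuid7_sketch(1_693_584_000_000)) for _ in range(2)]
stable_order = sorted(records, key=lambda r: (r[0], str(r[1])))
```

   Within the same millisecond the order depends on the random bits, which is 
why the ordering is only "rough" and should not be relied on exclusively.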


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

