[GitHub] [spark] otterc commented on pull request #33446: [SPARK-36215][SHUFFLE] Add logging for slow fetches to diagnose external shuffle service issues

GitBox Tue, 27 Jul 2021 18:16:34 -0700


otterc commented on pull request #33446:
URL: https://github.com/apache/spark/pull/33446#issuecomment-887937290



   > > And I think a more straightforward way to do this probably would be to 
log fetches which are slower than most fetches. For example, we could log 
the slowest 5% of all fetches. The downside is that we have to store more 
intermediate state.
   > 
   > This is a great idea actually. We can use [Dropwizard's Histogram with a 
uniform 
reservoir](https://metrics.dropwizard.io/4.1.2/manual/core.html#histograms) to 
keep histogram information with constant storage space.
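
   For reference, the constant-space idea quoted above can be sketched roughly 
as follows. This is a minimal illustration, not Dropwizard's code: it assumes a 
uniform reservoir based on Vitter's Algorithm R, and the class, the `quantile` 
helper, and `is_slow` are hypothetical names. Python is used only to keep the 
sketch short.

```python
import random


class UniformReservoir:
    """Fixed-size uniform sample of a stream (Vitter's Algorithm R)."""

    def __init__(self, size=1028, seed=None):
        self._size = size
        self._values = []
        self._count = 0
        self._rng = random.Random(seed)

    def update(self, value):
        """Record one observation without growing past `size` entries."""
        self._count += 1
        if len(self._values) < self._size:
            self._values.append(value)
        else:
            # Keep the new value with probability size / count.
            slot = self._rng.randrange(self._count)
            if slot < self._size:
                self._values[slot] = value

    def quantile(self, q):
        """Return the sampled value at fraction q (0 <= q <= 1) of the sorted sample."""
        if not self._values:
            raise ValueError("no samples recorded")
        ordered = sorted(self._values)
        index = min(len(ordered) - 1, int(q * len(ordered)))
        return ordered[index]


def is_slow(reservoir, bytes_per_sec, q=0.05):
    """True if this throughput is at or below the q-quantile seen so far."""
    return bytes_per_sec <= reservoir.quantile(q)
```

   Storage stays bounded by `size` however many fetches are recorded, at the 
cost of keeping the sample and sorting it whenever a quantile is read.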
   
   I have a different opinion here. 
   - The number of in-flight requests is bounded by parameters like 
`maxBytesInFlight` and `maxReqsInFlight`, and so is the number of blocks 
fetched in each network request. So even for applications with a larger 
shuffle, these limits cap the amount of shuffle data being fetched at any one 
time. When a fetch is slow (bytes fetched per second below the threshold), the 
more likely cause is a server-side issue, that is, how long the shuffle server 
took to respond to the request. So I don't think the threshold for marking a 
fetch as slow should vary from application to application in a similar 
cluster. 
   
   - A slow fetch is not necessarily caused by this particular application. 
The shuffle server could be overloaded with requests from other apps running 
in the same cluster. So, again, the threshold for categorizing a fetch as slow 
depends mostly on the cluster, and the default should be something the cluster 
admins can determine for that cluster.
   
   - Users usually don't change configs like `maxBytesInFlight` and 
`maxReqsInFlight`, so they shouldn't have to change the configs related to 
slow-fetch logging either. 
   
   - Storing per-fetch data adds more state to the iterator, even though we 
can bound it, and the extra code around it seems like overkill. For a user, 
this log helps identify why one run of an application took longer than a 
previous run. If most runs of the same application take about the same time, 
they will not even bother with it. 
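
   To make the fixed-threshold approach concrete, here is a minimal sketch of 
the check I have in mind. It is illustrative only and not the code in this PR: 
the function names, the logger name, and the `threshold_bytes_per_sec` 
parameter are hypothetical, and Python is used just for brevity.

```python
import logging

logger = logging.getLogger("shuffle.fetch")  # hypothetical logger name


def is_slow_fetch(bytes_fetched, duration_ms, threshold_bytes_per_sec):
    """True if the fetch throughput fell below a cluster-wide threshold."""
    if duration_ms <= 0:
        # Too fast to measure, so never classified as slow.
        return False
    bytes_per_sec = bytes_fetched * 1000.0 / duration_ms
    return bytes_per_sec < threshold_bytes_per_sec


def log_if_slow(address, bytes_fetched, duration_ms, threshold_bytes_per_sec):
    """Log a warning for a slow fetch; needs no state beyond the threshold."""
    slow = is_slow_fetch(bytes_fetched, duration_ms, threshold_bytes_per_sec)
    if slow:
        logger.warning(
            "Slow fetch from %s: %d bytes in %d ms",
            address, bytes_fetched, duration_ms,
        )
    return slow
```

   The only input besides the fetch itself is one threshold that a cluster 
admin can set once, so the iterator keeps no history of earlier fetches.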


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]



