squito commented on issue #24175: [SPARK-27232][SQL]Ignore file locality in 
InMemoryFileIndex if spark.locality.wait is set to zero
URL: https://github.com/apache/spark/pull/24175#issuecomment-478720781
 
 
   @LantaoJin just pointed me at this based on some discussion in 
https://github.com/apache/spark/pull/23951.  I totally understand the use case 
for this, but it needs to use a new config.  Even with locality wait == 0, 
Spark still tries to schedule tasks to take advantage of locality.  It just 
means Spark won't *wait* until it gets an offer with better locality.  In fact, 
I regularly recommend that users set locality wait to 0, even on colocated 
clusters.
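
   To make the distinction concrete, here is a toy sketch of delay scheduling for a single resource offer. It is not Spark's actual scheduler code: `Task`, `pick_task`, and the host names are hypothetical, and real Spark tracks several locality levels rather than the single "local or not" check assumed here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Task:
    task_id: int
    preferred_hosts: List[str]  # hosts assumed to hold this task's input data


def pick_task(pending: List[Task], offer_host: str,
              waited_s: float, locality_wait_s: float) -> Optional[Task]:
    """Toy model: choose a task for one offer from `offer_host`."""
    # Step 1: a task whose data lives on the offered host is always preferred,
    # regardless of the locality wait setting.
    for task in pending:
        if offer_host in task.preferred_hosts:
            return task
    # Step 2: no local task for this offer. Fall back to a non-local task only
    # once the wait has elapsed; with a wait of 0 that happens immediately.
    if pending and waited_s >= locality_wait_s:
        return pending[0]
    return None  # decline the offer and keep waiting for a better one


pending = [Task(1, ["host-a"]), Task(2, ["host-b"])]
local_pick = pick_task(pending, "host-b", waited_s=0.0, locality_wait_s=0.0)
remote_pick = pick_task(pending, "host-c", waited_s=0.0, locality_wait_s=0.0)
held_back = pick_task(pending, "host-c", waited_s=0.0, locality_wait_s=3.0)
```

   In this model, a wait of 0 changes only the fallback step: the offer from `host-b` still goes to the task that prefers `host-b`, so locality information is still being used.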
   
   Furthermore, even in disaggregated clusters, you don't necessarily want to 
set *all* locality waits to 0, right?  I mean, you still might want to wait for 
locality to persisted data from cached RDDs?
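
   As a rough illustration of setting only some waits to 0, the per-level settings `spark.locality.wait.process`, `spark.locality.wait.node`, and `spark.locality.wait.rack` can be overridden individually. The values below and the `to_submit_args` helper are hypothetical, and whether keeping only the process-level wait is the right split for a given cluster is an assumption.

```python
# Per-level overrides; each key falls back to spark.locality.wait when unset.
# The values are illustrative, not recommendations.
locality_conf = {
    "spark.locality.wait.process": "3s",  # still wait for executors holding cached blocks
    "spark.locality.wait.node": "0s",     # do not wait for node-local offers
    "spark.locality.wait.rack": "0s",     # do not wait for rack-local offers
}


def to_submit_args(conf):
    """Hypothetical helper: render the settings as spark-submit --conf flags."""
    args = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


submit_args = to_submit_args(locality_conf)
```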
   
   https://github.com/apache/spark/pull/23951 pointed out a case for skipping 
rack resolution entirely on disaggregated clusters.  This is another good 
case.  I'm not entirely sure whether they should be controlled by the same 
setting ... I wonder if there is some HDFS-specific mechanism which might be 
appropriate here.  E.g., you might have "semi" disaggregated clusters with most 
data living remotely, but some small local HDFS.  I'm not sure if there is an 
easy way to figure this out.

