Github user steveloughran commented on a diff in the pull request:

    https://github.com/apache/spark/pull/19623#discussion_r148325280
  
    --- Diff: 
sql/core/src/main/java/org/apache/spark/sql/sources/v2/reader/ReadTask.java ---
    @@ -37,13 +37,19 @@
        * The preferred locations where this read task can run faster, but 
Spark does not guarantee that
        * this task will always run on these locations. The implementations 
should make sure that it can
        * be run on any location. The location is a string representing the 
host name of an executor.
    +   *
    +   * If an exception was thrown, the action would fail and we guarantee 
that no Spark job was
    +   * submitted.
        */
       default String[] preferredLocations() {
         return new String[0];
    --- End diff --
    
    I'm sure there was some specific filter for it, though a quick grep only 
shows that happening in {{ReliableCheckpointRDD}}. The reason the filter is 
needed is that getHostByName(localhost) does return a host, but scheduling gets 
confused: without the filtering, work can get held back until the driver 
concludes that "localhost" isn't free and assigns it elsewhere in the cluster. 
(Hive can do this unintentionally, which is why I once traced through Spark's 
use of getBlockLocations.)
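
    The filtering described above can be sketched roughly as follows. This is 
a hypothetical standalone helper (the method name `filterPreferredHosts` and 
the exact set of names dropped are assumptions, not Spark's actual code): it 
strips "localhost"-style entries from a block-location host list before they 
are reported as preferred locations, so the scheduler never waits on a 
"localhost" slot that does not correspond to a real executor.

    ```java
    import java.util.Arrays;

    public class PreferredHostsFilter {

        // Hypothetical helper: drop loopback-style host names from a
        // preferred-locations list. Real hosts pass through unchanged.
        static String[] filterPreferredHosts(String[] hosts) {
            return Arrays.stream(hosts)
                    .filter(h -> h != null
                            && !h.isEmpty()
                            && !h.equalsIgnoreCase("localhost")
                            && !h.equals("127.0.0.1"))
                    .toArray(String[]::new);
        }

        public static void main(String[] args) {
            String[] filtered = filterPreferredHosts(
                    new String[] {"localhost", "node1.example.com", "127.0.0.1"});
            // Only the real cluster host survives the filter.
            System.out.println(Arrays.toString(filtered));
        }
    }
    ```

    Returning an empty array (as the default `preferredLocations()` in the 
diff does) is the safe fallback: it tells the scheduler "no preference" rather 
than naming a host that cannot satisfy locality.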


---
