GitHub user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/19439
  
    @liancheng I see you've worked with PathFilters in Spark SQL, so I'll ask 
here: We're uncertain how Hadoop uses PathFilters, and understanding that (and 
using them correctly) would help us ensure deterministic sampling in this image 
reader.
    
    Background: We use the PathFilter abstraction for the new 
"SamplePathFilter" class introduced by this PR.  That filter is for sampling a 
subset of rows.
    
    Our question: Will this sampling be deterministic?
    
    My thoughts:
    * If we set a random seed, then this *may* be deterministic, depending on 
how the filesystem uses PathFilters.
      * If a new PathFilter is instantiated for each partition read, then we 
get determinism.  (Each new instance is created with the same seed, so it will 
behave the same way each time, assuming files in the partition are read in the 
same order.)
      * If a PathFilter may be reused across partitions, then we cannot 
guarantee determinism, since the filter's random state would carry over from 
one partition to the next.
    * Is this explanation reasonable, and do you know what we should expect?
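
To make the two cases concrete, here is a rough, self-contained sketch. It does not use Hadoop or the actual SamplePathFilter from this PR: `SeededFilter`, `decisions`, and the file names are hypothetical stand-ins, and I'm assuming the filter draws from a seeded `java.util.Random` on every `accept` call. Under that assumption, a fresh filter per partition repeats its decisions exactly, while a shared filter's decisions for a partition depend on how many paths it has already seen.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

public class SeededFilterSketch {

    // Hypothetical stand-in for a seeded sampling filter. It mimics the
    // accept-or-reject shape of a path filter but is NOT Hadoop's PathFilter.
    static final class SeededFilter {
        private final Random random;
        private final double sampleRatio;

        SeededFilter(long seed, double sampleRatio) {
            this.random = new Random(seed);
            this.sampleRatio = sampleRatio;
        }

        // Assumption: each call consumes one random draw, so the decision
        // depends on how many calls came before, not on the path itself.
        boolean accept(String path) {
            return random.nextDouble() < sampleRatio;
        }
    }

    // Runs every path through the given filter and records each decision in order.
    static List<Boolean> decisions(SeededFilter filter, List<String> paths) {
        List<Boolean> out = new ArrayList<>();
        for (String path : paths) {
            out.add(filter.accept(path));
        }
        return out;
    }

    public static void main(String[] args) {
        long seed = 42L;
        double ratio = 0.5;
        List<String> partitionA = Arrays.asList("a/0.png", "a/1.png", "a/2.png");
        List<String> partitionB = Arrays.asList("b/0.png", "b/1.png", "b/2.png");

        // Case 1: a new filter per partition read. Same seed and same file
        // order give the same decisions on every run.
        List<Boolean> first = decisions(new SeededFilter(seed, ratio), partitionB);
        List<Boolean> second = decisions(new SeededFilter(seed, ratio), partitionB);
        System.out.println("fresh filter repeatable: " + first.equals(second));

        // Case 2: one filter reused across partitions. The decisions for
        // partition B now depend on partition A having been read first.
        SeededFilter shared = new SeededFilter(seed, ratio);
        decisions(shared, partitionA);
        List<Boolean> afterA = decisions(shared, partitionB);
        System.out.println("fresh filter on B:  " + first);
        System.out.println("shared filter on B: " + afterA);
    }
}
```

In the shared case, the decisions for partition B are whatever a fresh filter would have produced for positions 3 through 5 of the combined sequence, so any change in which partitions the instance sees first, or in what order, can change the sample.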

