GitHub user jkbradley commented on the issue:

https://github.com/apache/spark/pull/19439

@liancheng I see you've worked with PathFilters in Spark SQL, so I'll ask here: We're uncertain about how PathFilters are used in Hadoop, and it would be helpful to understand (and use) them in order to ensure deterministic behavior for sampling in this image reader.

Background: We use the PathFilter abstraction for the new "SamplePathFilter" class introduced by this PR. That filter is for sampling a subset of rows.

Our question: Will this be deterministic?

My thoughts:
* If we set a random seed, then this *may* be deterministic depending on the usage of PathFilters by the filesystem.
* If a new PathFilter is instantiated for each partition read, then we get determinism. (Since a new instance is created using the same seed, it will behave the same way each time, assuming files in the partition are read in the same order.)
* If a PathFilter may be reused across partitions, then we cannot guarantee determinism.

Is this explanation reasonable, and do you know what we should expect?
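
To make the three cases above concrete, here is a minimal, language-neutral sketch in Python. It does not use the Hadoop `PathFilter` interface or the PR's actual `SamplePathFilter`; the class `SeededSampleFilter`, the helper `list_with_fresh_filter`, and the file names are all hypothetical stand-ins, chosen only to show how a stateful seeded filter behaves when it is re-created versus reused.

```python
import random


class SeededSampleFilter:
    """Hypothetical stand-in for a stateful sampling filter.

    This is NOT the Hadoop PathFilter API; it only mimics the idea of an
    accept(path) method backed by a seeded random number generator.
    """

    def __init__(self, sample_ratio, seed):
        self.sample_ratio = sample_ratio
        self.rng = random.Random(seed)  # private RNG, seeded once per instance

    def accept(self, path):
        # The decision depends on how many draws came before, not on the path.
        return self.rng.random() < self.sample_ratio


def list_with_fresh_filter(paths, sample_ratio, seed):
    """Simulate one listing that constructs its own filter instance."""
    path_filter = SeededSampleFilter(sample_ratio, seed)
    return [p for p in paths if path_filter.accept(p)]


# Hypothetical file names, for illustration only.
paths = [f"img_{i:03d}.png" for i in range(200)]

# Case 1: a new filter per listing, same seed, same file order.
first = list_with_fresh_filter(paths, 0.5, seed=42)
second = list_with_fresh_filter(paths, 0.5, seed=42)
fresh_is_repeatable = first == second

# Case 2: one filter instance reused for two listings.
shared = SeededSampleFilter(0.5, seed=42)
reuse_a = [p for p in paths if shared.accept(p)]
reuse_b = [p for p in paths if shared.accept(p)]  # RNG state has advanced
reuse_is_repeatable = reuse_a == reuse_b

# Case 3: a new filter, same seed, but files visited in a different order.
reordered = list_with_fresh_filter(list(reversed(paths)), 0.5, seed=42)
order_independent = set(reordered) == set(first)

print(fresh_is_repeatable, reuse_is_repeatable, order_independent)
```

In this sketch only the first case yields the same sample twice. Whether Hadoop actually constructs a new filter per listing, and whether it visits files in a stable order, is exactly the open question above; the sketch does not answer it.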