Github user jkbradley commented on the issue:
https://github.com/apache/spark/pull/19439
@liancheng I see you've worked with PathFilters in Spark SQL, so I'll ask
here: We're uncertain about how PathFilters are used in Hadoop, and it would be
helpful to understand (and use) them in order to ensure deterministic behavior
for sampling in this image reader.
Background: We use the PathFilter abstraction for the new
"SamplePathFilter" class introduced by this PR. That filter is for sampling a
subset of rows.
Our question: Will this sampling be deterministic, i.e., will the same seed always select the same files?
My thoughts:
* If we set a random seed, then this *may* be deterministic depending on
the usage of PathFilters by the filesystem.
* If a new PathFilter is instantiated for each partition read, then we
get determinism. (Since a new instance is created using the same seed, it will
behave the same way each time, assuming files in the partition are read in the
same order.)
* If a PathFilter may be reused across partitions, then we cannot
guarantee determinism.
Is this explanation reasonable, and do you know what we should expect?
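
To make the two cases above concrete, here is a minimal sketch in plain Python. It is only a simulation of the reasoning, not Hadoop's PathFilter API and not the code in this PR: `SeededSampleFilter`, `read_with_fresh_filters`, and `read_with_shared_filter` are hypothetical names, and the only assumption carried over is that the filter draws from a seeded RNG once per path it is asked about.

```python
import random


class SeededSampleFilter:
    """Hypothetical stand-in for a seeded sampling filter (not Hadoop's API)."""

    def __init__(self, seed, ratio):
        # Private RNG state; it advances by one draw on every accept() call.
        self._rng = random.Random(seed)
        self._ratio = ratio

    def accept(self, path):
        # The decision depends on how many paths this instance has already
        # seen, not on the path itself.
        return self._rng.random() < self._ratio


def read_with_fresh_filters(partitions, seed, ratio):
    """Case 1: a new filter per partition. Each partition's sample depends
    only on the seed and on the order of files within that partition."""
    result = {}
    for name, paths in partitions.items():
        flt = SeededSampleFilter(seed, ratio)
        result[name] = [p for p in paths if flt.accept(p)]
    return result


def read_with_shared_filter(partitions, order, seed, ratio):
    """Case 2: one filter reused across partitions. Each partition's sample
    also depends on which partitions were read before it."""
    flt = SeededSampleFilter(seed, ratio)
    result = {}
    for name in order:
        result[name] = [p for p in partitions[name] if flt.accept(p)]
    return result


if __name__ == "__main__":
    parts = {
        "a": [f"a/{i}.png" for i in range(20)],
        "b": [f"b/{i}.png" for i in range(20)],
    }
    fresh_1 = read_with_fresh_filters(parts, seed=42, ratio=0.5)
    fresh_2 = read_with_fresh_filters(parts, seed=42, ratio=0.5)
    print("fresh filters repeatable:", fresh_1 == fresh_2)

    shared_ab = read_with_shared_filter(parts, ["a", "b"], seed=42, ratio=0.5)
    shared_ba = read_with_shared_filter(parts, ["b", "a"], seed=42, ratio=0.5)
    print("shared filter, same result for both read orders:",
          shared_ab == shared_ba)
```

In this sketch the fresh-filter path gives the same sample on every run, while the shared-filter path gives a sample for each partition that depends on the position of that partition in the read order. Whether Hadoop actually behaves like case 1 or case 2 is exactly the open question.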