cchighman commented on pull request #28841:
URL: https://github.com/apache/spark/pull/28841#issuecomment-647838170
> @gengliangwang @dongjoon-hyun
>
> Are there any strong feelings against making `pathGlobfilter` a derived
class strategy, along with the disjointed implementation in `CommandUtils` for
`PathIgnoreNonData`, as been implemented here in `InMemoryFileIndex` except
that the implementation approach and strategy would actually be according to
`pathGlobFilter` in `PartitioningAwareFileIndex`?
>
> In effect, the implementation in my change above in `listLeafFiles` was
extending what appears to be a duplication of the initial strategy used here
for `pathGlobFilter`. In `PartitioningAwareFileIndex`, the `allFiles` method
uses the same filtering methodology.
>
> So, the difference would be the below method signature taking the same
strategy approach on the abstraction to determine which filters to apply:
>
> ` protected def matchGlobPattern(file: FileStatus): Boolean = {
pathGlobFilter.forall(_.accept(file.getPath)) }`
>
> _changes to_
>
> ` protected def matchPathFilter(file: FileStatus): Boolean = {
pathFilters.forall(_.accept(file.getPath)) }`
>
> This would mean there is little implementation changes in the current code
and it can be easily extended. `GlobFilter`, being a derived class from
`PathFilter`, makes this a rather perfect fit since each case is implementing
the `accept` interface. We would also be consolidating these two different
filtering approaches in the same place, reducing code, as to eliminate any sort
of confusion in the future.
>
> I can't see any discernable benefit as to why `InMemoryFileIndex` needs a
separate implementation in `listLeafFiles` for path filtering that isn't
already handled when filtering during `allfiles` where `pathGlobFilter` is
implemented.
> @gengliangwang @dongjoon-hyun
>
> Are there any strong feelings against making `pathGlobfilter` a derived
class strategy, along with the disjointed implementation in `CommandUtils` for
`PathIgnoreNonData`, as been implemented here in `InMemoryFileIndex` except
that the implementation approach and strategy would actually be according to
`pathGlobFilter` in `PartitioningAwareFileIndex`?
>
> In effect, the implementation in my change above in `listLeafFiles` was
extending what appears to be a duplication of the initial strategy used here
for `pathGlobFilter`. In `PartitioningAwareFileIndex`, the `allFiles` method
uses the same filtering methodology.
>
> So, the difference would be the below method signature taking the same
strategy approach on the abstraction to determine which filters to apply:
>
> ` protected def matchGlobPattern(file: FileStatus): Boolean = {
pathGlobFilter.forall(_.accept(file.getPath)) }`
>
> _changes to_
>
> ` protected def matchPathFilter(file: FileStatus): Boolean = {
pathFilters.forall(_.accept(file.getPath)) }`
>
> This would mean there is little implementation changes in the current code
and it can be easily extended. `GlobFilter`, being a derived class from
`PathFilter`, makes this a rather perfect fit since each case is implementing
the `accept` interface. We would also be consolidating these two different
filtering approaches in the same place, reducing code, as to eliminate any sort
of confusion in the future.
>
> I can't see any discernable benefit as to why `InMemoryFileIndex` needs a
separate implementation in `listLeafFiles` for path filtering that isn't
already handled when filtering during `allfiles` where `pathGlobFilter` is
implemented.
The need arises since we only have one filtering strategy where
`pathGlobFilter` is used in `PartitioningAwareFileIndex` and handling cleanly
if we now have more than one approach to path filtering.
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]