johannaojeling commented on PR #25809: URL: https://github.com/apache/beam/pull/25809#issuecomment-1489761150
> I think I'm a touch surprised that neither of these are Splittable DoFns. I'd expect Match to be able to "sub-element split" on its input globs to allow downstream processing of the files to ultimately be split down to each file (if not within the file itself), since we know/can find the size and so forth, to begin to make decent splitting decisions.

Hmm, I don't see clearly how the incoming glob element in matchFn should be split into restriction pairs. Only after the List operation do we know how many output elements there are per input element.

I guess we could create an initial offsetrange.Restriction based on the total count of files that match the glob pattern from a List invocation, then split that into sub-ranges. The downside is that this would require additional API calls, since the same List operation would have to be repeated in ProcessElement, and the list result would need to be sorted to determine which files fall under the current restriction. I'm curious to hear what your suggestion would be for implementing the Match transform with an SDF.

As I saw it, matchFn essentially serves the same purpose as expandFn currently does in textio/avroio/parquetio, and additionally attaches the file size. I was thinking that those IOs could potentially be refactored to use the fileio Match and Read transforms to reduce repetition, if we want that. This applies at least to textio, which would use both the file handle and the size from a ReadableFile.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
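To make the index-range idea above concrete, here is a minimal sketch in plain Go. It does not use Beam's offsetrange package or the PR's matchFn: the `restriction` type and the `evenSplits` and `filesFor` helpers are hypothetical stand-ins, and the file names are made up. It only illustrates why the listing has to be sorted (so every sub-range sees the same order) and why each sub-range needs the listing again.

```go
package main

import (
	"fmt"
	"sort"
)

// restriction is a hypothetical stand-in for a half-open index range
// [Start, End) over the files matched by one glob. It is not Beam's type.
type restriction struct {
	Start, End int64
}

// evenSplits divides r into at most n contiguous sub-ranges of near-equal size.
func evenSplits(r restriction, n int64) []restriction {
	size := r.End - r.Start
	if size <= 0 || n <= 0 {
		return nil
	}
	if n > size {
		n = size
	}
	base, extra := size/n, size%n
	out := make([]restriction, 0, n)
	start := r.Start
	for i := int64(0); i < n; i++ {
		end := start + base
		if i < extra {
			end++ // spread the remainder over the first sub-ranges
		}
		out = append(out, restriction{Start: start, End: end})
		start = end
	}
	return out
}

// filesFor returns the files covered by r. The listing is sorted first so
// that the same index always refers to the same file, whichever worker asks.
func filesFor(listed []string, r restriction) []string {
	sorted := append([]string(nil), listed...)
	sort.Strings(sorted)
	start, end := r.Start, r.End
	if start < 0 {
		start = 0
	}
	if end > int64(len(sorted)) {
		end = int64(len(sorted))
	}
	if start >= end {
		return nil
	}
	return sorted[start:end]
}

func main() {
	// Pretend this is the unordered result of one List call for a glob.
	listed := []string{"c.txt", "a.txt", "e.txt", "b.txt", "d.txt"}

	// Initial restriction: one index per matched file.
	whole := restriction{Start: 0, End: int64(len(listed))}

	for _, part := range evenSplits(whole, 2) {
		// In a real SDF, each sub-range would have to obtain the listing again.
		fmt.Println(part, filesFor(listed, part))
	}
}
```

In this sketch the cost mentioned above shows up directly: `filesFor` needs the full listing and a sort for every sub-range, which in a real pipeline would mean one extra List call per restriction.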
