johannaojeling commented on PR #25809:
URL: https://github.com/apache/beam/pull/25809#issuecomment-1489761150

   > I think I'm a touch surprised that neither of these is a Splittable DoFn. 
I'd expect Match to be able to "sub-element split" on its input globs to allow 
downstream processing of the files to ultimately be split down to each file (if 
not within the file itself), since we know or can find the size and so forth, 
and can begin to make decent splitting decisions.
   
   Hmm, I don't clearly see how the incoming glob element in matchFn should be 
split into restriction pairs. Only after the List operation do we know how many 
output elements there are per input element. I guess we could create an initial 
offsetrange.Restriction based on the total count of files that match the glob 
pattern from a List invocation, then split that into sub-ranges. The downside 
is that this would require repeating the same List call in ProcessElement and 
sorting the list result to determine which files fall under the current 
restriction. I'm curious to hear what your suggestion would be for implementing 
the Match transform as an SDF.
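
   To make the idea concrete, here is a rough, standalone sketch of the 
count-based approach. It does not use the Beam SDK: the restriction type is a 
hypothetical stand-in for offsetrange.Restriction, and evenSplits and filesFor 
are illustrative names, not anything in the PR. It assumes the List result is 
already in hand as a slice of file names.

```go
package main

import (
	"fmt"
	"sort"
)

// restriction is a hypothetical stand-in for offsetrange.Restriction:
// a half-open index range [Start, End) into the sorted list of matches.
type restriction struct {
	Start, End int64
}

// evenSplits divides r into at most n contiguous sub-ranges of near-equal size.
func evenSplits(r restriction, n int64) []restriction {
	size := r.End - r.Start
	if size <= 0 || n <= 0 {
		return nil
	}
	if n > size {
		n = size
	}
	out := make([]restriction, 0, n)
	for i := int64(0); i < n; i++ {
		out = append(out, restriction{
			Start: r.Start + i*size/n,
			End:   r.Start + (i+1)*size/n,
		})
	}
	return out
}

// filesFor returns the matched files covered by r. It sorts a copy of the
// listing so that every call maps the same index to the same file.
func filesFor(listed []string, r restriction) []string {
	sorted := append([]string(nil), listed...)
	sort.Strings(sorted)
	return sorted[r.Start:r.End]
}

func main() {
	// Pretend result of one List call for a glob; order is not guaranteed.
	listed := []string{"c.txt", "a.txt", "e.txt", "b.txt", "d.txt"}
	initial := restriction{Start: 0, End: int64(len(listed))}
	for _, r := range evenSplits(initial, 2) {
		fmt.Println(r, filesFor(listed, r))
	}
}
```

   In a real SDF, filesFor would correspond to the work done in ProcessElement, 
which is where the repeated List call and the sort would show up for each 
sub-range.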
   
   As I saw it, matchFn essentially serves the same purpose as expandFn 
currently does in textio/avroio/parquetio, and additionally attaches the file 
size. I was thinking that those IOs could potentially be refactored to use the 
fileio Match and Read transforms to reduce repetition, if we want that. This 
applies at least to textio, which would use both the file handle and the size 
from a ReadableFile.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
