johannaojeling commented on PR #28177:
URL: https://github.com/apache/beam/pull/28177#issuecomment-1694669559

   R: @lostluck
   
   I wanted to look into improving the avroio and parquetio read transforms, 
and refactoring them to use the fileio transforms is a first step. The 
intermediate PCollection now consists of `fileio.ReadableFile` elements, 
which carry each file's size; that size can be used for splitting in a later step.
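
   To illustrate how a known file size could feed a later splitting step, here 
is a minimal sketch. It is not part of this PR: `offsetRange` and 
`splitBySize` are hypothetical names, and a real Avro or Parquet reader would 
presumably have to align ranges to block or row-group boundaries rather than 
cut at arbitrary byte offsets.

```go
package main

import "fmt"

// offsetRange is a hypothetical half-open byte range [Start, End).
type offsetRange struct {
	Start, End int64
}

// splitBySize divides a file of the given size into consecutive ranges of
// at most chunk bytes each. It returns nil for non-positive inputs.
func splitBySize(size, chunk int64) []offsetRange {
	if size <= 0 || chunk <= 0 {
		return nil
	}
	var ranges []offsetRange
	for start := int64(0); start < size; start += chunk {
		end := start + chunk
		if end > size {
			end = size // the last range may be shorter than chunk
		}
		ranges = append(ranges, offsetRange{Start: start, End: end})
	}
	return ranges
}

func main() {
	// A 10-byte file split into ranges of at most 4 bytes.
	for _, r := range splitBySize(10, 4) {
		fmt.Printf("[%d, %d)\n", r.Start, r.End)
	}
}
```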
   
   A disadvantage, however, is that the match transform also fetches a file's 
last modified date, which serves `fileio.MatchContinuously` well but is 
redundant in this case. The cost should be negligible under most 
circumstances, but it is still unnecessary work. If this is an issue, we could 
potentially add an option to `fileio.matchFn` to configure which metadata is 
retrieved and which is skipped. Let me know what you think.
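
   A rough sketch of what such an option could look like, using the 
functional-options pattern. Everything here is hypothetical and standalone: 
`matchOption`, `skipLastModified`, `fakeFS`, and `describe` are illustrative 
names, not existing Beam APIs, and the fake file system only counts calls so 
the skipped lookup is visible.

```go
package main

import (
	"fmt"
	"time"
)

// fileInfo is a hypothetical stand-in for the metadata a match step emits.
type fileInfo struct {
	Path         string
	Size         int64
	LastModified time.Time
}

// matchConfig holds the (hypothetical) settings controlling which metadata is fetched.
type matchConfig struct {
	skipLastModified bool
}

// matchOption mutates a matchConfig; this is the usual functional-options shape.
type matchOption func(*matchConfig)

// skipLastModified is a hypothetical option that disables the last-modified lookup.
func skipLastModified() matchOption {
	return func(c *matchConfig) { c.skipLastModified = true }
}

// fakeFS simulates a file system and counts last-modified lookups.
type fakeFS struct {
	modTimeCalls int
}

func (fs *fakeFS) size(path string) int64 { return 42 }

func (fs *fakeFS) lastModified(path string) time.Time {
	fs.modTimeCalls++
	return time.Unix(0, 0)
}

// describe collects metadata for path, fetching the last-modified time
// only when the options have not disabled it.
func describe(fs *fakeFS, path string, opts ...matchOption) fileInfo {
	var cfg matchConfig
	for _, opt := range opts {
		opt(&cfg)
	}
	info := fileInfo{Path: path, Size: fs.size(path)}
	if !cfg.skipLastModified {
		info.LastModified = fs.lastModified(path)
	}
	return info
}

func main() {
	fs := &fakeFS{}
	describe(fs, "a.avro", skipLastModified())
	describe(fs, "b.avro")
	fmt.Println("last-modified lookups:", fs.modTimeCalls)
}
```

   With a default of fetching everything, `fileio.MatchContinuously` would 
keep its current behavior and only read transforms that opt out would skip the 
lookup.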
   
   PS. It was great to meet you at GopherCon EU. I enjoyed your lightning talk!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
