johannaojeling commented on PR #28177: URL: https://github.com/apache/beam/pull/28177#issuecomment-1694669559
R: @lostluck

I wanted to look into improving the avroio and parquetio read transforms, and refactoring them to use the fileio transforms is a first step. The `fileio.ReadableFile` elements that the intermediate PCollection now consists of contain each file's size, which can be used for splitting in a next step.

A disadvantage, however, is that the match transform also fetches each file's last modified date, which serves `fileio.MatchContinuously` well but is a redundant operation in this case. The cost should be negligible under most circumstances, but it is still a bit annoying. If this is an issue, we could potentially add an option to `fileio.matchFn` to make it configurable what should be retrieved vs. ignored.

Let me know what you think.

PS. It was great to meet you at GopherCon EU. I enjoyed your lightning talk!
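For illustration, here is a rough sketch of how an option that skips the last modified lookup might look, using the functional options pattern. Everything in it is hypothetical and not part of the current `fileio` package: `SkipLastModified`, `metadataFor`, the `fileSystem` interface, and `fakeFS` are stand-ins, and the real `matchFn` may be structured quite differently.

```go
package main

import (
	"fmt"
	"time"
)

// FileMetadata is a local stand-in holding the fields discussed above.
type FileMetadata struct {
	Path         string
	Size         int64
	LastModified time.Time
}

// matchConfig and MatchOption are hypothetical; they only illustrate the idea.
type matchConfig struct {
	skipLastModified bool
}

type MatchOption func(*matchConfig)

// SkipLastModified (hypothetical) asks the matcher not to fetch the
// last modified time, leaving FileMetadata.LastModified at its zero value.
func SkipLastModified() MatchOption {
	return func(c *matchConfig) { c.skipLastModified = true }
}

// fileSystem is an assumed minimal interface for the two lookups.
type fileSystem interface {
	Size(path string) (int64, error)
	LastModified(path string) (time.Time, error)
}

// metadataFor always fetches the size and fetches the last modified
// time only when it has not been opted out of.
func metadataFor(fs fileSystem, path string, opts ...MatchOption) (FileMetadata, error) {
	var cfg matchConfig
	for _, opt := range opts {
		opt(&cfg)
	}

	size, err := fs.Size(path)
	if err != nil {
		return FileMetadata{}, err
	}
	md := FileMetadata{Path: path, Size: size}
	if cfg.skipLastModified {
		return md, nil
	}

	mTime, err := fs.LastModified(path)
	if err != nil {
		return FileMetadata{}, err
	}
	md.LastModified = mTime
	return md, nil
}

// fakeFS is a test double that counts LastModified calls.
type fakeFS struct {
	lastModifiedCalls int
}

func (f *fakeFS) Size(string) (int64, error) { return 1024, nil }

func (f *fakeFS) LastModified(string) (time.Time, error) {
	f.lastModifiedCalls++
	return time.Unix(1700000000, 0), nil
}

func main() {
	fs := &fakeFS{}
	md, err := metadataFor(fs, "gs://bucket/file.avro", SkipLastModified())
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// With the option set, the size is populated and no LastModified call is made.
	fmt.Println(md.Size, md.LastModified.IsZero(), fs.lastModifiedCalls)
}
```

Defaulting to the current behavior (fetch everything) and making the skip opt-in would keep `fileio.MatchContinuously` unaffected, since it would simply never pass the option.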
