I just wrote up a JIRA issue proposing that FileSystem implementations
retrieve the lastModified time of the files they list:
https://issues.apache.org/jira/browse/BEAM-5910

Any immediate concerns? I'm not intimately familiar with HDFS, but I'm
otherwise confident that GCS, S3, and local filesystems can all give us a
suitable timestamp.

In the short term, this change would allow users to write their own polling
logic on top of FileSystems to periodically check for updates to files.
Currently, you would need to fall back to the APIs for each individual
storage provider.
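
As a rough illustration of the kind of polling logic this would enable,
here is a minimal sketch. It is not Beam code: it assumes a local
filesystem and uses only the Python standard library (glob and os.stat)
as a stand-in for a FileSystems listing that includes lastModified. The
function name poll_for_updates and the last_seen dictionary are
hypothetical, chosen for this example only.

```python
import glob
import os


def poll_for_updates(pattern, last_seen):
    """Return paths matching `pattern` that are new or changed since the last poll.

    `last_seen` maps path -> modification time in nanoseconds. It is updated
    in place, so the caller passes the same dict on every poll.
    """
    changed = []
    for path in sorted(glob.glob(pattern)):
        # Local-filesystem stand-in for a per-file lastModified value.
        mtime_ns = os.stat(path).st_mtime_ns
        if last_seen.get(path) != mtime_ns:
            last_seen[path] = mtime_ns
            changed.append(path)
    return changed
```

A caller would invoke this in a loop with a sleep between polls and
process whatever paths come back. Comparing for inequality rather than
"newer than" also catches files whose timestamp moves backwards, for
example after being restored from a backup.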

Longer term, I'd love to see FileIO.match().continuously() support an option
for returning the updated contents when a file changes.
