Thanks to dylanhz for proposing this FLIP. It provides a solid foundation for reading, processing, and lakehousing multimodal data in Flink. Overall +1.
I have two suggestions:

1. Clarify the semantics and limitations of modification_time

The document excludes etag/version on the grounds that they are system-specific and not reliable across systems. However, modification_time faces a similar cross-system issue that should be acknowledged.

In practice, modification_time from storage systems (S3 LastModified, GCS updated, HDFS modificationTime) represents when the object was uploaded to that storage system, not when the content was originally created or last modified. When the same file is copied or synced across systems, each system assigns its own timestamp:

  Local disk:    modification_time = 2026-01-15 10:00 (content creation)
  S3:            LastModified      = 2026-06-10 15:42 (upload to S3)
  GCS (mirror):  updated           = 2026-06-12 09:30 (copy to GCS)

This means modification_time is not consistent across systems for the same content, and using it in cross-source equality/hashing (e.g., UNION ALL + DISTINCT across connectors) may produce surprising results.

2. Clarify READ_FILE behavior when the referenced content has changed

Since FILE is a reference (not a snapshot), the underlying content may change between when the FILE value is constructed and when READ_FILE is called. The document does not specify the expected behavior in this case: does READ_FILE silently return the new content, return null on metadata mismatch, or throw an exception? This matters for Flink's streaming semantics, where FILE values may survive in state/checkpoints for hours or days.

I'd suggest providing a configuration option to control this behavior (e.g., read current content by default, with optional null-on-mismatch or fail-on-mismatch modes).

dylanhz <[email protected]> wrote on Fri, Jun 12, 2026 at 09:51:

> Hi everyone,
>
> I would like to start a discussion on FLIP-589: Introduce FILE Type for
> Byte-Content References [1].
>
> This FLIP proposes to introduce FILE as a logical type for byte-content
> references. A FILE value describes where bytes can be read from, together
> with optional range and common metadata, but does not store the bytes
> themselves.
>
> The main motivation is multimodal processing. In pipelines that process
> images, audio, video, or documents, jobs often need to pass references and
> metadata through filtering, routing, joins, or inference preparation, while
> reading the actual content only at explicit processing stages. A dedicated
> FILE type provides a clearer contract than STRING URIs or ad-hoc ROW
> descriptors for SQL functions, connectors, planners, and UDFs.
>
> Looking forward to your feedback.
>
> [1]
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-589%3A+Introduce+FILE+Type+for+Byte-Content+References
>
> ----------
> Best regards,
> dylanhz
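
P.S. To make suggestion 2 above concrete, here is a minimal sketch of the three modes, written in Python only for brevity. Every name in it (FileRef, ObjectStat, MismatchMode, read_file, and the mode names) is hypothetical and not taken from the FLIP, and the stat_fn/read_fn callbacks are stand-ins for whatever the connector's file system actually offers.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Optional


class MismatchMode(Enum):
    # Hypothetical mode names; the FLIP does not define them.
    READ_CURRENT = "read-current"          # proposed default
    NULL_ON_MISMATCH = "null-on-mismatch"
    FAIL_ON_MISMATCH = "fail-on-mismatch"


@dataclass(frozen=True)
class FileRef:
    """Hypothetical stand-in for a FILE value: a URI plus optional metadata
    captured when the reference was constructed."""
    uri: str
    size: Optional[int] = None
    modification_time: Optional[int] = None  # epoch millis


@dataclass(frozen=True)
class ObjectStat:
    """Metadata the storage system reports at read time."""
    size: int
    modification_time: int


def has_changed(ref: FileRef, stat: ObjectStat) -> bool:
    # Compare only the fields the reference actually captured.
    if ref.size is not None and ref.size != stat.size:
        return True
    if (ref.modification_time is not None
            and ref.modification_time != stat.modification_time):
        return True
    return False


def read_file(
    ref: FileRef,
    mode: MismatchMode,
    stat_fn: Callable[[str], ObjectStat],
    read_fn: Callable[[str], bytes],
) -> Optional[bytes]:
    # READ_CURRENT skips the metadata check and returns whatever is there now.
    if mode is not MismatchMode.READ_CURRENT and has_changed(ref, stat_fn(ref.uri)):
        if mode is MismatchMode.NULL_ON_MISMATCH:
            return None
        raise RuntimeError(f"content changed since FILE was created: {ref.uri}")
    return read_fn(ref.uri)
```

Note that a stat followed by a read is not atomic, so a sketch like this can only detect changes that happened before the stat call. Whether a stronger guarantee is needed (and achievable per storage system) is probably worth a sentence in the FLIP as well.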
