Hey Matt,

For Flight: for DoGet/DoPut/DoExchange, you can accomplish this with the app_metadata fields built into these methods. For instance, in DoGet/DoExchange, you could send some batches of data, then send a message with only an app_metadata field encoding the watermark. (The app_metadata field is a byte blob, and you can impose structure on it with JSON or Protobuf or the scheme of your choice.) The app_metadata field can also be sent alongside a record batch in the same message.
For Arrow files: ARROW-16131 in Arrow 8.0.0 gives you the ability to add custom key-value metadata to individual batches, and Arrow already supported key-value metadata at the file level, so you could use either of these to record a watermark.

Do these seem workable? Regarding the roadmap, I don't think we've seen requests for this previously; what sort of support would help in the compute API?

-David

On Wed, Apr 27, 2022, at 13:02, Matt Rudary wrote:
> Hi,
>
> We're looking at using Arrow as part of our solution to ship tabular
> data between different streaming systems, potentially implemented using
> different technologies, like Spark, Beam, Flink, etc. Some of these
> systems contain "watermarks" as a key concept. Briefly, a watermark is
> a promise that a certain data source will not produce any more
> events/rows with a timestamp earlier than a given time. For example, if
> I produce a batch of rows every 5 minutes, after I've finished sending
> the 12:00 data, I would send a watermark update of 12:04:59, thus
> letting downstream consumers know that no future row from me will have
> a timestamp before 12:05.
>
> We would like to be able to propagate watermarks with our data, and I
> wondered if this list has any ideas of how to do this currently, or
> whether it is part of the roadmap for the Arrow compute api or similar.
> We'd like to be able to do this over Arrow Flight, but potentially also
> for other methods of shipping Arrow data, like pubsub feeds, file
> dumps, etc.
>
> Thanks
> Matt Rudary
> Two Sigma