Sorry for taking so long to reply.

Hi Aljoscha,

I think the Async I/O operator and batching serve the same purpose, and Async is the better interface. All I/O-related operations are probably better suited to asynchronous use. As you said, at the beginning there would be no special support from the runners.

I really like Luke's idea: let the user see a SeekableRead + side input interface, and have the runner layer optimize it into direct access to the external store. This requires a suitable SeekableRead interface and more effective compiler optimization.

Kenn's idea is exciting. If we can have an interface similar to FileSystem (maybe like SeekableRead) that abstracts and unifies access to multiple KV stores, users would see only Beam concepts rather than a specific KV store.

Best,
Jingsong Lee

------------------------------------------------------------------
From: Kenneth Knowles <[email protected]>
Time: 2017 Jul 7 (Fri) 11:43
To: dev <[email protected]>
Cc: JingsongLee <[email protected]>
Subject: Re: [PROPOSAL] External Join with KV Stores

In the streams/tables way of talking, side inputs are tables. External KV stores are basically also [globally windowed] tables. Both are time-varying.

I think it makes perfect sense to access an external KV store in userland directly rather than listen to its changelog and reproduce the same table as a multimap side input. I'm sure many users are already doing this. I'm sure users will always do this. Providing a common interface (simpler than Filesystem) and helpful transform(s) in an extension module seems nice. Does it require any support in the core SDK?

If I understand, Luke & Robert, you favor adding metadata to Read/SDF so that a user _does_ write it as a changelog listener that is observed as a multimap side input, and each runner optimizes it if they can to just directly access the KV store? A runner is free to use any kind of storage they like to materialize a side input anyhow, so this is surely possible, but it is a "sufficiently smart compiler" issue.

As for semantics, I'm not worried about availability - it is globally windowed and always available. But I think this requires retractions to be correctly equivalent to direct access.

I think we can have a userland PTransform in much less time than a model concept, so I favor it.

Kenn
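
For illustration only, the userland approach discussed in this thread (a small common lookup interface plus a helper that batches asynchronous lookups) might look roughly like the Python sketch below. Every name in it (`KVStore`, `InMemoryKVStore`, `external_join`, `run_example`) is hypothetical and not part of Beam. A real version would presumably wrap this logic in a PTransform and talk to an actual store client rather than an in-memory dict.

```python
import asyncio
from abc import ABC, abstractmethod
from typing import Any, Dict, Hashable, Iterable, List, Tuple


class KVStore(ABC):
    """Hypothetical minimal lookup interface (an assumption, not a Beam API)."""

    @abstractmethod
    async def get_all(self, keys: Iterable[Hashable]) -> Dict[Hashable, Any]:
        """Return values for the keys that exist; missing keys are omitted."""


class InMemoryKVStore(KVStore):
    """Stand-in for a real external store, used only to make the sketch runnable."""

    def __init__(self, data: Dict[Hashable, Any]) -> None:
        self._data = dict(data)
        self.calls = 0  # counts lookups so the effect of batching is visible

    async def get_all(self, keys: Iterable[Hashable]) -> Dict[Hashable, Any]:
        self.calls += 1
        await asyncio.sleep(0)  # placeholder for a network round trip
        return {k: self._data[k] for k in keys if k in self._data}


async def external_join(
    elements: Iterable[Tuple[Hashable, Any]],
    store: KVStore,
    batch_size: int = 2,
) -> List[Tuple[Hashable, Any, Any]]:
    """Left-join (key, value) elements against the store, one lookup per batch."""
    items = list(elements)
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    # Issue all batch lookups concurrently; gather preserves batch order.
    results = await asyncio.gather(
        *(store.get_all({key for key, _ in batch}) for batch in batches)
    )
    joined = []
    for batch, found in zip(batches, results):
        for key, value in batch:
            # Keys absent from the store join with None.
            joined.append((key, value, found.get(key)))
    return joined


def run_example():
    store = InMemoryKVStore({"a": 1, "b": 2})
    elements = [("a", "x"), ("b", "y"), ("c", "z")]
    return asyncio.run(external_join(elements, store)), store.calls
```

Because the helper depends only on the `KVStore` interface, swapping the store implementation would not change user code, which is the point of a common interface. Note that this sketch reads the store at lookup time and does nothing about retractions.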
