Hi

I am working on adding functionality to support querying Pubsub messages
directly from Beam SQL.

*Goal*
  Provide Beam users a pure  SQL solution to create the pipelines with
Pubsub as a data source, without the need to set up the pipelines in Java
before applying the query.

*High level approach*

   -
   - Build on top of PubsubIO;
   - Pubsub source will be declared using CREATE TABLE DDL statement:
      - Beam SQL already supports declaring sources like Kafka and Text
      using CREATE TABLE DDL;
      - it supports additional configuration using TBLPROPERTIES clause.
      Currently it takes a text blob, where we can put a JSON configuration;
      - wrapping PubsubIO into a similar source looks feasible;
   - The plan is to initially support messages only with JSON payload:
   -
      - more payload formats can be added later;
   - Messages will be fully described in the CREATE TABLE statements:
      - event timestamps. Source of the timestamp is configurable. It is
      required by Beam SQL to have an explicit timestamp column for windowing
      support;
      - messages attributes map;
      - JSON payload schema;
   - Event timestamps will be taken either from publish time or
   user-specified message attribute (configurable);

Thoughts, ideas, comments?

More details are in the doc here:
https://docs.google.com/document/d/1wIXTxh-nQ3u694XbF0iEZX_7-b3yi4ad0ML2pcAxYfE


Thank you,
Anton

Reply via email to