[PR] [Feature] Dynamic Table. [cloudberry]

via GitHub Thu, 21 Nov 2024 00:01:11 -0800


avamingli opened a new pull request, #725:
URL: https://github.com/apache/cloudberry/pull/725


   Dynamic Table is a an auto-refreshing materialized view which could be 
constructed by base tables, external tables, materialized views and dynamic 
tables.
   And it could be used to answer query by AQUMV.
   As normal tables in CBDB, dynamic tables could also have distribution keys.
   
   The purpose of Dynamic Tables is to solve the problem often raised by 
customers who are big fans of a lakehouse architecture: how can we run queries 
on external tables as fast as internal tables?
   
   See details in discussion #706 and more cases in tests.
   
   ### Create  Dynamic Table:
   
   ```sql
   CREATE DYNAMIC TABLE dt0 SCHEDULE '5 * * * *' AS
     SELECT a, b, sum(c) FROM t1 GROUP BY a, b WITH NO DATA DISTRIBUTED BY(b);
   CREATE DYNAMIC TABLE
   
   \d
                    List of relations
    Schema | Name |     Type      |  Owner  | Storage
   --------+------+---------------+---------+---------
    public | dt0  | dynamic table | gpadmin | heap
    public | t1   | table         | gpadmin | heap
   (2 rows)
   ```
   
   CREATE DYNAMIC TABLE xxx AS `Query`
   The `Query` allows any valid SELECT SQL of Materialized Views: from single 
or multiple relations,  base tables, materialized views,  and dynamic tables as 
well,  joins, subquery, aggregation, group by and etc.
   However, if you want to use it to Answer Query, that is limited by AQUMV: 
currently we allow Select from single base table, aggregation on it or 
aggregation SQL replace directly #705 
   
   #### SCHEDULE: 
   A string used to schedule background job which auto-refreshes the dynamic 
table.
   We follow the valid string of pg_cron extension which supports linux 
crontab, refer  https://crontab.guru/ .
   ```text
    ┌───────────── min (0 - 59)
    │ ┌────────────── hour (0 - 23)
    │ │ ┌─────────────── day of month (1 - 31) or last day of the month ($)
    │ │ │ ┌──────────────── month (1 - 12)
    │ │ │ │ ┌───────────────── day of week (0 - 6) (0 to 6 are Sunday to
    │ │ │ │ │                  Saturday, or use names; 7 is also Sunday)
    │ │ │ │ │
    │ │ │ │ │
    * * * * *
   ```
   
   You can also use '[1-59] seconds' to schedule a job based on an interval.
   The example creates a cron job refreshing the dynamic table at minute 5 of 
each hour.
   
   User don't need to consider the auto-refresh job, however query on pg_task 
catalog if we want to see the task:
   ```sql
   SELECT * FROM pg_task WHERE jobname LIKE 'gp_dynamic_table_refresh%' AND 
command LIKE '%dt0';
    jobid | schedule  |             command              | nodename  | nodeport 
| database | username | active |
      jobname
   
-------+-----------+----------------------------------+-----------+----------+----------+----------+--------+---------
   -----------------------
    17398 | 5 * * * * | REFRESH DYNAMIC TABLE public.dt0 | 127.0.0.1 |     9000 
| gpadmin  | gpadmin  | t      | gp_dynam
   ic_table_refresh_17394
   (1 row)
   ```
   As Snowflake, Dynamic Tables should always have a auto-refresh process.
   However, for convenience, I make SCHEDULE optional. If user didn't specific 
it, a default schedule is provided: maybe at every 5th minute(snowflake limit 
at most 5 minutes for dynamic table auto-refresh, not sure)?
   
   #### WITH NO DATA: 
   Same as Materialized View, will create an empty Dynamic Table if specified.
   
   #### DISTRIBUTED BY:
   Same as normal tables in CBDB, Dynamic Tables could support distribution 
keys as materialized views.
   Use \d+ to see the distribution keys and the Query SQL of Dynamic Tables.
   ```sql
   \d+ dt0;
                                          Dynamic table "public.dt0"
    Column |  Type   | Collation | Nullable | Default | Storage | Compression | 
Stats target | Description
   
--------+---------+-----------+----------+---------+---------+-------------+--------------+-------------
    a      | integer |           |          |         | plain   |             | 
             |
    b      | integer |           |          |         | plain   |             | 
             |
    sum    | bigint  |           |          |         | plain   |             | 
             |
   View definition:
    SELECT t1.a,
       t1.b,
       sum(t1.c) AS sum
      FROM t1
     GROUP BY t1.a, t1.b;
   Distributed by: (b)
   Access method: heap
   ```
   #### Refresh Dynamic Table
   As seen in pg_task, we put a command to auto-refresh dynamic tables. 
   However, if users want to do a REFRESH manually, exec command `REFRESH 
DYNAMIC TABLE` is also supported.
   ```sql
   REFRESH DYNAMIC TABLE dt0;
   REFRESH DYNAMIC TABLE
   ```
   #### REFRESH WITH NO DATA;
   Same as Materialized Views, Refresh with no data will truncate the Dynamic 
Table and make it unpopulated status.
   ```sql
   REFRESH DYNAMIC TABLE dt0 WITH NO DATA;
   REFRESH DYNAMIC TABLE
   ```
   
   #### Drop Dynamic Table:
   ```sql
   DROP DYNAMIC TABLE dt0;
   DROP DYNAMIC TABLE
   ```
   Drop a Dynamic Table will drop its scheduler job automatically.
   ```sql
   SELECT * FROM pg_task WHERE jobname LIKE 'gp_dynamic_table_refresh%' AND 
command LIKE '%dt0';
    jobid | schedule | command | nodename | nodeport | database | username | 
active | jobname
   
-------+----------+---------+----------+----------+----------+----------+--------+---------
   (0 rows)
   ```
   ## Privileges
   Same as Materialized Views in CBDB:
   ```sql
   \z
                                   Access privileges
    Schema | Name |     Type      | Access privileges | Column privileges | 
Policies
   
--------+------+---------------+-------------------+-------------------+----------
    public | dt1  | dynamic table |                   |                   |
    public | t1   | table         |                   |                   |
   (2 rows)
   ```
   
   ## Use Dynamic Tables to answer query
   
   Like Materialized Views, Dynamic Tables could be  used to answer query too:
   ```sql
   CREATE DYNAMIC TABLE dt1  AS
     SELECT * FROM t1 WHERE a = 1 DISTRIBUTED BY(b);
   ANALYZE dt1;
   SELECT 2
   ```
   ```sql
   SET enable_answer_query_using_materialized_views = ON;
   EXPLAIN(COSTS OFF, VERBOSE)
   SELECT * FROM t1 WHERE a = 1;
   SELECT * FROM t1 WHERE a = 1;
   SET
                                       QUERY PLAN
   
----------------------------------------------------------------------------------
    Gather Motion 3:1  (slice1; segments: 3)
      Output: a, b, c
      ->  Seq Scan on public.dt1
            Output: a, b, c
    Settings: enable_answer_query_using_materialized_views = 'on', optimizer = 
'off'
    Optimizer: Postgres query optimizer
   (6 rows)
   
    a | b | c
   ---+---+---
    1 | 2 | 3
    1 | 2 | 3
   (2 rows)
   ```
   
   Authored-by: Zhang Mingli [email protected]
   
   <!-- Thank you for your contribution to Apache Cloudberry (incubating)! -->
   
   Fixes #ISSUE_Number
   
   ### What does this PR do?
   <!-- Brief overview of the changes, including any major features or fixes -->
   
   ### Type of Change
   - [ ] Bug fix (non-breaking change)
   - [ ] New feature (non-breaking change)
   - [ ] Breaking change (fix or feature with breaking changes)
   - [ ] Documentation update
   
   ### Breaking Changes
   <!-- Remove if not applicable. If yes, explain impact and migration path -->
   
   ### Test Plan
   <!-- How did you test these changes? -->
   - [ ] Unit tests added/updated
   - [ ] Integration tests added/updated
   - [ ] Passed `make installcheck`
   - [ ] Passed `make -C src/test installcheck-cbdb-parallel`
   
   ### Impact
   <!-- Remove sections that don't apply -->
   **Performance:**
   <!-- Any performance implications? -->
   
   **User-facing changes:**
   <!-- Any changes visible to users? -->
   
   **Dependencies:**
   <!-- New dependencies or version changes? -->
   
   ### Checklist
   - [ ] Followed [contribution 
guide](https://cloudberry.apache.org/contribute/code)
   - [ ] Added/updated documentation
   - [ ] Reviewed code for security implications
   - [ ] Requested review from [cloudberry 
committers](https://github.com/orgs/apache/teams/cloudberry-committers)
   
   ### Additional Context
   <!-- Any other information that would help reviewers? Remove if none -->
   
   ⚠️ **To skip CI:** Add `[skip ci]` to your PR title. Only use when 
necessary! ⚠️
   
   ---
   <!-- Join our community:
   - Mailing list: 
[[email protected]](https://lists.apache.org/[email protected])
 (subscribe: [email protected])
   - Discussions: https://github.com/apache/cloudberry/discussions -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[PR] [Feature] Dynamic Table. [cloudberry]

Reply via email to