Hi all,
After some experimentation, we found no problems with keeping the
dynamic storage outside of Flink, and doing so also let us design the
interface in more depth.
What do you think? If there are no objections, I would like to ask the
PMC for help: we want to propose flink-dynamic-storage as a Flink
subproject and develop it under Apache.
Best,
Jingsong
On Wed, Nov 24, 2021 at 8:10 PM Jingsong Li <jingsongl...@gmail.com>
wrote:
Hi Stephan,
Thanks for your reply.
Data never expires automatically.
If there is a need for data retention, the user can choose one of the
following options:
- In the SQL that queries the managed table, users filter out expired
data themselves.
- Define a time partition, so that users can delete expired partitions
themselves (DROP PARTITION ...).
- In a future version, we will support the "DELETE FROM" statement, so
that users can delete expired data by condition.
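As a rough sketch, the three options above could look like the
following. The table name the_table and the column window_end are taken
from the query under discussion; the cnt column and the dt partition
column are assumptions made only for illustration. The exact DROP
PARTITION syntax depends on the Flink version and dialect, and DELETE
FROM is not supported yet.

```sql
-- Option 1: filter out expired rows in the query that reads the table.
SELECT window_end, cnt            -- cnt is a hypothetical column name
FROM the_table
WHERE window_end >= CURRENT_TIMESTAMP - INTERVAL '14' DAY;

-- Option 2: partition by time and drop expired partitions manually.
-- dt is a hypothetical partition column.
ALTER TABLE the_table DROP PARTITION (dt = '2021-11-01');

-- Option 3 (planned, not available yet): delete expired rows by condition.
DELETE FROM the_table
WHERE window_end < CURRENT_TIMESTAMP - INTERVAL '14' DAY;
```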
So to answer your question:
> Will the VMQ send retractions so that the data will be removed from
> the table (via compactions)?
The current implementation does not send retractions, although I think
in theory it should. For now, users can filter out expired data with
conditions in their downstream queries.
And yes, a subscriber would not see a strictly correct result. I
think this is something we can improve in Flink SQL.
> Do we want time retention semantics handled by the compaction?
Currently, no. Data never expires automatically.
> Do we want to declare those types of queries "out of scope" initially?
I think we want users to be able to use the three options above to
meet their requirements.
I will update the FLIP to make the definition clearer and more explicit.
Best,
Jingsong
On Wed, Nov 24, 2021 at 5:01 AM Stephan Ewen <ewenstep...@gmail.com>
wrote:
Thanks for digging into this.
Regarding this query:
INSERT INTO the_table
SELECT window_end, COUNT(*)
FROM TABLE(TUMBLE(TABLE interactions, DESCRIPTOR(ts), INTERVAL '5' MINUTES))
GROUP BY window_end
HAVING now() - window_end <= INTERVAL '14' DAYS;
I am not sure I understand what the conclusion is on the data
retention question, where the continuous streaming SQL query has retention
semantics. I think we would need to answer the following questions (I will
call the query that computes the managed table the "view materializer
query" - VMQ).
(1) I guess the VMQ will send no updates for windows once the
"retention period" (14 days) is over, as you said. That makes sense.
(2) Will the VMQ send retractions so that the data will be removed
from the table (via compactions)?
- if yes, this seems semantically better for users, but it will be
expensive to keep the timers for retractions.
- if not, we can still solve this by adding filters to queries
against the managed table, as long as these queries are in Flink.
- any subscriber to the changelog stream would not see a strictly
correct result if we are not doing the retractions
(3) Do we want time retention semantics handled by the compaction?
- if we say that we lazily apply the deletes in the queries that
read the managed tables, then we could also age out the old data during
compaction.
- that is cheap, but it might be too much of a special case to be
very relevant here.
(4) Do we want to declare those types of queries "out of scope"
initially?
- if yes, how many users are we affecting? (I guess probably not
many, but would be good to hear some thoughts from others on this)
- should we simply reject such queries in the optimizer as "not
possible to support in managed tables"? I would suggest that: it is always
better to tell users exactly what works and what does not, rather than
letting them be surprised in the end. Users can still remove the HAVING
clause if they want the query to run, and that would be better than the VMQ
silently ignoring those semantics.
Thanks,
Stephan
--
Best, Jingsong Lee