+1 for a separate repository. And also +1 for finding a good name.

`flink-warehouse` would be definitely a good marketing name but I agree that we should not start marketing for code bases. Are we planning to make this storage also available to DataStream API users? If not, I would also vote for `flink-managed-table` or better: `flink-table-store`

Thanks,
Timo



On 29.12.21 07:58, Jingsong Li wrote:
Thanks Till for your suggestions.

Personally, I like flink-warehouse, this is what we want to convey to
the user, but it indicates a bit too much scope.

How about just calling it flink-store?
Simply to convey an impression: this is flink's store project,
providing a built-in store for the flink compute engine, which can be
used by flink-table as well as flink-datastream.

Best,
Jingsong

On Tue, Dec 28, 2021 at 5:15 PM Till Rohrmann <trohrm...@apache.org> wrote:

Hi Jingsong,

I think that developing flink-dynamic-storage as a separate sub project is
a very good idea since it allows us to move a lot faster and decouple
releases from Flink. Hence big +1.

Do we want to name it flink-dynamic-storage or shall we use a more
descriptive name? dynamic-storage sounds a bit generic to me and I wouldn't
know that this has something to do with letting Flink manage your tables
and their storage. I don't have a very good idea but maybe we can call it
flink-managed-tables, flink-warehouse, flink-olap or so.

Cheers,
Till

On Tue, Dec 28, 2021 at 9:49 AM Martijn Visser <mart...@ververica.com>
wrote:

Hi Jingsong,

That sounds promising! +1 from my side to continue development under
flink-dynamic-storage as a Flink subproject. I think having a more in-depth
interface will benefit everyone.

Best regards,

Martijn

On Tue, 28 Dec 2021 at 04:23, Jingsong Li <jingsongl...@gmail.com> wrote:

Hi all,

After some experimentation, we felt no problem putting the dynamic
storage outside of flink, and it also allowed us to design the
interface in more depth.

What do you think? If there is no problem, I am asking for PMC's help
here: we want to propose flink-dynamic-storage as a flink subproject,
and we want to build the project under apache.

Best,
Jingsong


On Wed, Nov 24, 2021 at 8:10 PM Jingsong Li <jingsongl...@gmail.com>
wrote:

Hi Stephan,

Thanks for your reply.

Data never expires automatically.

If there is a need for data retention, the user can choose one of the
following options:
- In the SQL for querying the managed table, users filter the data by
themselves
- Define the time partition, and users can delete the expired
partition by themselves. (DROP PARTITION ...)
- In the future version, we will support the "DELETE FROM" statement,
users can delete the expired data according to the conditions.

So to answer your question:

Will the VMQ send retractions so that the data will be removed from
the table (via compactions)?

The current implementation is not sending retraction, which I think
theoretically should be sent, currently the user can filter by
subsequent conditions.
And yes, the subscriber would not see strictly a correct result. I
think this is something we can improve for Flink SQL.

Do we want time retention semantics handled by the compaction?

Currently, no, Data never expires automatically.

Do we want to declare those types of queries "out of scope" initially?

I think we want users to be able to use three options above to
accomplish their requirements.

I will update FLIP to make the definition clearer and more explicit.

Best,
Jingsong

On Wed, Nov 24, 2021 at 5:01 AM Stephan Ewen <ewenstep...@gmail.com>
wrote:

Thanks for digging into this.
Regarding this query:

INSERT INTO the_table
   SELECT window_end, COUNT(*)
     FROM (TUMBLE(TABLE interactions, DESCRIPTOR(ts), INTERVAL '5'
MINUTES))
GROUP BY window_end
   HAVING now() - window_end <= INTERVAL '14' DAYS;

I am not sure I understand what the conclusion is on the data
retention question, where the continuous streaming SQL query has retention
semantics. I think we would need to answer the following questions (I will
call the query that computed the managed table the "view materializer
query" - VMQ).

(1) I guess the VMQ will send no updates for windows beyond the
"retention period" is over (14 days), as you said. That makes sense.

(2) Will the VMQ send retractions so that the data will be removed
from the table (via compactions)?
   - if yes, this seems semantically better for users, but it will be
expensive to keep the timers for retractions.
   - if not, we can still solve this by adding filters to queries
against the managed table, as long as these queries are in Flink.
   - any subscriber to the changelog stream would not see strictly a
correct result if we are not doing the retractions

(3) Do we want time retention semantics handled by the compaction?
   - if we say that we lazily apply the deletes in the queries that
read the managed tables, then we could also age out the old data during
compaction.
   - that is cheap, but it might be too much of a special case to be
very relevant here.

(4) Do we want to declare those types of queries "out of scope"
initially?
   - if yes, how many users are we affecting? (I guess probably not
many, but would be good to hear some thoughts from others on this)
   - should we simply reject such queries in the optimizer as "not
possible to support in managed tables"? I would suggest that, always better
to tell users exactly what works and what not, rather than letting them be
surprised in the end. Users can still remove the HAVING clause if they want
the query to run, and that would be better than if the VMQ just silently
ignores those semantics.

Thanks,
Stephan



--
Best, Jingsong Lee



--
Best, Jingsong Lee






Reply via email to