A reminder for tomorrow's community sync for the MV spec. https://calendar.app.google/T4zSk6qKWoy1vV6P7
We have one open question from the last meeting on how `max-stalenesss-ms` should be interpreted. You can find more details in the meeting notes. https://docs.google.com/document/d/1EVCM-hKr5tY33t0Yzq37cAXSPncySc6Ghke7OZEcqXU/edit?tab=t.0#heading=h.75r8e0rwq02o Please also bring other topics that we should discuss. On Sat, Nov 1, 2025 at 10:14 PM Steven Wu <[email protected]> wrote: > Sorry for the delay. Here are the recording and meeting notes for the MV > sync meeting on Wednesday, Oct 29. > > https://docs.google.com/document/d/1EVCM-hKr5tY33t0Yzq37cAXSPncySc6Ghke7OZEcqXU/edit?tab=t.0#heading=h.75r8e0rwq02o > > We have started to collect them in the above google doc. > > On Mon, Oct 27, 2025 at 8:58 AM Péter Váry <[email protected]> > wrote: > >> If we have materialized views (MVs) and support for incremental change >> scans, then by introducing a Java-based representation of the view, we can >> expose a scan API that always returns up-to-date results for the MV. >> >> The scan could include multiple tasks: >> >> - A task for reading the current version of the MV. >> - An incremental change log scan covering the range between the >> snapshot ID of the source table at the time the MV was last refreshed and >> its current snapshot ID. Applying the Java representation of the view when >> transformations are required. >> >> This approach allows us to build an always up-to-date index table/single >> source MV, using existing components. >> >> Benny Chow <[email protected]> ezt írta (időpont: 2025. okt. 24., P, >> 7:44): >> >>> Hi Peter >>> >>> I think the current proposal would support your example. In most >>> situations, replace table operations after a view is materialized wouldn’t >>> invalidate the materialization. However, if the view includes metadata >>> columns, then the replace operations should invalidate the materialization. >>> >>> >>> This also brings up another important point that engines will differ on >>> what views can be materialized or not. For example, maybe metadata columns >>> are not allowed similar to non deterministic functions like random. But >>> some engines like Dremio may allow views that use current date functions. >>> It should be possible for one engine to materialize a view and another >>> engine to look at the query tree and decide it’s not a view it supports >>> materializations on and choose not to use that materialization. >>> >>> Thanks >>> Benny >>> >>> >>> >>> On Oct 23, 2025, at 8:44 AM, Péter Váry <[email protected]> >>> wrote: >>> >>> >>> Hi All, >>> >>> I’ve been catching up on the discussion and wanted to share an >>> observation. One aspect that stands out to me in the proposed staleness >>> evaluation logic is that snapshots which don’t modify data can still affect >>> the view’s contents if the view includes metadata columns. >>> >>> I was considering using a materialized view as an index for a given >>> table to accelerate the conversion of equality deletes to position deletes. >>> For example, the query might look like: >>> >>> *SELECT _POS, _FILE, id FROM target_table* >>> >>> >>> During compaction, the materialized view would need to be refreshed to >>> ensure it reflects the correct data. >>> >>> Does this seem like a valid use case? Or should we explicitly exclude >>> scenarios like this? >>> >>> Thanks, >>> Peter >>> >>> Steven Wu <[email protected]> ezt írta (időpont: 2025. okt. 20., H, >>> 17:30): >>> >>>> Walaa, >>>> >>>> > while Option 2 is described in your summary as "giving engines >>>> *flexibility* to determine freshness recursively beyond a source MV", >>>> that *isn’t achievable* under the MV evaluation model itself. >>>> Because each MV treats upstream MVs as physical tables, recursion stops >>>> at the first materialized boundary; *deeper staleness cannot be >>>> discovered without switching to a logical-view evaluation model, i.e., >>>> stepping outside the MV model altogether (note that in Option 3 we can >>>> determine recursive staleness while still inside the MV model).* >>>> >>>> In option 2, when determining the freshness of mv_3, engines can choose >>>> to recursively evaluate the freshness of mv_1 and mv_2 since they are also >>>> MVs. But engines can also choose not to. >>>> >>>> > This means that there seems to be an implicit “Option 3”. This option >>>> treats MVs as logical views, i.e., storing only view versions + base table >>>> snapshot IDs (no MV storage snapshot IDs, no per-path lineage). >>>> >>>> In the new option 3 you described, how could the engine update mv3's >>>> refresh state for base table_a and table_b? unless all connected MVs are >>>> refreshed and committed in one single transaction, one entry per base table >>>> doesn't seem feasible. That's the main reason for option 1 to require the >>>> lineage path information in refresh state for base tables. >>>> >>>> It also seems that option 3 can only interpret freshness recursively, >>>> while today there are engines that support MVs without recursively >>>> evaluating source MVs. >>>> >>>> Thanks, >>>> Steven >>>> >>>> >>>> On Mon, Oct 20, 2025 at 1:44 AM Walaa Eldin Moustafa < >>>> [email protected]> wrote: >>>> >>>>> Hi Steven, >>>>> >>>>> Thanks for organizing the series and summarizing the outcome. >>>>> >>>>> After re-reading the Option 1/2 proposal, initially I interpreted >>>>> Option 1 as simply expanding MVs like regular logical views. On closer >>>>> look, it is actually more complex. It also preserves per-path lineage >>>>> state >>>>> (e.g., multiple entries for the same base table via different parents), >>>>> which increases expressiveness but significantly increases metadata >>>>> complexity. So I agree it is not a practical option. >>>>> >>>>> This means that there seems to be an implicit “Option 3”. This option >>>>> treats MVs as logical views, i.e., storing only view versions + base table >>>>> snapshot IDs (no MV storage snapshot IDs, no per-path lineage). Under this >>>>> model, mv_3’s metadata might look like: >>>>> >>>>> Type Name Tracked State >>>>> ----- ------- ----------------------- >>>>> view mv_1 view_version_id >>>>> view mv_2 view_version_id >>>>> table table_a table_snapshot_id >>>>> table table_b table_snapshot_id >>>>> >>>>> This preserves logical semantics and aligns MV behavior with pure >>>>> views. >>>>> >>>>> *If we choose Option 2 (treat source MV as a materialized table), we >>>>> may have to be consider those constraints:* >>>>> >>>>> * Staleness only degrades up the chain. mv_1 and mv_2 may already be >>>>> stale relative to the base tables, but if mv_3 is refreshed using their >>>>> storage snapshots, then mv_3 will be marked as fresh under Option 2, even >>>>> though all three MVs are stale relative to the base tables. >>>>> >>>>> * Engines can no longer discover staleness beyond mv_1. Once mv_3 sees >>>>> mv_1 (or mv_2) as fresh based only on their storage snapshots, it will not >>>>> expand into mv_1 or mv_2 to check whether they are stale relative to the >>>>> base tables. >>>>> >>>>> * If mv_2 and mv_3 were purely logical views instead of MVs, they >>>>> would evaluate directly against base tables and return newer data. Under >>>>> Option 2, the same definitions but materialized upstream produce different >>>>> data, not just different metadata. >>>>> >>>>> Therefore, while Option 2 is described in your summary as "giving >>>>> engines *flexibility* to determine freshness recursively beyond a >>>>> source MV", that *isn’t achievable* under the MV evaluation model >>>>> itself. >>>>> Because each MV treats upstream MVs as physical tables, recursion >>>>> stops at the first materialized boundary; *deeper staleness cannot be >>>>> discovered without switching to a logical-view evaluation model, i.e., >>>>> stepping outside the MV model altogether (note that in Option 3 we can >>>>> determine recursive staleness while still inside the MV model).* >>>>> >>>>> Let me know your thoughts. I slightly prefer Option 3. I’m also fine >>>>> with Option 2, but I don’t think the flexibility to recursively determine >>>>> freshness actually exists under its evaluation model. Not sure if this >>>>> changes anyone’s view, but I wanted to clarify how I’m reading it. >>>>> >>>>> Thanks, >>>>> Walaa. >>>>> >>>>> >>>>> On Wed, Oct 8, 2025 at 11:11 PM Benny Chow <[email protected]> wrote: >>>>> >>>>>> I just listened to the recording. I'm the tech lead for MVs at >>>>>> Dremio and responsible for both refresh management and query rewrites >>>>>> with >>>>>> MVs. >>>>>> >>>>>> It's great that we seem to agree that Iceberg MV spec won't require >>>>>> that MVs always be up to date in order to be usable for query rewrites. >>>>>> There can be many data consistency issues (as Dan pointed out) but that >>>>>> is >>>>>> the state of affairs today. >>>>>> >>>>>> It sounds like we are converging on the following scenarios for an >>>>>> engine to validate the MV freshness: >>>>>> >>>>>> 1. Use storage table without any validation. This might be the >>>>>> extreme "async MV" example. >>>>>> 2. Ignore storage table even if one exists because SQL command or >>>>>> use case requires that. >>>>>> 3. Use storage table only if data is not more than x hours old. >>>>>> This can be achieved with the proposed refresh-start-timestamp-ms which >>>>>> is currently in the proposed spec. For this to work with MVs built on >>>>>> MVs, >>>>>> we should probably state in the spec that if a MV is built on another MV, >>>>>> then it needs to inherit the refresh-start-timestamp-ms of the child MV. >>>>>> In Steven's example, when building mv3, refresh-start-timestamp-ms needs >>>>>> to >>>>>> be set to the minimum of mv1 or mv2's refresh-start-timestamp-ms. If >>>>>> this >>>>>> property name is confusing, we can rename it to >>>>>> "refresh-earliest-table-timestamp-ms". I originally proposed this >>>>>> property >>>>>> and also listed out other benefits here: >>>>>> https://github.com/apache/iceberg/pull/11041#discussion_r1779797796 >>>>>> Also, at the time, MVs built on MVs weren't being considered. Now that >>>>>> it >>>>>> is, I would recommend we have both "refresh-start-timestamp-ms" (when the >>>>>> refresh was started on the storage table) and >>>>>> "refresh-earliest-table-timestamp-ms" (used for freshness validation). >>>>>> 4. Don't use the storage table if it is older than X hours. This is >>>>>> what I had originally proposed for the >>>>>> *materialization.max-stalessness-ms* view property here: >>>>>> https://github.com/apache/iceberg/pull/11041#discussion_r1744837644 >>>>>> It wasn't meant to validate the freshness but more to prevent use of a >>>>>> materialization after some criteria. >>>>>> 5. Use storage table if recursive validation passes... i.e. >>>>>> refresh-state matches the current expanded query tree state. This is >>>>>> what >>>>>> I think Steven is calling the "synchronous MV". >>>>>> >>>>>> For scenario 1-4, it would support the nice use case of an Iceberg >>>>>> client using a view's data through the storage table without needing to >>>>>> know how to parse/validate/expand any view SQLs. >>>>>> >>>>>> In Dremio's planner, we primarily use scenario 1 and 4 together to >>>>>> determine MV validity for query rewrite. Scenario 2 and 5 also apply in >>>>>> certain situations. For scenario 3, Dremio only exposes the >>>>>> "refresh-earliest-table-timestamp-ms" as an fyi to the user but it would >>>>>> be >>>>>> interesting to allow the user to set this time so that they could run >>>>>> queries and be 100% certain that they were not seeing data older than x >>>>>> hours. >>>>>> >>>>>> Thanks >>>>>> Benny >>>>>> >>>>>> On Wed, Oct 8, 2025 at 3:37 PM Steven Wu <[email protected]> >>>>>> wrote: >>>>>> >>>>>>> correction for a typo. >>>>>>> >>>>>>> Prashanth brought up another scenario of compaction/rewrite where a >>>>>>> new snapshot was added *with* actual data change >>>>>>> --> >>>>>>> Prashanth brought up another scenario of compaction/rewrite where a >>>>>>> new snapshot was added *without* actual data change >>>>>>> >>>>>>> >>>>>>> On Wed, Oct 8, 2025 at 2:12 PM Steven Wu <[email protected]> >>>>>>> wrote: >>>>>>> >>>>>>>> Hi, >>>>>>>> >>>>>>>> Thanks everyone for joining the MV discussion meeting. We will >>>>>>>> continue to have the recurring sync meeting on Wednesday 9 am (Pacific) >>>>>>>> every 3 weeks until we get to the finish line where Jan's MV spec PR >>>>>>>> [1] is >>>>>>>> merged. I have scheduled our next meeting on Oct 29 in the Iceberg dev >>>>>>>> events calendar. >>>>>>>> >>>>>>>> Here is the video recording for today's meeting. >>>>>>>> >>>>>>>> https://drive.google.com/file/d/1-nfhBPDWLoAFDu5cKP0rwLd_30HB6byR/view?usp=sharing >>>>>>>> >>>>>>>> We mostly discussed freshness evaluation. Here is the meeting >>>>>>>> summary. >>>>>>>> >>>>>>>> 1. For tracking the refresh state for the source MV [2], the >>>>>>>> consensus is option 2 (treating source MV as a materialized table) >>>>>>>> which >>>>>>>> would give engines the flexibility on freshness determination >>>>>>>> (recursive >>>>>>>> beyond source MV or not). >>>>>>>> 2. Earlier design doc [3] discussed max staleness config. But >>>>>>>> it wasn't reflected in the spec PR. The general opinion is to add >>>>>>>> the >>>>>>>> config to the spec PR. The open question is whether the ` >>>>>>>> materialization.max-staleness-ms` config should be added to the >>>>>>>> view metadata or the storage table metadata. Either can work. We >>>>>>>> just need >>>>>>>> to decide which makes a little better fit. >>>>>>>> 3. Prashanth brought up schema change with default value and >>>>>>>> how it may affect the MV refresh state (for SQL representation with >>>>>>>> select >>>>>>>> *). Jan mentioned that snapshot contains schema id when the >>>>>>>> snapshot was >>>>>>>> created. Engine can compare the snapshot schema id to the source >>>>>>>> table >>>>>>>> schema id during freshness evaluation. There is no need for >>>>>>>> additional >>>>>>>> schema info in refresh-state tracking in the storage table. >>>>>>>> 4. Prashanth brought up another scenario of compaction/rewrite >>>>>>>> where a new snapshot was added with actual data change. The general >>>>>>>> take is >>>>>>>> that the engine can optimize and decide that MV is fresh as the new >>>>>>>> snapshot doesn't have any data change. >>>>>>>> >>>>>>>> >>>>>>>> We can add some clarifications in the spec PR for freshness >>>>>>>> evaluation based on the above discussions. >>>>>>>> >>>>>>>> [1] https://github.com/apache/iceberg/pull/11041 >>>>>>>> [2] >>>>>>>> https://docs.google.com/document/d/1_StBW5hCQhumhIvgbdsHjyW0ED3dWMkjtNzyPp9Sfr8/edit?tab=t.0 >>>>>>>> [3] >>>>>>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?tab=t.0#heading=h.3wigecex0zls >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> >>>>>>>> On Thu, Sep 25, 2025 at 9:27 AM Steven Wu <[email protected]> >>>>>>>> wrote: >>>>>>>> >>>>>>>>> Hi all, >>>>>>>>> >>>>>>>>> Iceberg materialized view has been discussed in the community for >>>>>>>>> a long time. Thanks Jan Kaul for driving the discussion and the spec >>>>>>>>> PR. It >>>>>>>>> has been stalled for a long time due to lack of consensus on 1 or 2 >>>>>>>>> topics. >>>>>>>>> In Wed's Iceberg community sync meeting, Talat brought up the >>>>>>>>> question on >>>>>>>>> how to move forward and if we can have a dedicated meeting for MV. >>>>>>>>> >>>>>>>>> I have set up a meeting on *Oct 8 (9-10 am Pacific)*. If you >>>>>>>>> subscribe to the "Iceberg Dev Events" calendar, you should be >>>>>>>>> able to see it. If not, here is the link: >>>>>>>>> https://meet.google.com/nfe-guyq-pqf >>>>>>>>> >>>>>>>>> We are going to discuss >>>>>>>>> * remaining open questions >>>>>>>>> * unresolved concerns >>>>>>>>> * the next step and hopefully some consensus on moving forward >>>>>>>>> >>>>>>>>> MV spec PR is up to date. Jan has incorporated recent feedback. >>>>>>>>> This should be the base of the discussion. >>>>>>>>> https://github.com/apache/iceberg/pull/11041 >>>>>>>>> <https://www.google.com/url?q=https://github.com/apache/iceberg/pull/11041&sa=D&source=calendar&usd=2&usg=AOvVaw3w0TjRpwbC17AGzmxZmElM> >>>>>>>>> >>>>>>>>> Dev discussion thread (a long-running thread started by Jan). >>>>>>>>> https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf >>>>>>>>> <https://www.google.com/url?q=https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf&sa=D&source=calendar&usd=2&usg=AOvVaw0fotlsrnRBOb820mA5JRyB> >>>>>>>>> >>>>>>>>> The mail archive has broken lineage and doesn't show all replies. >>>>>>>>> Email subject is "*[DISCUSS] Iceberg Materialzied Views*". >>>>>>>>> >>>>>>>>> Thanks, >>>>>>>>> Steven >>>>>>>>> >>>>>>>>>
