Walaa, > while Option 2 is described in your summary as "giving engines *flexibility* to determine freshness recursively beyond a source MV", that *isn’t achievable* under the MV evaluation model itself. Because each MV treats upstream MVs as physical tables, recursion stops at the first materialized boundary; *deeper staleness cannot be discovered without switching to a logical-view evaluation model, i.e., stepping outside the MV model altogether (note that in Option 3 we can determine recursive staleness while still inside the MV model).*
In option 2, when determining the freshness of mv_3, engines can choose to recursively evaluate the freshness of mv_1 and mv_2 since they are also MVs. But engines can also choose not to. > This means that there seems to be an implicit “Option 3”. This option treats MVs as logical views, i.e., storing only view versions + base table snapshot IDs (no MV storage snapshot IDs, no per-path lineage). In the new option 3 you described, how could the engine update mv3's refresh state for base table_a and table_b? unless all connected MVs are refreshed and committed in one single transaction, one entry per base table doesn't seem feasible. That's the main reason for option 1 to require the lineage path information in refresh state for base tables. It also seems that option 3 can only interpret freshness recursively, while today there are engines that support MVs without recursively evaluating source MVs. Thanks, Steven On Mon, Oct 20, 2025 at 1:44 AM Walaa Eldin Moustafa <[email protected]> wrote: > Hi Steven, > > Thanks for organizing the series and summarizing the outcome. > > After re-reading the Option 1/2 proposal, initially I interpreted Option 1 > as simply expanding MVs like regular logical views. On closer look, it is > actually more complex. It also preserves per-path lineage state (e.g., > multiple entries for the same base table via different parents), which > increases expressiveness but significantly increases metadata complexity. > So I agree it is not a practical option. > > This means that there seems to be an implicit “Option 3”. This option > treats MVs as logical views, i.e., storing only view versions + base table > snapshot IDs (no MV storage snapshot IDs, no per-path lineage). Under this > model, mv_3’s metadata might look like: > > Type Name Tracked State > ----- ------- ----------------------- > view mv_1 view_version_id > view mv_2 view_version_id > table table_a table_snapshot_id > table table_b table_snapshot_id > > This preserves logical semantics and aligns MV behavior with pure views. > > *If we choose Option 2 (treat source MV as a materialized table), we may > have to be consider those constraints:* > > * Staleness only degrades up the chain. mv_1 and mv_2 may already be stale > relative to the base tables, but if mv_3 is refreshed using their storage > snapshots, then mv_3 will be marked as fresh under Option 2, even though > all three MVs are stale relative to the base tables. > > * Engines can no longer discover staleness beyond mv_1. Once mv_3 sees > mv_1 (or mv_2) as fresh based only on their storage snapshots, it will not > expand into mv_1 or mv_2 to check whether they are stale relative to the > base tables. > > * If mv_2 and mv_3 were purely logical views instead of MVs, they would > evaluate directly against base tables and return newer data. Under Option > 2, the same definitions but materialized upstream produce different data, > not just different metadata. > > Therefore, while Option 2 is described in your summary as "giving engines > *flexibility* to determine freshness recursively beyond a source MV", > that *isn’t achievable* under the MV evaluation model itself. > Because each MV treats upstream MVs as physical tables, recursion stops at > the first materialized boundary; *deeper staleness cannot be discovered > without switching to a logical-view evaluation model, i.e., stepping > outside the MV model altogether (note that in Option 3 we can determine > recursive staleness while still inside the MV model).* > > Let me know your thoughts. I slightly prefer Option 3. I’m also fine with > Option 2, but I don’t think the flexibility to recursively determine > freshness actually exists under its evaluation model. Not sure if this > changes anyone’s view, but I wanted to clarify how I’m reading it. > > Thanks, > Walaa. > > > On Wed, Oct 8, 2025 at 11:11 PM Benny Chow <[email protected]> wrote: > >> I just listened to the recording. I'm the tech lead for MVs at Dremio >> and responsible for both refresh management and query rewrites with MVs. >> >> It's great that we seem to agree that Iceberg MV spec won't require that >> MVs always be up to date in order to be usable for query rewrites. There >> can be many data consistency issues (as Dan pointed out) but that is the >> state of affairs today. >> >> It sounds like we are converging on the following scenarios for an engine >> to validate the MV freshness: >> >> 1. Use storage table without any validation. This might be the extreme >> "async MV" example. >> 2. Ignore storage table even if one exists because SQL command or use >> case requires that. >> 3. Use storage table only if data is not more than x hours old. This >> can be achieved with the proposed refresh-start-timestamp-ms which is >> currently in the proposed spec. For this to work with MVs built on MVs, we >> should probably state in the spec that if a MV is built on another MV, then >> it needs to inherit the refresh-start-timestamp-ms of the child MV. In >> Steven's example, when building mv3, refresh-start-timestamp-ms needs to be >> set to the minimum of mv1 or mv2's refresh-start-timestamp-ms. If this >> property name is confusing, we can rename it to >> "refresh-earliest-table-timestamp-ms". I originally proposed this property >> and also listed out other benefits here: >> https://github.com/apache/iceberg/pull/11041#discussion_r1779797796 >> Also, at the time, MVs built on MVs weren't being considered. Now that it >> is, I would recommend we have both "refresh-start-timestamp-ms" (when the >> refresh was started on the storage table) and >> "refresh-earliest-table-timestamp-ms" (used for freshness validation). >> 4. Don't use the storage table if it is older than X hours. This is >> what I had originally proposed for the >> *materialization.max-stalessness-ms* view property here: >> https://github.com/apache/iceberg/pull/11041#discussion_r1744837644 It >> wasn't meant to validate the freshness but more to prevent use of a >> materialization after some criteria. >> 5. Use storage table if recursive validation passes... i.e. >> refresh-state matches the current expanded query tree state. This is what >> I think Steven is calling the "synchronous MV". >> >> For scenario 1-4, it would support the nice use case of an Iceberg client >> using a view's data through the storage table without needing to know how >> to parse/validate/expand any view SQLs. >> >> In Dremio's planner, we primarily use scenario 1 and 4 together to >> determine MV validity for query rewrite. Scenario 2 and 5 also apply in >> certain situations. For scenario 3, Dremio only exposes the >> "refresh-earliest-table-timestamp-ms" as an fyi to the user but it would be >> interesting to allow the user to set this time so that they could run >> queries and be 100% certain that they were not seeing data older than x >> hours. >> >> Thanks >> Benny >> >> On Wed, Oct 8, 2025 at 3:37 PM Steven Wu <[email protected]> wrote: >> >>> correction for a typo. >>> >>> Prashanth brought up another scenario of compaction/rewrite where a new >>> snapshot was added *with* actual data change >>> --> >>> Prashanth brought up another scenario of compaction/rewrite where a new >>> snapshot was added *without* actual data change >>> >>> >>> On Wed, Oct 8, 2025 at 2:12 PM Steven Wu <[email protected]> wrote: >>> >>>> Hi, >>>> >>>> Thanks everyone for joining the MV discussion meeting. We will continue >>>> to have the recurring sync meeting on Wednesday 9 am (Pacific) every 3 >>>> weeks until we get to the finish line where Jan's MV spec PR [1] is merged. >>>> I have scheduled our next meeting on Oct 29 in the Iceberg dev events >>>> calendar. >>>> >>>> Here is the video recording for today's meeting. >>>> >>>> https://drive.google.com/file/d/1-nfhBPDWLoAFDu5cKP0rwLd_30HB6byR/view?usp=sharing >>>> >>>> We mostly discussed freshness evaluation. Here is the meeting summary. >>>> >>>> 1. For tracking the refresh state for the source MV [2], the >>>> consensus is option 2 (treating source MV as a materialized table) which >>>> would give engines the flexibility on freshness determination (recursive >>>> beyond source MV or not). >>>> 2. Earlier design doc [3] discussed max staleness config. But it >>>> wasn't reflected in the spec PR. The general opinion is to add the >>>> config >>>> to the spec PR. The open question is whether the ` >>>> materialization.max-staleness-ms` config should be added to the >>>> view metadata or the storage table metadata. Either can work. We just >>>> need >>>> to decide which makes a little better fit. >>>> 3. Prashanth brought up schema change with default value and how it >>>> may affect the MV refresh state (for SQL representation with select *). >>>> Jan >>>> mentioned that snapshot contains schema id when the snapshot was >>>> created. >>>> Engine can compare the snapshot schema id to the source table schema id >>>> during freshness evaluation. There is no need for additional schema >>>> info in >>>> refresh-state tracking in the storage table. >>>> 4. Prashanth brought up another scenario of compaction/rewrite >>>> where a new snapshot was added with actual data change. The general >>>> take is >>>> that the engine can optimize and decide that MV is fresh as the new >>>> snapshot doesn't have any data change. >>>> >>>> >>>> We can add some clarifications in the spec PR for freshness evaluation >>>> based on the above discussions. >>>> >>>> [1] https://github.com/apache/iceberg/pull/11041 >>>> [2] >>>> https://docs.google.com/document/d/1_StBW5hCQhumhIvgbdsHjyW0ED3dWMkjtNzyPp9Sfr8/edit?tab=t.0 >>>> [3] >>>> https://docs.google.com/document/d/1UnhldHhe3Grz8JBngwXPA6ZZord1xMedY5ukEhZYF-A/edit?tab=t.0#heading=h.3wigecex0zls >>>> >>>> >>>> >>>> >>>> On Thu, Sep 25, 2025 at 9:27 AM Steven Wu <[email protected]> wrote: >>>> >>>>> Hi all, >>>>> >>>>> Iceberg materialized view has been discussed in the community for a >>>>> long time. Thanks Jan Kaul for driving the discussion and the spec PR. It >>>>> has been stalled for a long time due to lack of consensus on 1 or 2 >>>>> topics. >>>>> In Wed's Iceberg community sync meeting, Talat brought up the question on >>>>> how to move forward and if we can have a dedicated meeting for MV. >>>>> >>>>> I have set up a meeting on *Oct 8 (9-10 am Pacific)*. If you >>>>> subscribe to the "Iceberg Dev Events" calendar, you should be able to >>>>> see it. If not, here is the link: https://meet.google.com/nfe-guyq-pqf >>>>> >>>>> We are going to discuss >>>>> * remaining open questions >>>>> * unresolved concerns >>>>> * the next step and hopefully some consensus on moving forward >>>>> >>>>> MV spec PR is up to date. Jan has incorporated recent feedback. This >>>>> should be the base of the discussion. >>>>> https://github.com/apache/iceberg/pull/11041 >>>>> <https://www.google.com/url?q=https://github.com/apache/iceberg/pull/11041&sa=D&source=calendar&usd=2&usg=AOvVaw3w0TjRpwbC17AGzmxZmElM> >>>>> >>>>> Dev discussion thread (a long-running thread started by Jan). >>>>> https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf >>>>> <https://www.google.com/url?q=https://lists.apache.org/thread/y1vlpzbn2x7xookjkffcl08zzyofk5hf&sa=D&source=calendar&usd=2&usg=AOvVaw0fotlsrnRBOb820mA5JRyB> >>>>> >>>>> The mail archive has broken lineage and doesn't show all replies. >>>>> Email subject is "*[DISCUSS] Iceberg Materialzied Views*". >>>>> >>>>> Thanks, >>>>> Steven >>>>> >>>>>
