Re: Iceberg / Spark syncs

Anurag Mantripragada Tue, 20 Jan 2026 12:28:05 -0800

Thanks everyone for joining the first Iceberg/Spark community sync.

Here is the recording: https://youtu.be/g4n2hwdFosE?si=n9hVRhCThshuOqd5



Below are the discussion highlights.

Datafusion Comet integration

   - Spark: Encapsulate parquet objects for Comet (#13786
     <https://github.com/apache/iceberg/pull/13786>)
   - Future of Iceberg support in Comet (datafusion-comet#2921
     <https://github.com/apache/datafusion-comet/issues/2921>)
      - Mailing List Discussion
        <https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd>
   - Notes:
      - Rust vs. Java: discuss and vote on the dev list.
      - To move forward with #13786
        <https://github.com/apache/iceberg/pull/13786>, discuss in the
        FileFormat API sync whether this PR still needs updates for any
        pending items.
      - Decide whether to merge the PR or wait for the FileFormat API.


Spark 3.4 Deprecation

   - Spark: Remove Spark 3.4 support (#14122
     <https://github.com/apache/iceberg/pull/14122>)
   - Notes:
      - Wait until the Comet integration is resolved.

Spark 4.1/4.2

   - Spark: Add support for 4.2.0-preview (#14984
     <https://github.com/apache/iceberg/pull/14984>)
   - Spark 4.1: Initial support for MERGE INTO schema evolution (#14970
     <https://github.com/apache/iceberg/pull/14970>)
   - Notes:
      - Spark 4.1 is the latest version; new PRs must target it.
      - Spark 4.1 introduces a version framework. Anton is working on
        integrating it with Iceberg. This greatly simplifies Iceberg
        lifecycle management but requires non-trivial integration work.
      - Prefer not to make any releases with 4.1 until this integration
        is in.


DSv2 and sort order reporting

   - Spark (4.0, 3.5): Set data file sort_order_id in manifest for
     writes from Spark (#14683
     <https://github.com/apache/iceberg/pull/14683>)
      - The rebase has many changes. Ask the author to fix it.
   - Spark 4.0: Implement SupportsReportOrdering DSv2 API (#14948
     <https://github.com/apache/iceberg/pull/14948>)
      - Move to 4.1 for easier review.


Compaction/Table maintenance/DR

   - Spark 4.0: RewriteTablePath support for multiple source and
     destination prefixes (#14355
     <https://github.com/apache/iceberg/pull/14355>)
   - Spark 4.0: Optional switch to log expire data files during
     ExpireSnapshots action (#14354
     <https://github.com/apache/iceberg/pull/14354>)
   - Notes:
      - Trace level logging
      - How about logging it to another Iceberg table?
      - Use the dataframe of files and log separately?


V3 spec implementation

   - Spark: Support writing shredded variant in Iceberg-Spark (#14297
     <https://github.com/apache/iceberg/pull/14297>)
   - Notes:
      - Status of Variant type support: consolidate and track somewhere.
      - Filter pushdown is not implemented.
      - The write support PR is new and will be reviewed. It should
        include Iceberg metadata changes that indicate the variant
        shredding so Spark can use it.
      - #14297 <https://github.com/apache/iceberg/pull/14297> will be
        reviewed.


Spark UDF Support

   - SQL UDF support Stage 1 (#14954
     <https://github.com/apache/iceberg/pull/14954>). Corresponding
     Spark SPIP: Catalog-backed Code-Literal Functions (SQL and Python)
     with Catalog SPI and CRUD
     <https://docs.google.com/document/d/186cTAZxoXp1p8vaSunIaJmVLXcPR-FxSiLiDUl8kK8A/edit?tab=t.0#heading=h.for1fb3tezo3>
   - Notes:
      - Waiting for the proposal vote and the related Spark-side SPIP.
   - Spark 4.0: Spark UDF POC (#14505
     <https://github.com/apache/iceberg/pull/14505>)
      - Huaxin to delete this PR. This version is hacky.


DDL and schema evolution

   - CREATE TABLE LIKE support (#14269
     <https://github.com/apache/iceberg/pull/14269>)
   - Notes:
      - Recommend not adding SQL extensions to Iceberg code anymore.
         - They are fragile, need maintenance, and have to work well
           with Spark.
      - Alternatively, consider writing a procedure to do this until
        Spark has native support.
      - Native Spark support for CREATE TABLE LIKE is not yet
        implemented.

Spark PRs

   - Alpha family aggregate support - #52551
     <https://github.com/apache/spark/pull/52551>
      - Notes:
         - Okay to have Spark-only changes that can potentially help in
           Iceberg use cases.
         - Elaborate on the use of this. How does this integrate with
           Iceberg?
   - Codegen for MergeRowsExec - #52399
     <https://github.com/apache/spark/pull/52399>
      - Notes:
         - This is a heavily used Exec node in Iceberg, so this is good
           to have.
         - The community will review this.




Thanks,
~Anurag

On Thu, Jan 15, 2026 at 6:48 PM Anton Okolnychyi <[email protected]>
wrote:

> If anyone has long-standing PRs related to Spark, it may be a good forum
> to get some reviews and help from the community.
>
> On Wed, Jan 14, 2026 at 11:23 AM Anurag Mantripragada <
> [email protected]> wrote:
>
>> Thanks Kevin,
>>
>> All, please review the doc
>> <https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0>
>> and add any agenda items I may have missed. See you on Tuesday.
>>
>> ~ Anurag
>>
>> On Wed, Jan 14, 2026 at 11:20 AM Kevin Liu <[email protected]> wrote:
>>
>>> Connected with Anurag on Slack. I just added a new event to the Iceberg
>>> Dev calendar for next Tuesday, Jan 20th, from 10 AM to 11 AM PT: "*Iceberg
>>> - Spark Community Sync*". It's a monthly recurring meeting, and the
>>> Google Meet link is open to the public.
>>> Happy to make changes based on feedback.
>>>
>>> Best,
>>> Kevin Liu
>>>
>>>
>>> On Wed, Jan 14, 2026 at 10:57 AM Kevin Liu <[email protected]>
>>> wrote:
>>>
>>>> Looking at the current Iceberg dev calendar schedule, we have a slot
>>>> next week Tuesday or Friday for a monthly recurring sync. Wednesday
>>>> corresponds with the main Community Sync in some weeks.
>>>> Please let me know the preferred day and time and I can help set it up!
>>>>
>>>> Best,
>>>> Kevin Liu
>>>>
>>>> On Tue, Jan 13, 2026 at 10:58 AM Anurag Mantripragada <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Kevin,
>>>>>
>>>>> I'm open to ideas, but I think we could start with a monthly cadence for
>>>>> Spark syncs and increase the frequency if the community feels we need to
>>>>> meet more often. Could you please set up a time on the Iceberg dev
>>>>> calendar?
>>>>>
>>>>> Thanks,
>>>>> Anurag
>>>>>
>>>>> On Fri, Jan 9, 2026 at 10:16 AM Anurag Mantripragada <
>>>>> [email protected]> wrote:
>>>>>
>>>>>> Thanks Anton and Kevin,
>>>>>>
>>>>>> I wrote a doc with general themes from the Spark PRs and issues I
>>>>>> browsed in the repo. Please feel free to add more if I have missed
>>>>>> anything.
>>>>>>
>>>>>> https://docs.google.com/document/d/19nno1RoPznbbxKOZZddZNHHafa7XULjbN6RPExdr2n4/edit?tab=t.0
>>>>>>
>>>>>> Looking forward to meeting you all and talking about all things Spark!
>>>>>>
>>>>>> Thanks,
>>>>>> Anurag
>>>>>>
>>>>>> On Fri, Jan 9, 2026 at 10:03 AM Kevin Liu <[email protected]>
>>>>>> wrote:
>>>>>>
>>>>>>> +1 great idea!
>>>>>>> Let's start a doc with potential discussion items and find a time on
>>>>>>> the calendar. I have permission to add events to the "iceberg dev
>>>>>>> events" calendar. Happy to help with the logistics once the time and
>>>>>>> cadence are decided.
>>>>>>>
>>>>>>> Best,
>>>>>>> Kevin Liu
>>>>>>>
>>>>>>> On Wed, Jan 7, 2026 at 4:35 PM Anton Okolnychyi <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> YES! I have been meaning to suggest the same.
>>>>>>>>
>>>>>>>> Can you start a doc with the pool of items to which everyone can
>>>>>>>> contribute?
>>>>>>>>
>>>>>>>> - Anton
>>>>>>>>
>>>>>>>> On Wed, Jan 7, 2026 at 3:30 PM Anurag Mantripragada <
>>>>>>>> [email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi folks, happy new year!
>>>>>>>>>
>>>>>>>>> (Sorry if I sent this email more than once; my attempts at
>>>>>>>>> sending this from a different email address failed.)
>>>>>>>>>
>>>>>>>>> There are a few Spark changes the community is working on including
>>>>>>>>> - Sort order reporting [1], [2]
>>>>>>>>> - Spark 4.1 support [3]
>>>>>>>>> - Future of Datafusion-Comet support [4] [5]
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Community members interested in the Spark integration have been
>>>>>>>>> discussing it in smaller groups. However, we believe that the general
>>>>>>>>> community sync should include all updates, and discussing
>>>>>>>>> Spark-specific matters may not be the most effective use of that sync.
>>>>>>>>> I was wondering if it would be useful to create a Spark-Iceberg
>>>>>>>>> integration-specific sync on the calendar, similar to what we have for
>>>>>>>>> individual proposals. This sync will not replace the community sync,
>>>>>>>>> which will still be used for broader discussions, including any new
>>>>>>>>> Spark topics that come out of the Spark sync.
>>>>>>>>>
>>>>>>>>> If there’s interest in doing these Spark breakout syncs, I’m happy
>>>>>>>>> to volunteer to run them. Please let me know what you all think.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Thanks,
>>>>>>>>> ~ Anurag
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> [1] - https://github.com/apache/iceberg/pull/14683
>>>>>>>>> [2] - https://github.com/apache/iceberg/pull/14948
>>>>>>>>> [3] - https://github.com/apache/iceberg/pull/14970
>>>>>>>>> [4] - https://github.com/apache/datafusion-comet/issues/2921
>>>>>>>>> [5] - https://lists.apache.org/thread/vr9nsbd5nhg3d20nmtyj4b3zsw9229gd
>>>>>>>>>
>>>>>>>>
