Re: [DISCUSS] Relocate Parquet to Iceberg Core

rdb...@gmail.com Wed, 18 Dec 2024 13:21:14 -0800

I was the person that originally suggested that we not move iceberg-parquet
into core, so it would probably help if I gave some context for my
rationale as I remember it and what's changed since then.


I pushed back on the original suggestion to move Parquet classes into core
because it wasn't clear that we needed to move them and couldn't just
introduce an interface or abstraction that could be implemented by the
iceberg-parquet module or other formats, like ORC. The idea was to keep the
core module free of unnecessary dependencies. Most of the pieces of this
strategy exist already because all of the core metadata classes are
interfaces, like `DataFile` and `ManifestFile`, and there are writer
interfaces that can be used with them, like `FileAppender`.

I think it's a good thing to reconsider this because a few things have
become more clear in the meantime. First, the exploration work to plug in
formats through an interface never materialized, so we should reconsider
rather than blocking that work. Second, I think the Iceberg ecosystem has
moved from being mostly format-agnostic because Parquet is the format that
is widely supported and optimized (Comet, for example, supports Parquet,
not to mention vendor offerings). Third, the use case has expanded from
stats files to other metadata.

Now that we are looking at introducing more columnar metadata and have not
added an abstraction to support multiple formats in the last year or so,
I'm reconsidering the value of being format-agnostic. The original choice
to use Avro for metadata was to maximize compatibility and I think that
concern is stronger and stronger, given how widely Parquet is supported. If
we are going to keep the surface area small by specifying Parquet for
columnar metadata, then we don't need to do the extra work to add more
abstractions.

And now after having written that, I think I'm now convinced. +1 for
merging iceberg-parquet into core.

On Mon, Dec 9, 2024 at 3:21 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

> Thanks Dan for the reply.
>
> This is also a good time to consider adding a native parquet read/write
>> path for use in core as the generic path in 'iceberg-data' isn't ideal.
>> Parquet metadata has been brought up in relation to improving stats
>> handling (allowing tracking of more column stats without impacting planning
>> performance), improving stats representations along with other possible
>> benefits like improved compression and scan performance.
>
>
> These two proposals are super interesting. I don't see any public
> discussion on the same. Could you please provide more details or point
> me to the discussions? +1 for moving the parquet module to core if it helps
> for these proposals.
>
> I feel like ORC is a separate discussion and while we may want to include
>> it, I wouldn't say there's a definitive answer unless we know there is
>> adequate investment in it.
>
> Having a module for ORC but not Parquet looks odd. I don't think the
> effort is huge for moving these files.
> I can take it up as well.
>
> I wasn't aware you had a PR as part of the prior discussion, but I'm happy
>> to revisit that if we decide this is a reasonable path forward.
>
> Sure. I can revive my PR.
>
> For those who are looking for previous discussion on the same topic last
> year.
> https://lists.apache.org/thread/8m6f3k7b425czktzf22q902vxgp2y10r
>
> - Ajantha
>
>
>
> On Sat, Dec 7, 2024 at 10:26 AM Daniel Weeks <dwe...@apache.org> wrote:
>
>> Hey Ajantha,
>>
>> I understand it was discussed before, but I think a lot of recent
>> discussions around improvements for parquet metadata/stats/etc is good
>> justification for revisiting the earlier discussion.
>>
>> Parquet metadata has been brought up in relation to improving stats
>> handling (allowing tracking of more column stats without impacting planning
>> performance), improving stats representations along with other possible
>> benefits like improved compression and scan performance.
>>
>> The original decision was more narrowly focused on the stats case and
>> there were viable (though possibly not ideal) workarounds to keep the
>> existing separation of subprojects, but at this point I see this more as a
>> barrier to exploring some of these ideas as it's quite difficult to allow
>> core to work directly with parquet.
>>
>> This is also a good time to consider adding a native parquet read/write
>> path for use in core as the generic path in 'iceberg-data' isn't ideal
>> (this might also be useful for projects like Kafka Connect).
>>
>> I feel like ORC is a separate discussion and while we may want to include
>> it, I wouldn't say there's a definitive answer unless we know there is
>> adequate investment in it.
>>
>> I wasn't aware you had a PR as part of the prior discussion, but I'm
>> happy to revisit that if we decide this is a reasonable path forward.
>>
>> -Dan
>>
>>
>>
>> On Fri, Dec 6, 2024 at 5:31 PM Ajantha Bhat <ajanthab...@gmail.com>
>> wrote:
>>
>>> Hi Dan,
>>>
>>> I proposed the same last year while working on partition stats.
>>> I can revive this PR if required,
>>> https://github.com/apache/iceberg/pull/8500
>>>
>>> But we decided that `*iceberg-data`* can write these parquet stats
>>> files (metadata) and core can just register it.
>>> So, it is no longer needed for partition stats.
>>>
>>> a) Do we have any strong use case or feature that requires it now?
>>> b) I hope we do the same for ORC as well as it looks odd to have a
>>> module for that?
>>>
>>> - Ajantha
>>>
>>> On Sat, Dec 7, 2024 at 5:22 AM Daniel Weeks <dwe...@apache.org> wrote:
>>>
>>>> Everyone,
>>>>
>>>> I wanted to propose moving the parquet implementation from the
>>>> 'iceberg-parquet' project to the 'iceberg-core' project.
>>>>
>>>> The original motivation for keeping these subprojects separate was due
>>>> to Iceberg relying on avro (which is included in the core project) for
>>>> metadata and keeping other format implementations separate, but as we
>>>> consider adding support for partition stats and parquet metadata, we need
>>>> the ability to read and write parquet from core library.
>>>>
>>>> I've created a draft PR <https://github.com/apache/iceberg/pull/11716>
>>>> of the proposed changes, which relocates relatively cleanly, but wanted to
>>>> discuss whether there are concerns or other considerations for keeping them
>>>> separate.
>>>>
>>>> -Dan
>>>>
>>>

Re: [DISCUSS] Relocate Parquet to Iceberg Core

Reply via email to