Re: [DISCUSS] Relocate Parquet to Iceberg Core

Renjie Liu Wed, 18 Dec 2024 17:41:19 -0800

Hi:


> Third, the use case has expanded from stats files to other metadata.


I would +1 for this reason to move the parquet module to the core module.

But I think the right direction is that we still keep interfaces in the
core module so that data file format is pluggable. Moving parquet to core
make parquet a kind of metadata format, but this should not stop other data
file formats from being supported.

On Thu, Dec 19, 2024 at 5:21 AM [email protected] <[email protected]> wrote:

> I was the person that originally suggested that we not move
> iceberg-parquet into core, so it would probably help if I gave some context
> for my rationale as I remember it and what's changed since then.
>
> I pushed back on the original suggestion to move Parquet classes into core
> because it wasn't clear that we needed to move them and couldn't just
> introduce an interface or abstraction that could be implemented by the
> iceberg-parquet module or other formats, like ORC. The idea was to keep the
> core module free of unnecessary dependencies. Most of the pieces of this
> strategy exist already because all of the core metadata classes are
> interfaces, like `DataFile` and `ManifestFile`, and there are writer
> interfaces that can be used with them, like `FileAppender`.
>
> I think it's a good thing to reconsider this because a few things have
> become more clear in the meantime. First, the exploration work to plug in
> formats through an interface never materialized, so we should reconsider
> rather than blocking that work. Second, I think the Iceberg ecosystem has
> moved from being mostly format-agnostic because Parquet is the format that
> is widely supported and optimized (Comet, for example, supports Parquet,
> not to mention vendor offerings). Third, the use case has expanded from
> stats files to other metadata.
>
> Now that we are looking at introducing more columnar metadata and have not
> added an abstraction to support multiple formats in the last year or so,
> I'm reconsidering the value of being format-agnostic. The original choice
> to use Avro for metadata was to maximize compatibility and I think that
> concern is stronger and stronger, given how widely Parquet is supported. If
> we are going to keep the surface area small by specifying Parquet for
> columnar metadata, then we don't need to do the extra work to add more
> abstractions.
>
> And now after having written that, I think I'm now convinced. +1 for
> merging iceberg-parquet into core.
>
> On Mon, Dec 9, 2024 at 3:21 AM Ajantha Bhat <[email protected]> wrote:
>
>> Thanks Dan for the reply.
>>
>> This is also a good time to consider adding a native parquet read/write
>>> path for use in core as the generic path in 'iceberg-data' isn't ideal.
>>> Parquet metadata has been brought up in relation to improving stats
>>> handling (allowing tracking of more column stats without impacting planning
>>> performance), improving stats representations along with other possible
>>> benefits like improved compression and scan performance.
>>
>>
>> These two proposals are super interesting. I don't see any public
>> discussion on the same. Could you please provide more details or point
>> me to the discussions? +1 for moving the parquet module to core if it helps
>> for these proposals.
>>
>> I feel like ORC is a separate discussion and while we may want to include
>>> it, I wouldn't say there's a definitive answer unless we know there is
>>> adequate investment in it.
>>
>> Having a module for ORC but not Parquet looks odd. I don't think the
>> effort is huge for moving these files.
>> I can take it up as well.
>>
>> I wasn't aware you had a PR as part of the prior discussion, but I'm
>>> happy to revisit that if we decide this is a reasonable path forward.
>>
>> Sure. I can revive my PR.
>>
>> For those who are looking for previous discussion on the same topic last
>> year.
>> https://lists.apache.org/thread/8m6f3k7b425czktzf22q902vxgp2y10r
>>
>> - Ajantha
>>
>>
>>
>> On Sat, Dec 7, 2024 at 10:26 AM Daniel Weeks <[email protected]> wrote:
>>
>>> Hey Ajantha,
>>>
>>> I understand it was discussed before, but I think a lot of recent
>>> discussions around improvements for parquet metadata/stats/etc is good
>>> justification for revisiting the earlier discussion.
>>>
>>> Parquet metadata has been brought up in relation to improving stats
>>> handling (allowing tracking of more column stats without impacting planning
>>> performance), improving stats representations along with other possible
>>> benefits like improved compression and scan performance.
>>>
>>> The original decision was more narrowly focused on the stats case and
>>> there were viable (though possibly not ideal) workarounds to keep the
>>> existing separation of subprojects, but at this point I see this more as a
>>> barrier to exploring some of these ideas as it's quite difficult to allow
>>> core to work directly with parquet.
>>>
>>> This is also a good time to consider adding a native parquet read/write
>>> path for use in core as the generic path in 'iceberg-data' isn't ideal
>>> (this might also be useful for projects like Kafka Connect).
>>>
>>> I feel like ORC is a separate discussion and while we may want to
>>> include it, I wouldn't say there's a definitive answer unless we know there
>>> is adequate investment in it.
>>>
>>> I wasn't aware you had a PR as part of the prior discussion, but I'm
>>> happy to revisit that if we decide this is a reasonable path forward.
>>>
>>> -Dan
>>>
>>>
>>>
>>> On Fri, Dec 6, 2024 at 5:31 PM Ajantha Bhat <[email protected]>
>>> wrote:
>>>
>>>> Hi Dan,
>>>>
>>>> I proposed the same last year while working on partition stats.
>>>> I can revive this PR if required,
>>>> https://github.com/apache/iceberg/pull/8500
>>>>
>>>> But we decided that `*iceberg-data`* can write these parquet stats
>>>> files (metadata) and core can just register it.
>>>> So, it is no longer needed for partition stats.
>>>>
>>>> a) Do we have any strong use case or feature that requires it now?
>>>> b) I hope we do the same for ORC as well as it looks odd to have a
>>>> module for that?
>>>>
>>>> - Ajantha
>>>>
>>>> On Sat, Dec 7, 2024 at 5:22 AM Daniel Weeks <[email protected]> wrote:
>>>>
>>>>> Everyone,
>>>>>
>>>>> I wanted to propose moving the parquet implementation from the
>>>>> 'iceberg-parquet' project to the 'iceberg-core' project.
>>>>>
>>>>> The original motivation for keeping these subprojects separate was due
>>>>> to Iceberg relying on avro (which is included in the core project) for
>>>>> metadata and keeping other format implementations separate, but as we
>>>>> consider adding support for partition stats and parquet metadata, we need
>>>>> the ability to read and write parquet from core library.
>>>>>
>>>>> I've created a draft PR <https://github.com/apache/iceberg/pull/11716>
>>>>> of the proposed changes, which relocates relatively cleanly, but wanted to
>>>>> discuss whether there are concerns or other considerations for keeping 
>>>>> them
>>>>> separate.
>>>>>
>>>>> -Dan
>>>>>
>>>>

Re: [DISCUSS] Relocate Parquet to Iceberg Core

Reply via email to