Re: [Discussion] Move `iceberg-parquet` and `iceberg-orc` modules into `iceberg-core`

Renjie Liu Thu, 02 Nov 2023 19:27:34 -0700

>
> Is there an alternative where we do an implementation similar to how
> Position Deletes and Data Files are currently written? Like we have the
> more generic "writers" in core but the actual implementations still live in
> iceberg-parquet or iceberg-orc?



+1. I'm also thinking of extracting a common read/write interface while
leaving the concrete, format-specific implementations in their corresponding
modules.
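
One way to picture this split is the minimal sketch below. It is only an illustration under stated assumptions: `StatsFileWriter`, `StatsWriters`, and `ParquetStatsFileWriter` are hypothetical names, not existing Iceberg classes, and the "write" is a stand-in that returns a string rather than producing a real Parquet file. The point is the dependency direction: core would own the interface and the lookup, while the format module would own the implementation and register it.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FormatWriterSketch {

  // Hypothetical: would live in iceberg-core. Format-agnostic contract.
  interface StatsFileWriter {
    String write(List<String> rows);
  }

  // Hypothetical: would live in iceberg-core. Core looks writers up by
  // format name and never references a format module directly.
  static final class StatsWriters {
    private static final Map<String, StatsFileWriter> REGISTRY = new ConcurrentHashMap<>();

    static void register(String format, StatsFileWriter writer) {
      REGISTRY.put(format, writer);
    }

    static StatsFileWriter forFormat(String format) {
      StatsFileWriter writer = REGISTRY.get(format);
      if (writer == null) {
        throw new IllegalArgumentException("No writer registered for format: " + format);
      }
      return writer;
    }
  }

  // Hypothetical: would live in iceberg-parquet, which already depends on core.
  // Stand-in body; a real implementation would write an actual Parquet file.
  static final class ParquetStatsFileWriter implements StatsFileWriter {
    @Override
    public String write(List<String> rows) {
      return "parquet:" + rows.size();
    }
  }

  public static void main(String[] args) {
    // The format module (or the engine wiring it in) registers itself once.
    StatsWriters.register("parquet", new ParquetStatsFileWriter());

    // Core-side code resolves the writer without a compile-time dependency.
    StatsFileWriter writer = StatsWriters.forFormat("parquet");
    System.out.println(writer.write(List.of("partition-1", "partition-2")));
  }
}
```

With this shape, `iceberg-parquet` keeps depending on `iceberg-core` and no edge points the other way, so the cycle never forms. The trade-off is that the format module must be on the classpath and registered at runtime before core can write stats in that format.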

On Fri, Nov 3, 2023 at 9:28 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

>> Is there an alternative where we do an implementation similar to how
>> Position Deletes and Data Files are currently written? Like we have the
>> more generic "writers" in core but the actual implementations still live in
>> iceberg-parquet or iceberg-orc?
>
>
> Hi Russell,
> Let me explore this path and get back to you.
> Thanks.
>
> On Thu, Nov 2, 2023 at 8:09 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> Is there an alternative where we do an implementation similar to how
>> Position Deletes and Data Files are currently written? Like we have the
>> more generic "writers" in core but the actual implementations still live in
>> iceberg-parquet or iceberg-orc?
>>
>> On Nov 2, 2023, at 9:38 AM, Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>
>> Hi Renjie,
>>
>> I have highlighted the use case from the above mail,
>>
>>
>>>
>>> *However, with the addition of partition statistics
>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file>,
>>> Iceberg's metadata (stats file) will be represented in Parquet or ORC
>>> formats.*
>>> To enable the `iceberg-core` module to write metadata in Parquet or ORC
>>> format, it will make extensive use of the functions found in the
>>> `iceberg-parquet`
>>> and `iceberg-orc` modules. *However, due to a circular dependency issue*,
>>> *`iceberg-core` cannot directly rely on `iceberg-parquet` and
>>> `iceberg-orc`.*
>>> Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as
>>> packages within the `iceberg-core` module.
>>
>>
>> A utility for reading and writing partition statistics in Parquet format
>> is expected to take the form outlined here
>> <https://github.com/apache/iceberg/pull/8503/commits/2ba244540bf9fd574ece909f4cb178fdf12defa8>,
>> leveraging the `iceberg-parquet` dependency.
>>
>> To facilitate on-demand partition statistics computation, this utility
>> can find a home in either `iceberg-data` or a new module that relies on
>> both `iceberg-parquet` and `iceberg-orc`. This approach would enable all
>> engines to make use of it.
>>
>> However, for the synchronous calculation of statistics during insertion,
>> similar to how Trino supports Puffin stats, the `iceberg-core` module's
>> snapshot producer must have access to this utility. This presents a
>> challenge due to the existing circular dependency, as `iceberg-parquet` and
>> `iceberg-orc` already depend on `iceberg-core`.
>>
>> To resolve this circular dependency issue, my proposal is to integrate
>> them as separate packages within the `iceberg-core` module.
>> I believe it's best to include them in the appropriate place during the
>> initial addition itself to support both synchronous and asynchronous writes,
>> instead of adding to `iceberg-data` just for asynchronous writes and
>> later deprecating and moving them to core during synchronous write
>> implementation.
>>
>> Moving them to `iceberg-core` can also open up the possibility of writing
>> existing metadata (like manifests and manifest lists) in Parquet or ORC
>> instead of Avro in the future.
>>
>> Thanks,
>> Ajantha
>>
>> On Thu, Nov 2, 2023 at 5:07 PM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>>
>>> Hi:
>>>
>>> Could you provide concrete cases to elaborate on this change?
>>>
>>> On Thu, Nov 2, 2023 at 4:22 PM Gabor Kaszab <gaborkas...@apache.org>
>>> wrote:
>>>
>>>> Hey Ajantha,
>>>>
>>>> Wouldn't this require a major version bump considering this is a
>>>> breaking change for users depending on iceberg-parquet or iceberg-orc now?
>>>>
>>>> Gabor
>>>>
>>>> On Thu, Nov 2, 2023 at 3:01 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> At present, Iceberg exclusively utilizes Avro, JSON, and Puffin
>>>>> formats to handle metadata. A few discussions in the past have explored
>>>>> the possibility of supporting this existing metadata in Parquet or ORC
>>>>> format. However, with the addition of partition statistics
>>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file>,
>>>>> Iceberg's metadata (stats file) will be represented in Parquet or ORC
>>>>> formats.
>>>>>
>>>>> To enable the `iceberg-core` module to write metadata in Parquet or
>>>>> ORC format, it will make extensive use of the functions found in the
>>>>> `iceberg-parquet`
>>>>> and `iceberg-orc` modules. However, due to a circular dependency
>>>>> issue, `iceberg-core` cannot directly rely on `iceberg-parquet` and
>>>>> `iceberg-orc`.
>>>>> Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as
>>>>> packages within the `iceberg-core` module.
>>>>>
>>>>> For end users, the main change in the new release package will be the
>>>>> absence of separate `iceberg-parquet` and `iceberg-orc` JAR files.
>>>>> Instead, they can depend on `iceberg-core` (which they were likely
>>>>> doing already). This change will also be clearly documented in the
>>>>> release notes.
>>>>>
>>>>> I would appreciate hearing your thoughts on this proposal.
>>>>>
>>>>> For a detailed look at the code changes required to implement the
>>>>> integration of `iceberg-parquet` into `iceberg-core`,
>>>>> please refer to the following PR:
>>>>> https://github.com/apache/iceberg/pull/8500
>>>>>
>>>>> Thanks,
>>>>> Ajantha
>>>>>
>>>>
>>
