> Is there an alternative where we do an implementation similar to how
> Position Deletes and Data Files are currently written? Like we have the
> more generic "writers" in core but the actual implementations still live in
> iceberg-parquet or iceberg-orc?

+1. I'm also thinking of extracting a common read/write interface while
leaving each format's concrete implementation in its corresponding module.

On Fri, Nov 3, 2023 at 9:28 AM Ajantha Bhat <ajanthab...@gmail.com> wrote:

>> Is there an alternative where we do an implementation similar to how
>> Position Deletes and Data Files are currently written? Like we have the
>> more generic "writers" in core but the actual implementations still live in
>> iceberg-parquet or iceberg-orc?
>
> Hi Russell,
> Let me explore this path and get back to you.
> Thanks.
>
> On Thu, Nov 2, 2023 at 8:09 PM Russell Spitzer <russell.spit...@gmail.com>
> wrote:
>
>> Is there an alternative where we do an implementation similar to how
>> Position Deletes and Data Files are currently written? Like we have the
>> more generic "writers" in core but the actual implementations still live in
>> iceberg-parquet or iceberg-orc?
>>
>> On Nov 2, 2023, at 9:38 AM, Ajantha Bhat <ajanthab...@gmail.com> wrote:
>>
>> Hi Renjie,
>>
>> I have highlighted the use case from the above mail,
>>
>>> *However, with the addition of partition statistics
>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file>,
>>> Iceberg's metadata (stats file) will be represented in Parquet or ORC
>>> formats.*
>>> To enable the `iceberg-core` module to write metadata in Parquet or ORC
>>> format, it will make extensive use of the functions found in the
>>> `iceberg-parquet` and `iceberg-orc` modules. *However, due to a circular
>>> dependency issue*, *`iceberg-core` cannot directly rely on
>>> `iceberg-parquet` and `iceberg-orc`.*
>>> Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as
>>> packages within the `iceberg-core` module.
>>
>> A utility for reading and writing partition statistics in Parquet format
>> is expected to take the form outlined here
>> <https://github.com/apache/iceberg/pull/8503/commits/2ba244540bf9fd574ece909f4cb178fdf12defa8>,
>> leveraging the `iceberg-parquet` dependency.
>>
>> To facilitate on-demand partition statistics computation, this utility
>> can find a home in either `iceberg-data` or a new module that relies on
>> both `iceberg-parquet` and `iceberg-orc`. This approach would enable all
>> engines to make use of it.
>>
>> However, for the synchronous calculation of statistics during insertion,
>> similar to how Trino supports Puffin stats, the `iceberg-core` module's
>> snapshot producer must have access to this utility. This presents a
>> challenge due to the existing circular dependency, as `iceberg-parquet` and
>> `iceberg-orc` already depend on `iceberg-core`.
>>
>> To resolve this circular dependency issue, my proposal is to integrate
>> them as separate packages within the `iceberg-core` module.
>> I believe it's best to include them in the appropriate place during the
>> initial addition itself to support both synchronous and asynchronous writes,
>> instead of adding to `iceberg-data` just for asynchronous writes and
>> later deprecating and moving them to core during synchronous write
>> implementation.
>>
>> Moving them to `iceberg-core` can also open up the possibility of writing
>> existing metadata (like manifests and manifest lists) in Parquet or ORC
>> instead of Avro in the future.
>>
>> Thanks,
>> Ajantha
>>
>> On Thu, Nov 2, 2023 at 5:07 PM Renjie Liu <liurenjie2...@gmail.com>
>> wrote:
>>
>>> Hi:
>>>
>>> Could you provide concrete cases to elaborate this change?
>>>
>>> On Thu, Nov 2, 2023 at 4:22 PM Gabor Kaszab <gaborkas...@apache.org>
>>> wrote:
>>>
>>>> Hey Ajantha,
>>>>
>>>> Wouldn't this require a major version bump considering this is a
>>>> breaking change for users depending on iceberg-parquet or iceberg-orc now?
>>>>
>>>> Gabor
>>>>
>>>> On Thu, Nov 2, 2023 at 3:01 AM Ajantha Bhat <ajanthab...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi Everyone,
>>>>>
>>>>> At present, Iceberg exclusively utilizes Avro, JSON, and Puffin
>>>>> formats to handle metadata. A few discussions in the past have explored
>>>>> the possibility of supporting this existing metadata in Parquet or ORC
>>>>> format. However, with the addition of partition statistics
>>>>> <https://github.com/apache/iceberg/blob/main/format/spec.md#partition-statistics-file>,
>>>>> Iceberg's metadata (stats file) will be represented in Parquet or ORC
>>>>> formats.
>>>>>
>>>>> To enable the `iceberg-core` module to write metadata in Parquet or
>>>>> ORC format, it will make extensive use of the functions found in the
>>>>> `iceberg-parquet` and `iceberg-orc` modules. However, due to a circular
>>>>> dependency issue, `iceberg-core` cannot directly rely on
>>>>> `iceberg-parquet` and `iceberg-orc`.
>>>>> Consequently, I suggest merging `iceberg-parquet` and `iceberg-orc` as
>>>>> packages within the `iceberg-core` module.
>>>>>
>>>>> For end users, the main change in the new release package will be the
>>>>> absence of separate `iceberg-parquet` and `iceberg-orc` JAR files.
>>>>> Instead, they can depend on `iceberg-core` (which they were likely
>>>>> doing already). This change will also be clearly documented in the
>>>>> release notes.
>>>>>
>>>>> I would appreciate hearing your thoughts on this proposal.
>>>>>
>>>>> For a detailed look at the code changes required to implement the
>>>>> integration of `iceberg-parquet` into `iceberg-core`,
>>>>> please refer to the following PR:
>>>>> https://github.com/apache/iceberg/pull/8500
>>>>>
>>>>> Thanks,
>>>>> Ajantha
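
For reference, the "generic writers in core, concrete implementations in the
format modules" idea discussed in this thread could be sketched roughly as
below. This is only an illustration under stated assumptions: every name
(`StatsWriter`, `StatsWriterFactory`, `Registry`, `FakeParquetStatsWriter`)
is hypothetical and not an existing Iceberg API, and the in-memory writer
merely stands in for a real Parquet writer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch only: all names below are hypothetical, not an existing Iceberg API.
final class FormatRegistrySketch {

  // --- Would live in iceberg-core: format-agnostic contracts ---
  interface StatsWriter {
    void add(String partition, long recordCount);

    // Returns the rows "written"; a real writer would flush a file instead.
    List<String> close();
  }

  interface StatsWriterFactory {
    StatsWriter create(String location);
  }

  // Core looks writers up by format name and never references a format module.
  static final class Registry {
    private static final Map<String, StatsWriterFactory> FACTORIES =
        new ConcurrentHashMap<>();

    static void register(String format, StatsWriterFactory factory) {
      FACTORIES.put(format.toLowerCase(Locale.ROOT), factory);
    }

    static StatsWriter writerFor(String format, String location) {
      StatsWriterFactory factory = FACTORIES.get(format.toLowerCase(Locale.ROOT));
      if (factory == null) {
        throw new IllegalStateException("No writer registered for format: " + format);
      }
      return factory.create(location);
    }
  }

  // --- Would live in iceberg-parquet: depends on core, never the reverse ---
  static final class FakeParquetStatsWriter implements StatsWriter {
    private final String location;
    private final List<String> rows = new ArrayList<>();

    FakeParquetStatsWriter(String location) {
      this.location = location;
    }

    @Override
    public void add(String partition, long recordCount) {
      rows.add(location + ":" + partition + "=" + recordCount);
    }

    @Override
    public List<String> close() {
      return rows;
    }
  }

  public static void main(String[] args) {
    // The format module registers itself; core only asks for "parquet".
    Registry.register("parquet", FakeParquetStatsWriter::new);
    StatsWriter writer = Registry.writerFor("parquet", "stats-1.parquet");
    writer.add("dt=2023-11-02", 42L);
    System.out.println(writer.close());
  }
}
```

Because core would only see the interface and a registry keyed by format
name, `iceberg-parquet` and `iceberg-orc` could keep depending on
`iceberg-core` without creating a cycle. How registration would actually be
triggered (an explicit call, `java.util.ServiceLoader`, or reflection) is
left open in this sketch.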