Hi:
> Third, the use case has expanded from stats files to other metadata. I would +1 for this reason to move the parquet module to the core module. But I think the right direction is that we still keep interfaces in the core module so that data file format is pluggable. Moving parquet to core make parquet a kind of metadata format, but this should not stop other data file formats from being supported. On Thu, Dec 19, 2024 at 5:21 AM rdb...@gmail.com <rdb...@gmail.com> wrote: > I was the person that originally suggested that we not move > iceberg-parquet into core, so it would probably help if I gave some context > for my rationale as I remember it and what's changed since then. > > I pushed back on the original suggestion to move Parquet classes into core > because it wasn't clear that we needed to move them and couldn't just > introduce an interface or abstraction that could be implemented by the > iceberg-parquet module or other formats, like ORC. The idea was to keep the > core module free of unnecessary dependencies. Most of the pieces of this > strategy exist already because all of the core metadata classes are > interfaces, like `DataFile` and `ManifestFile`, and there are writer > interfaces that can be used with them, like `FileAppender`. > > I think it's a good thing to reconsider this because a few things have > become more clear in the meantime. First, the exploration work to plug in > formats through an interface never materialized, so we should reconsider > rather than blocking that work. Second, I think the Iceberg ecosystem has > moved from being mostly format-agnostic because Parquet is the format that > is widely supported and optimized (Comet, for example, supports Parquet, > not to mention vendor offerings). Third, the use case has expanded from > stats files to other metadata. > > Now that we are looking at introducing more columnar metadata and have not > added an abstraction to support multiple formats in the last year or so, > I'm reconsidering the value of being format-agnostic. The original choice > to use Avro for metadata was to maximize compatibility and I think that > concern is stronger and stronger, given how widely Parquet is supported. If > we are going to keep the surface area small by specifying Parquet for > columnar metadata, then we don't need to do the extra work to add more > abstractions. > > And now after having written that, I think I'm now convinced. +1 for > merging iceberg-parquet into core. > > On Mon, Dec 9, 2024 at 3:21 AM Ajantha Bhat <ajanthab...@gmail.com> wrote: > >> Thanks Dan for the reply. >> >> This is also a good time to consider adding a native parquet read/write >>> path for use in core as the generic path in 'iceberg-data' isn't ideal. >>> Parquet metadata has been brought up in relation to improving stats >>> handling (allowing tracking of more column stats without impacting planning >>> performance), improving stats representations along with other possible >>> benefits like improved compression and scan performance. >> >> >> These two proposals are super interesting. I don't see any public >> discussion on the same. Could you please provide more details or point >> me to the discussions? +1 for moving the parquet module to core if it helps >> for these proposals. >> >> I feel like ORC is a separate discussion and while we may want to include >>> it, I wouldn't say there's a definitive answer unless we know there is >>> adequate investment in it. >> >> Having a module for ORC but not Parquet looks odd. I don't think the >> effort is huge for moving these files. >> I can take it up as well. >> >> I wasn't aware you had a PR as part of the prior discussion, but I'm >>> happy to revisit that if we decide this is a reasonable path forward. >> >> Sure. I can revive my PR. >> >> For those who are looking for previous discussion on the same topic last >> year. >> https://lists.apache.org/thread/8m6f3k7b425czktzf22q902vxgp2y10r >> >> - Ajantha >> >> >> >> On Sat, Dec 7, 2024 at 10:26 AM Daniel Weeks <dwe...@apache.org> wrote: >> >>> Hey Ajantha, >>> >>> I understand it was discussed before, but I think a lot of recent >>> discussions around improvements for parquet metadata/stats/etc is good >>> justification for revisiting the earlier discussion. >>> >>> Parquet metadata has been brought up in relation to improving stats >>> handling (allowing tracking of more column stats without impacting planning >>> performance), improving stats representations along with other possible >>> benefits like improved compression and scan performance. >>> >>> The original decision was more narrowly focused on the stats case and >>> there were viable (though possibly not ideal) workarounds to keep the >>> existing separation of subprojects, but at this point I see this more as a >>> barrier to exploring some of these ideas as it's quite difficult to allow >>> core to work directly with parquet. >>> >>> This is also a good time to consider adding a native parquet read/write >>> path for use in core as the generic path in 'iceberg-data' isn't ideal >>> (this might also be useful for projects like Kafka Connect). >>> >>> I feel like ORC is a separate discussion and while we may want to >>> include it, I wouldn't say there's a definitive answer unless we know there >>> is adequate investment in it. >>> >>> I wasn't aware you had a PR as part of the prior discussion, but I'm >>> happy to revisit that if we decide this is a reasonable path forward. >>> >>> -Dan >>> >>> >>> >>> On Fri, Dec 6, 2024 at 5:31 PM Ajantha Bhat <ajanthab...@gmail.com> >>> wrote: >>> >>>> Hi Dan, >>>> >>>> I proposed the same last year while working on partition stats. >>>> I can revive this PR if required, >>>> https://github.com/apache/iceberg/pull/8500 >>>> >>>> But we decided that `*iceberg-data`* can write these parquet stats >>>> files (metadata) and core can just register it. >>>> So, it is no longer needed for partition stats. >>>> >>>> a) Do we have any strong use case or feature that requires it now? >>>> b) I hope we do the same for ORC as well as it looks odd to have a >>>> module for that? >>>> >>>> - Ajantha >>>> >>>> On Sat, Dec 7, 2024 at 5:22 AM Daniel Weeks <dwe...@apache.org> wrote: >>>> >>>>> Everyone, >>>>> >>>>> I wanted to propose moving the parquet implementation from the >>>>> 'iceberg-parquet' project to the 'iceberg-core' project. >>>>> >>>>> The original motivation for keeping these subprojects separate was due >>>>> to Iceberg relying on avro (which is included in the core project) for >>>>> metadata and keeping other format implementations separate, but as we >>>>> consider adding support for partition stats and parquet metadata, we need >>>>> the ability to read and write parquet from core library. >>>>> >>>>> I've created a draft PR <https://github.com/apache/iceberg/pull/11716> >>>>> of the proposed changes, which relocates relatively cleanly, but wanted to >>>>> discuss whether there are concerns or other considerations for keeping >>>>> them >>>>> separate. >>>>> >>>>> -Dan >>>>> >>>>