Hi Ryan, Thank you for the feedback. I responded on the PR let's try to close out the conversation there.
Thanks, Micah On Thursday, May 1, 2025, Ryan Blue <rdb...@gmail.com> wrote: > I just commented on the PR, but I don't think that file naming should be > part of the spec requirements. This is informational and there are lots of > ways to determine file compression, including magic bytes and catalog > metadata. There isn't a need for this to be required by the spec so I think > it should be an implementation note at the most. > > On Wed, Apr 30, 2025 at 11:20 PM Micah Kornfield <emkornfi...@gmail.com> > wrote: > >> I'll plan on starting a vote for the proposed PR tomorrow, unless there >> are any objections. I look forward to follow-ups on ways we can improve >> compression here. >> >> Thanks, >> Micah >> >> On Tue, Apr 29, 2025 at 10:38 AM Micah Kornfield <emkornfi...@gmail.com> >> wrote: >> >>> >>> I wanted to clarify, as others have pointed out, that the PR documents >>> existing functionality and making changes to it at this point risks >>> breaking clients >>> >>> I think any changes to naming convention would have to be done as part >>> of a new version of the spec (and file system based commits must be >>> completely removed as of that version). >>> >>> I think ZSTD could be useful but that again is a strict improvement out >>> of scope of this PR. >>> >>> Thanks, >>> Micah >>> >>> On Monday, April 28, 2025, Ryan Blue <rdb...@gmail.com> wrote: >>> >>>> It would be great to mention how to determine the compression of the >>>> metadata JSON file in the spec. Thanks for bringing this up. It makes sense >>>> to me to use the file name and get a bit more strict about this. >>>> >>>> That said, we will need to make sure that the current default behavior >>>> is documented and required for anyone using the now-deprecated "hadoop >>>> tables" that used atomic rename to coordinate. The atomic rename commits >>>> only work when all clients are using the exact same path. It's a good thing >>>> that this is deprecated so we can move forward with catalog-based uses. >>>> >>>> Ryan >>>> >>>> On Mon, Apr 28, 2025 at 9:47 AM Kevin Liu <kevinjq...@apache.org> >>>> wrote: >>>> >>>>> Thanks for bringing this up Micah! >>>>> >>>>> I think it's better to treat `.json.gz` as the "default" file scheme >>>>> and `.gz.json` as the "legacy". >>>>> >>>>> I agree with the other points brought up here. Across the broader >>>>> ecosystem, I think `.json.gz` is used more often. DuckDB, for example, can >>>>> automatically detect compression at the suffix, `.json.gz`, but not the >>>>> other way around. >>>>> See https://duckdb.org/docs/stable/data/json/loading_json#parameters >>>>> >>>>> Best, >>>>> Kevin Liu >>>>> >>>>> >>>>> On Sun, Apr 27, 2025 at 11:54 PM Fokko Driesprong <fo...@apache.org> >>>>> wrote: >>>>> >>>>>> Hey Micah, >>>>>> >>>>>> For some reason, your email ended up in my spam box 😨 >>>>>> >>>>>> There is a reason for everything! >>>>>> >>>>>> .gz.metadata.json is quite uncommon and can't be read by most >>>>>>> existing tools. Would it be better to support .metadata.json.gz and >>>>>>> treat >>>>>>> .gz.metadata.json as legacy for backward compatibility? >>>>>> >>>>>> >>>>>> The Java client supports both >>>>>> <https://github.com/apache/iceberg/blob/dc26b72ad016840b79d62bf8a84b7f2109e9b71b/core/src/test/java/org/apache/iceberg/TableMetadataParserCodecTest.java#L29-L40>. >>>>>> I looked into this years ago, and if I recall correctly, it was to >>>>>> bypass the decompressor of Hadoop >>>>>> <https://github.com/apache/iceberg/pull/258/>. Hadoop would detect >>>>>> the .gz and handle all the (de)compression, which we wanted to do >>>>>> ourselves. >>>>>> >>>>>> gzip is becoming increasingly outdated due to its lack of support >>>>>>> for modern CPUs. New algorithms like zstd are gaining popularity, >>>>>>> so should we consider allowing users to use .metadata.json.zst as >>>>>>> well? >>>>>> >>>>>> >>>>>> Yes, I think that would make a lot of sense. >>>>>> >>>>>> Kind regards, >>>>>> Fokko >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> Op ma 28 apr 2025 om 08:41 schreef Xuanwo <xua...@apache.org>: >>>>>> >>>>>>> I've copied my comments from GitHub here for a broader discussion: >>>>>>> >>>>>>> >>>>>>> >>>>>>> Hi, I have two concerns about this change: >>>>>>> >>>>>>> - .gz.metadata.json is quite uncommon and can't be read by most >>>>>>> existing tools. Would it be better to support .metadata.json.gz and >>>>>>> treat .gz.metadata.json as legacy for backward compatibility? >>>>>>> - gzip is becoming increasingly outdated due to its lack of >>>>>>> support for modern CPUs. New algorithms like zstd are gaining >>>>>>> popularity, so should we consider allowing users to use >>>>>>> .metadata.json.zst as well? >>>>>>> >>>>>>> >>>>>>> On Sun, Apr 27, 2025, at 07:36, Micah Kornfield wrote: >>>>>>> >>>>>>> I created https://github.com/apache/iceberg/pull/12598 to document >>>>>>> this feature. Kevin Liu already took a look, but I would like to get >>>>>>> more >>>>>>> eyes on it before starting a vote for merging. >>>>>>> >>>>>>> Thanks, >>>>>>> Micah >>>>>>> >>>>>>> Xuanwo >>>>>>> >>>>>>> https://xuanwo.io/ >>>>>>> >>>>>>>