I'll plan on starting a vote for the proposed PR tomorrow, unless there are any objections. I look forward to follow-ups on ways we can improve compression here.
Thanks,
Micah

On Tue, Apr 29, 2025 at 10:38 AM Micah Kornfield <emkornfi...@gmail.com> wrote:
>
> I wanted to clarify, as others have pointed out, that the PR documents
> existing functionality and making changes to it at this point risks
> breaking clients
>
> I think any changes to naming convention would have to be done as part of
> a new version of the spec (and file system based commits must be completely
> removed as of that version).
>
> I think ZSTD could be useful but that again is a strict improvement out of
> scope of this PR.
>
> Thanks,
> Micah
>
> On Monday, April 28, 2025, Ryan Blue <rdb...@gmail.com> wrote:
>
>> It would be great to mention how to determine the compression of the
>> metadata JSON file in the spec. Thanks for bringing this up. It makes sense
>> to me to use the file name and get a bit more strict about this.
>>
>> That said, we will need to make sure that the current default behavior is
>> documented and required for anyone using the now-deprecated "hadoop tables"
>> that used atomic rename to coordinate. The atomic rename commits only work
>> when all clients are using the exact same path. It's a good thing that this
>> is deprecated so we can move forward with catalog-based uses.
>>
>> Ryan
>>
>> On Mon, Apr 28, 2025 at 9:47 AM Kevin Liu <kevinjq...@apache.org> wrote:
>>
>>> Thanks for bringing this up Micah!
>>>
>>> I think it's better to treat `.json.gz` as the "default" file scheme and
>>> `.gz.json` as the "legacy".
>>>
>>> I agree with the other points brought up here. Across the broader
>>> ecosystem, I think `.json.gz` is used more often. DuckDB, for example, can
>>> automatically detect compression at the suffix, `.json.gz`, but not the
>>> other way around.
>>> See https://duckdb.org/docs/stable/data/json/loading_json#parameters
>>>
>>> Best,
>>> Kevin Liu
>>>
>>> On Sun, Apr 27, 2025 at 11:54 PM Fokko Driesprong <fo...@apache.org>
>>> wrote:
>>>
>>>> Hey Micah,
>>>>
>>>> For some reason, your email ended up in my spam box 😨
>>>>
>>>> There is a reason for everything!
>>>>
>>>>> .gz.metadata.json is quite uncommon and can't be read by most existing
>>>>> tools. Would it be better to support .metadata.json.gz and treat
>>>>> .gz.metadata.json as legacy for backward compatibility?
>>>>
>>>> The Java client supports both
>>>> <https://github.com/apache/iceberg/blob/dc26b72ad016840b79d62bf8a84b7f2109e9b71b/core/src/test/java/org/apache/iceberg/TableMetadataParserCodecTest.java#L29-L40>.
>>>> I looked into this years ago, and if I recall correctly, it was to
>>>> bypass the decompressor of Hadoop
>>>> <https://github.com/apache/iceberg/pull/258/>. Hadoop would detect the
>>>> .gz and handle all the (de)compression, which we wanted to do
>>>> ourselves.
>>>>
>>>>> gzip is becoming increasingly outdated due to its lack of support for
>>>>> modern CPUs. New algorithms like zstd are gaining popularity, so
>>>>> should we consider allowing users to use .metadata.json.zst as well?
>>>>
>>>> Yes, I think that would make a lot of sense.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>> On Mon, Apr 28, 2025 at 08:41, Xuanwo <xua...@apache.org> wrote:
>>>>
>>>>> I've copied my comments from GitHub here for a broader discussion:
>>>>>
>>>>> Hi, I have two concerns about this change:
>>>>>
>>>>>    - .gz.metadata.json is quite uncommon and can't be read by most
>>>>>    existing tools. Would it be better to support .metadata.json.gz and
>>>>>    treat .gz.metadata.json as legacy for backward compatibility?
>>>>>    - gzip is becoming increasingly outdated due to its lack of
>>>>>    support for modern CPUs.
>>>>>    New algorithms like zstd are gaining
>>>>>    popularity, so should we consider allowing users to use
>>>>>    .metadata.json.zst as well?
>>>>>
>>>>> On Sun, Apr 27, 2025, at 07:36, Micah Kornfield wrote:
>>>>>
>>>>> I created https://github.com/apache/iceberg/pull/12598 to document
>>>>> this feature. Kevin Liu already took a look, but I would like to get more
>>>>> eyes on it before starting a vote for merging.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> Xuanwo
>>>>>
>>>>> https://xuanwo.io/
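
For readers following the naming discussion in this thread, here is a minimal sketch of how a client might infer the metadata file's compression from its file name, as Ryan suggests. It is not taken from the Iceberg codebase: the function name `detect_metadata_codec` and the returned labels are hypothetical, and the `.metadata.json.zst` branch only illustrates the zstd suggestion raised here, which is described in the thread as out of scope for the PR.

```python
def detect_metadata_codec(file_name: str) -> str:
    """Guess the compression of a table metadata file from its name.

    Hypothetical helper for illustration only; the labels "gzip",
    "zstd" and "none" are assumptions, not names from the spec.
    """
    # Convention discussed in the thread as existing behavior:
    # <name>.gz.metadata.json. Checked first, because such a name
    # also ends with ".metadata.json".
    if file_name.endswith(".gz.metadata.json"):
        return "gzip"
    # Suffix-style naming that several participants would prefer.
    if file_name.endswith(".metadata.json.gz"):
        return "gzip"
    # Suggested in the thread only; not documented behavior.
    if file_name.endswith(".metadata.json.zst"):
        return "zstd"
    if file_name.endswith(".metadata.json"):
        return "none"
    raise ValueError(f"not a recognized metadata file name: {file_name}")
```

Checking the `.gz.metadata.json` form before the plain `.metadata.json` form matters, since the plain suffix matches both.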