I'll plan on starting a vote for the proposed PR tomorrow, unless there are
any objections.  I look forward to follow-ups on ways we can improve
compression here.

Thanks,
Micah

On Tue, Apr 29, 2025 at 10:38 AM Micah Kornfield <emkornfi...@gmail.com>
wrote:

>
> I wanted to clarify, as others have pointed out, that the PR documents
> existing functionality, and making changes to it at this point risks
> breaking clients.
>
> I think any changes to the naming convention would have to be done as part
> of a new version of the spec (and file-system-based commits must be
> completely removed as of that version).
>
> I think ZSTD could be useful, but that again is a strict improvement that
> is out of scope for this PR.
>
> Thanks,
> Micah
>
> On Monday, April 28, 2025, Ryan Blue <rdb...@gmail.com> wrote:
>
>> It would be great to mention how to determine the compression of the
>> metadata JSON file in the spec. Thanks for bringing this up. It makes sense
>> to me to use the file name and get a bit more strict about this.
>>
>> That said, we will need to make sure that the current default behavior is
>> documented and required for anyone using the now-deprecated "hadoop tables"
>> that used atomic rename to coordinate. The atomic rename commits only work
>> when all clients are using the exact same path. It's a good thing that this
>> is deprecated so we can move forward with catalog-based uses.
>>
>> Ryan
>>
>> On Mon, Apr 28, 2025 at 9:47 AM Kevin Liu <kevinjq...@apache.org> wrote:
>>
>>> Thanks for bringing this up Micah!
>>>
>>> I think it's better to treat `.json.gz` as the "default" file scheme and
>>> `.gz.json` as the "legacy".
>>>
>>> I agree with the other points brought up here. Across the broader
>>> ecosystem, I think `.json.gz` is used more often. DuckDB, for example, can
>>> automatically detect compression from the `.json.gz` suffix, but not the
>>> other way around.
>>> See https://duckdb.org/docs/stable/data/json/loading_json#parameters
>>>
>>> Best,
>>> Kevin Liu
>>>
>>>
>>> On Sun, Apr 27, 2025 at 11:54 PM Fokko Driesprong <fo...@apache.org>
>>> wrote:
>>>
>>>> Hey Micah,
>>>>
>>>> For some reason, your email ended up in my spam box 😨
>>>>
>>>> There is a reason for everything!
>>>>
>>>>> .gz.metadata.json is quite uncommon and can't be read by most existing
>>>>> tools. Would it be better to support .metadata.json.gz and treat
>>>>> .gz.metadata.json as legacy for backward compatibility?
>>>>
>>>>
>>>> The Java client supports both
>>>> <https://github.com/apache/iceberg/blob/dc26b72ad016840b79d62bf8a84b7f2109e9b71b/core/src/test/java/org/apache/iceberg/TableMetadataParserCodecTest.java#L29-L40>.
>>>> I looked into this years ago, and if I recall correctly, it was to
>>>> bypass the decompressor of Hadoop
>>>> <https://github.com/apache/iceberg/pull/258/>. Hadoop would detect the
>>>> .gz and handle all the (de)compression, which we wanted to do
>>>> ourselves.
>>>>
>>>>> gzip is becoming increasingly outdated due to its lack of support for
>>>>> modern CPUs. New algorithms like zstd are gaining popularity, so
>>>>> should we consider allowing users to use .metadata.json.zst as well?
>>>>
>>>>
>>>> Yes, I think that would make a lot of sense.
>>>>
>>>> Kind regards,
>>>> Fokko
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Apr 28, 2025 at 08:41, Xuanwo <xua...@apache.org> wrote:
>>>>
>>>>> I've copied my comments from GitHub here for a broader discussion:
>>>>>
>>>>>
>>>>>
>>>>> Hi, I have two concerns about this change:
>>>>>
>>>>>    - .gz.metadata.json is quite uncommon and can't be read by most
>>>>>    existing tools. Would it be better to support .metadata.json.gz and
>>>>>    treat .gz.metadata.json as legacy for backward compatibility?
>>>>>    - gzip is becoming increasingly outdated due to its lack of
>>>>>    support for modern CPUs. New algorithms like zstd are gaining
>>>>>    popularity, so should we consider allowing users to use
>>>>>    .metadata.json.zst as well?
>>>>>
>>>>>
>>>>> On Sun, Apr 27, 2025, at 07:36, Micah Kornfield wrote:
>>>>>
>>>>> I created https://github.com/apache/iceberg/pull/12598 to document
>>>>> this feature.  Kevin Liu already took a look, but I would like to get more
>>>>> eyes on it before starting a vote for merging.
>>>>>
>>>>> Thanks,
>>>>> Micah
>>>>>
>>>>> Xuanwo
>>>>>
>>>>> https://xuanwo.io/
>>>>>
>>>>>
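For readers following the naming discussion above, here is a minimal sketch of how a client might infer metadata compression purely from the file name, accepting both the `.gz.metadata.json` form and the `.metadata.json.gz` form mentioned in the thread. This is an illustration only, not the Java client's implementation and not spec text; the names `detect_codec` and `parse_metadata` and the example file names are hypothetical.

```python
import gzip
import json

# Suffix shared by all table metadata files discussed in the thread.
METADATA_SUFFIX = ".metadata.json"


def detect_codec(file_name: str) -> str:
    """Return "gzip" or "none" based only on the file name (hypothetical helper)."""
    # Form that most external tools recognize: <name>.metadata.json.gz
    if file_name.endswith(METADATA_SUFFIX + ".gz"):
        return "gzip"
    if file_name.endswith(METADATA_SUFFIX):
        # Form discussed in the thread: <name>.gz.metadata.json
        stem = file_name[: -len(METADATA_SUFFIX)]
        return "gzip" if stem.endswith(".gz") else "none"
    raise ValueError(f"{file_name!r} is not a metadata file name")


def parse_metadata(file_name: str, raw: bytes) -> dict:
    """Decompress the bytes if the name says so, then parse the JSON."""
    if detect_codec(file_name) == "gzip":
        raw = gzip.decompress(raw)
    return json.loads(raw.decode("utf-8"))
```

A reader built this way never has to sniff the file contents, which is the property the thread asks the spec to state. Adding another codec such as zstd would mean one more suffix check, but as noted above that is outside the scope of the PR.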
