Re: [DISCUSS] Spec update to cover compressed JSON metadata files

Ryan Blue Thu, 01 May 2025 08:57:49 -0700

I just commented on the PR, but I don't think that file naming should be
part of the spec requirements. This is informational and there are lots of
ways to determine file compression, including magic bytes and catalog
metadata. There isn't a need for this to be required by the spec so I think
it should be an implementation note at the most.


On Wed, Apr 30, 2025 at 11:20 PM Micah Kornfield <emkornfi...@gmail.com>
wrote:

> I'll plan on starting a vote for the proposed PR tomorrow, unless there
> are any objections.  I look forward to follow-ups on ways we can improve
> compression here.
>
> Thanks,
> Micah
>
> On Tue, Apr 29, 2025 at 10:38 AM Micah Kornfield <emkornfi...@gmail.com>
> wrote:
>
>>
>> I wanted to clarify, as others have pointed out, that the PR documents
>> existing functionality and making changes to it at this point risks
>> breaking clients
>>
>> I think any changes to naming convention would have to be done as part of
>> a new version of the spec (and file system based commits must be completely
>> removed as of that version).
>>
>> I think ZSTD could be useful but that again is a strict improvement out
>> of scope of this PR.
>>
>> Thanks,
>> Micah
>>
>> On Monday, April 28, 2025, Ryan Blue <rdb...@gmail.com> wrote:
>>
>>> It would be great to mention how to determine the compression of the
>>> metadata JSON file in the spec. Thanks for bringing this up. It makes sense
>>> to me to use the file name and get a bit more strict about this.
>>>
>>> That said, we will need to make sure that the current default behavior
>>> is documented and required for anyone using the now-deprecated "hadoop
>>> tables" that used atomic rename to coordinate. The atomic rename commits
>>> only work when all clients are using the exact same path. It's a good thing
>>> that this is deprecated so we can move forward with catalog-based uses.
>>>
>>> Ryan
>>>
>>> On Mon, Apr 28, 2025 at 9:47 AM Kevin Liu <kevinjq...@apache.org> wrote:
>>>
>>>> Thanks for bringing this up Micah!
>>>>
>>>> I think it's better to treat `.json.gz` as the "default" file scheme
>>>> and `.gz.json` as the "legacy".
>>>>
>>>> I agree with the other points brought up here. Across the broader
>>>> ecosystem, I think `.json.gz` is used more often. DuckDB, for example, can
>>>> automatically detect compression at the suffix, `.json.gz`, but not the
>>>> other way around.
>>>> See https://duckdb.org/docs/stable/data/json/loading_json#parameters
>>>>
>>>> Best,
>>>> Kevin Liu
>>>>
>>>>
>>>> On Sun, Apr 27, 2025 at 11:54 PM Fokko Driesprong <fo...@apache.org>
>>>> wrote:
>>>>
>>>>> Hey Micah,
>>>>>
>>>>> For some reason, your email ended up in my spam box 😨
>>>>>
>>>>> There is a reason for everything!
>>>>>
>>>>> .gz.metadata.json is quite uncommon and can't be read by most existing
>>>>>> tools. Would it be better to support .metadata.json.gz and treat
>>>>>> .gz.metadata.json as legacy for backward compatibility?
>>>>>
>>>>>
>>>>> The Java client supports both
>>>>> <https://github.com/apache/iceberg/blob/dc26b72ad016840b79d62bf8a84b7f2109e9b71b/core/src/test/java/org/apache/iceberg/TableMetadataParserCodecTest.java#L29-L40>.
>>>>> I looked into this years ago, and if I recall correctly, it was to
>>>>> bypass the decompressor of Hadoop
>>>>> <https://github.com/apache/iceberg/pull/258/>. Hadoop would detect
>>>>> the .gz and handle all the (de)compression, which we wanted to do
>>>>> ourselves.
>>>>>
>>>>> gzip is becoming increasingly outdated due to its lack of support for
>>>>>> modern CPUs. New algorithms like zstd are gaining popularity, so
>>>>>> should we consider allowing users to use .metadata.json.zst as well?
>>>>>
>>>>>
>>>>> Yes, I think that would make a lot of sense.
>>>>>
>>>>> Kind regards,
>>>>> Fokko
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> Op ma 28 apr 2025 om 08:41 schreef Xuanwo <xua...@apache.org>:
>>>>>
>>>>>> I've copied my comments from GitHub here for a broader discussion:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Hi, I have two concerns about this change:
>>>>>>
>>>>>>    - .gz.metadata.json is quite uncommon and can't be read by most
>>>>>>    existing tools. Would it be better to support .metadata.json.gz and
>>>>>>    treat .gz.metadata.json as legacy for backward compatibility?
>>>>>>    - gzip is becoming increasingly outdated due to its lack of
>>>>>>    support for modern CPUs. New algorithms like zstd are gaining
>>>>>>    popularity, so should we consider allowing users to use
>>>>>>    .metadata.json.zst as well?
>>>>>>
>>>>>>
>>>>>> On Sun, Apr 27, 2025, at 07:36, Micah Kornfield wrote:
>>>>>>
>>>>>> I created https://github.com/apache/iceberg/pull/12598 to document
>>>>>> this feature.  Kevin Liu already took a look, but I would like to get 
>>>>>> more
>>>>>> eyes on it before starting a vote for merging.
>>>>>>
>>>>>> Thanks,
>>>>>> Micah
>>>>>>
>>>>>> Xuanwo
>>>>>>
>>>>>> https://xuanwo.io/
>>>>>>
>>>>>>

Re: [DISCUSS] Spec update to cover compressed JSON metadata files

Reply via email to