[DISCUSS] Spec update to cover compressed JSON metadata files

Micah Kornfield Thu, 01 May 2025 10:09:48 -0700

Hi Ryan,
Thank you for the feedback.  I responded on the PR let's try to close out
the conversation there.


Thanks,
Micah

On Thursday, May 1, 2025, Ryan Blue <[email protected]> wrote:

> I just commented on the PR, but I don't think that file naming should be
> part of the spec requirements. This is informational and there are lots of
> ways to determine file compression, including magic bytes and catalog
> metadata. There isn't a need for this to be required by the spec so I think
> it should be an implementation note at the most.
>
> On Wed, Apr 30, 2025 at 11:20 PM Micah Kornfield <[email protected]>
> wrote:
>
>> I'll plan on starting a vote for the proposed PR tomorrow, unless there
>> are any objections.  I look forward to follow-ups on ways we can improve
>> compression here.
>>
>> Thanks,
>> Micah
>>
>> On Tue, Apr 29, 2025 at 10:38 AM Micah Kornfield <[email protected]>
>> wrote:
>>
>>>
>>> I wanted to clarify, as others have pointed out, that the PR documents
>>> existing functionality and making changes to it at this point risks
>>> breaking clients
>>>
>>> I think any changes to naming convention would have to be done as part
>>> of a new version of the spec (and file system based commits must be
>>> completely removed as of that version).
>>>
>>> I think ZSTD could be useful but that again is a strict improvement out
>>> of scope of this PR.
>>>
>>> Thanks,
>>> Micah
>>>
>>> On Monday, April 28, 2025, Ryan Blue <[email protected]> wrote:
>>>
>>>> It would be great to mention how to determine the compression of the
>>>> metadata JSON file in the spec. Thanks for bringing this up. It makes sense
>>>> to me to use the file name and get a bit more strict about this.
>>>>
>>>> That said, we will need to make sure that the current default behavior
>>>> is documented and required for anyone using the now-deprecated "hadoop
>>>> tables" that used atomic rename to coordinate. The atomic rename commits
>>>> only work when all clients are using the exact same path. It's a good thing
>>>> that this is deprecated so we can move forward with catalog-based uses.
>>>>
>>>> Ryan
>>>>
>>>> On Mon, Apr 28, 2025 at 9:47 AM Kevin Liu <[email protected]>
>>>> wrote:
>>>>
>>>>> Thanks for bringing this up Micah!
>>>>>
>>>>> I think it's better to treat `.json.gz` as the "default" file scheme
>>>>> and `.gz.json` as the "legacy".
>>>>>
>>>>> I agree with the other points brought up here. Across the broader
>>>>> ecosystem, I think `.json.gz` is used more often. DuckDB, for example, can
>>>>> automatically detect compression at the suffix, `.json.gz`, but not the
>>>>> other way around.
>>>>> See https://duckdb.org/docs/stable/data/json/loading_json#parameters
>>>>>
>>>>> Best,
>>>>> Kevin Liu
>>>>>
>>>>>
>>>>> On Sun, Apr 27, 2025 at 11:54 PM Fokko Driesprong <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hey Micah,
>>>>>>
>>>>>> For some reason, your email ended up in my spam box 😨
>>>>>>
>>>>>> There is a reason for everything!
>>>>>>
>>>>>> .gz.metadata.json is quite uncommon and can't be read by most
>>>>>>> existing tools. Would it be better to support .metadata.json.gz and 
>>>>>>> treat
>>>>>>> .gz.metadata.json as legacy for backward compatibility?
>>>>>>
>>>>>>
>>>>>> The Java client supports both
>>>>>> <https://github.com/apache/iceberg/blob/dc26b72ad016840b79d62bf8a84b7f2109e9b71b/core/src/test/java/org/apache/iceberg/TableMetadataParserCodecTest.java#L29-L40>.
>>>>>> I looked into this years ago, and if I recall correctly, it was to
>>>>>> bypass the decompressor of Hadoop
>>>>>> <https://github.com/apache/iceberg/pull/258/>. Hadoop would detect
>>>>>> the .gz and handle all the (de)compression, which we wanted to do
>>>>>> ourselves.
>>>>>>
>>>>>> gzip is becoming increasingly outdated due to its lack of support
>>>>>>> for modern CPUs. New algorithms like zstd are gaining popularity,
>>>>>>> so should we consider allowing users to use .metadata.json.zst as
>>>>>>> well?
>>>>>>
>>>>>>
>>>>>> Yes, I think that would make a lot of sense.
>>>>>>
>>>>>> Kind regards,
>>>>>> Fokko
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> Op ma 28 apr 2025 om 08:41 schreef Xuanwo <[email protected]>:
>>>>>>
>>>>>>> I've copied my comments from GitHub here for a broader discussion:
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Hi, I have two concerns about this change:
>>>>>>>
>>>>>>>    - .gz.metadata.json is quite uncommon and can't be read by most
>>>>>>>    existing tools. Would it be better to support .metadata.json.gz and
>>>>>>>    treat .gz.metadata.json as legacy for backward compatibility?
>>>>>>>    - gzip is becoming increasingly outdated due to its lack of
>>>>>>>    support for modern CPUs. New algorithms like zstd are gaining
>>>>>>>    popularity, so should we consider allowing users to use
>>>>>>>    .metadata.json.zst as well?
>>>>>>>
>>>>>>>
>>>>>>> On Sun, Apr 27, 2025, at 07:36, Micah Kornfield wrote:
>>>>>>>
>>>>>>> I created https://github.com/apache/iceberg/pull/12598 to document
>>>>>>> this feature.  Kevin Liu already took a look, but I would like to get 
>>>>>>> more
>>>>>>> eyes on it before starting a vote for merging.
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Micah
>>>>>>>
>>>>>>> Xuanwo
>>>>>>>
>>>>>>> https://xuanwo.io/
>>>>>>>
>>>>>>>

[DISCUSS] Spec update to cover compressed JSON metadata files

Reply via email to