[Gen-art] Re: draft-ietf-cellar-tags-19 ietf last call Genart review

Steve Lhomme Sun, 19 Oct 2025 03:15:03 -0700

Hello Ines,

Thanks a lot for your review. Replies on your comments can be found inline 
below.

> On 13 Oct 2025, at 17:07, Ines Robles via Datatracker <[email protected]> 
> wrote:
> 
> Document: draft-ietf-cellar-tags
> Title: Matroska Media Container Tag Specifications
> Reviewer: Ines Robles
> Review result: Almost Ready
> 
> I am the assigned Gen-ART reviewer for this draft. The General Area
> Review Team (Gen-ART) reviews all IETF documents being processed
> by the IESG for the IETF Chair.  Please treat these comments just
> like any other last call comments.
> 
> For more information, please see the FAQ at
> 
> <https://wiki.ietf.org/en/group/gen/GenArtFAQ>.
> 
> Document: draft-ietf-cellar-tags-19
> Reviewer: Ines Robles
> Review Date: 2025-10-13
> IETF LC End Date: 2025-10-13
> IESG Telechat date: Not scheduled for a telechat
> 
> Summary:
> 
> This document defines the Matroska multimedia container tags, namely the tag
> names and their respective semantic meaning.
> 
> I have a few comments and questions below that I would appreciate being
> addressed before publication.
> 
> Comments:
> 
> 1- Section 3.2.2, states "Multiple items MUST NOT be stored as a list in a
> single TagString. If there is more than one tag value with the same name to be
> stored, then more than one SimpleTag MUST be used."
> 
> However, several tag definitions (for example, INSTRUMENTS in Section 4.4 and
> KEYWORDS in Section 4.6) explicitly describe values as being “separated by a
> comma.” This wording suggests that multiple items may appear within a single
> TagString, which seems to contradict the rule in Section 3.2.2.
> 
> Could you please clarify whether these tags are intended to be exceptions to
> that rule, or if the text should instead indicate that each value must be
> stored in a separate SimpleTag?

The general idea is that the whole Tag element can be seen as a database. So if 
you’re looking for a particular artist you should just look for ARTIST=someone, 
even the content includes many artists. This is easier that looking for a 
particular string which may or may not have the comma escaped (and we have to 
define how to escape it).

The fact the splitting is mandatory is very new 
https://github.com/ietf-wg-cellar/matroska-specification/pull/1021
And as said in the next paragraph, it’s possible some old tags may have been 
combined, except they did not have proper formatting rules. And the next 
paragraph also mentions that some values may be combined from an other source. 
So although it’s a MUST, it’s a rule that can't be 100% enforced. Maybe we 
should turn it into a recommended ?
It used to be RECOMMENDED but was changed after Orie’s suggestion. See 
https://mailarchive.ietf.org/arch/msg/cellar/slnYqYQ2sy1w48lA3ENyuvnoYWA/
Maybe we went too far here.

INSTRUMENTS and KEYWORDS have always had the separator properly defined so they 
should be allowed, although it’s not ideal for parsing if the comma is supposed 
to be in the name of an instrument or a keyword (very unlikely).

I made a proposal to go back to Recommended, even for instruments and keywords: 
https://github.com/ietf-wg-cellar/matroska-specification/pull/1030

> 2- Section 3.3: In Table 2 (“TargetTypeValue for Video”), the draft lists 
> MOVIE
> / EPISODE / CONCERT and describes them as “the most common grouping level of
> video (e.g., an episode for a TV series).” This correctly indicates that movie
> is intended as a representative example.
> 
> However, in the document, several tag descriptions (e.g., DIRECTOR, ACTOR,
> LAW_RATING, etc.) refer specifically to “a movie.”
> 
> For precision and inclusivity, these occurrences should be generalized, since
> the tagging system applies to any audiovisual work; including films, 
> television
> episodes, animated content, image-based sequences, podcasts, concerts, or 
> other
> recorded video content.
> 
> It is therefore suggested to replace movie with a broader term such as video
> work, video content, or audiovisual work, as appropriate to the context.
> 
> What do you think?

The term “MOVIE” is value that has been defined as such for a very long time.
It is defined in RFC9559 section 5.1.8.1.1.2 
(https://www.rfc-editor.org/rfc/rfc9559#section-5.1.8.1.1.2).
So changing it is not an option now. It is possible to add new values via a 
document update. However the TargetType element is just marked as 
informational. It’s mostly there to give more sense to the TargetTypeValue 
element which is mandatory.

Although the list of values is restricted in RFC9559 there is no IANA registry 
defined to add more values.

> 3- Section 3.3, states: “Tags from a TargetTypeValue apply to the all lower
> TargetTypeValues.”
> 
> It is not always clear whether “lower” refers to numerically smaller values or
> to semantically subordinate entities. It is implicit that smaller numbers
> indicate lower levels in the hierarchy; however, the current wording could
> confuse newcomers.
> 
> What about to add a clarification such as:
> 
> “A tag defined for a given TargetTypeValue applies to all Targets with
> numerically smaller TargetTypeValues in the same hierarchy, that is, from
> higher-level groups to lower-level entities.”
> 
> What do you think?

Yes, it’s clearer. I proposed your changes here (with a slight modification): 
https://github.com/ietf-wg-cellar/matroska-specification/pull/1031

> 4- Section 3.3 defines TargetTypeValue and provides two tables: Table 1 for
> audio and Table 2 for video. Both tables list the same numeric values (e.g.,
> 50, 40, 30, etc.) but associate them with different semantic examples. For
> instance, Table 1 maps 50 to Album, while Table 2 maps 50 to Movie / Episode /
> Concert.
> 
> It would be helpful to clarify whether these tables represent one shared
> TargetTypeValue numbering system that applies to all media types (where the
> numbers define structural hierarchy levels, and the examples simply illustrate
> common use cases for each media type), or two independent numbering systems
> (one for audio and one for video) that happen to reuse the same numeric values
> for different purposes.
> 
> For example, how should this be interpreted in a Matroska file that contains
> both audio and video streams, such as a concert film?

As said above, the TargetType is an informational string for human readability. 
If you strip those, the TargetTypeValue remains (and is mandatory).
In that case you can’t tell whether the described content is a concert or a 
movie or an album. If a TagTrackUUID 
(https://www.rfc-editor.org/rfc/rfc9559#name-tagtrackuid-element) 
is used then you can tell if this is audio or video content you’re describing 
but that’s the only case.

So from your question, the tables do represent one shared TargetTypeValue 
numbering system that applies to all media types.

> 5- Section 3.3.1: The current description of PART_OFFSET (“... which is the
> number of tracks on the first CD”) correctly implies that it represents a
> cumulative or absolute offset, i.e., the number of lower-level items that
> precede the current group in the overall collection. To avoid potential
> misinterpretation as a relative (per-disc) offset, it might be clearer to
> rephrase to something like:
> 
> “PART_OFFSET, at TargetTypeValue 30 (TRACK), represents the number of
> lower-level items that precede the current group in the overall collection. 
> For
> example, if CD 1 contains 5 tracks, then the first track of CD 2 has
> PART_OFFSET = 5.”
> 
> What do you think?

I don’t really understand how it can be misinterpreted. Also this new/different 
example is less clear IMO because it defines the PART_OFFSET for the first 
track of the second CD. But the PART_OFFSET is always the same value for all 
tracks of the second CD. That’s why in the current example the third track of 
the second CD is described, to really show that the PART_OFFSET only depends on 
the number of tracks on the first CD.

> 6- Section 4.10: It appears to be an inconsistent treatment of numeric tags
> with respect to their encoding type.
> 
> For example: The EBU_R128_* tags (e.g., EBU_R128_LOUDNESS) are defined as
> binary and store floating-point values in <TagBinary>. The REPLAYGAIN_* tags
> (e.g., REPLAYGAIN_GAIN, REPLAYGAIN_PEAK) represent similar floating-point
> values but are defined as UTF-8 strings in <TagString>. This means that two
> groups of tags describing essentially the same kind of data (gain/loudness
> values in dB or LUFS) are stored using different data types.

This document defines already existing tags. The REPLAYGAIN ones have been 
there for a long time
and have been defined as strings which can sometimes have a unit. 
The EBU_R128 tags are new and we realized parsing what is just a floating 
number (which is strict unit) would be easier and safer handled a floating 
point in a binary blob.

So it’s just for historical reasons. The new one being better and easier to 
handle.

> 6.1- Could you please clarify whether this distinction is intentional (for
> example, due to backward compatibility) or whether a consistent approach is
> intended?

Yes, totally intentional.

> 6.2- It might be helpful to include a short explanatory note in Section 4.10
> such as "..ReplayGain tags retain textual representation for compatibility 
> with
> legacy implementations, whereas EBU R128 tags use binary floats for higher
> precision..."?

OK. I proposed some explanation and recommendation in 
https://github.com/ietf-wg-cellar/matroska-specification/pull/1032

> 6.3- Additionally, it may be useful to provide brief guidance for future tag
> definitions on when to prefer binary versus textual representation for numeric
> values. For example, recommending binary floats for precision-critical
> engineering data, and UTF-8 strings for human-readable or legacy-compatible
> values. This would help ensure consistent design choices in future extensions.

We may not want to go too deep in explanation on why binary is better. We can’t 
be sure
this is always the case. I think saying the binary format is “stricter” and 
“RECOMMENDED” should do it.

> 7- Section 5, states: "Most of the time strings are kept as-is and don't pose 
> a
> security issue, apart from invalid UTF-8 values."
> 
> While the mention of “invalid UTF-8 values” is helpful, this phrasing might
> still understate the potential risk. Implementations that handle TagStrings
> without proper UTF-8 validation or size checks could encounter parsing errors,
> crashes, or buffer overruns if presented with malformed or excessively large
> input data. It may be useful to add a clarifying sentence such as:
> 
> "Implementations MUST validate TagString inputs for UTF-8 correctness and
> reasonable length before use, in accordance with the security considerations 
> in
> [RFC 3629]"
> 
> What do you think?

Section 5 also mentions it inherits Security Considerations from RFC 8794 and 
RFC 9559.
UTF-8 is somehow covered by EBML (RFC 8794) 
https://www.rfc-editor.org/rfc/rfc8794.html#name-security-considerations

It does mention out of bounds and buffer overflow issues. But additional 
concern and sources could be used.
I proposed your addition in 
https://github.com/ietf-wg-cellar/matroska-specification/pull/1033

> 8- The draft describes how multiple SimpleTag elements may appear under the
> same Tag element, allowing multiple values for the same tag name.
> 
> However, how should applications interpret or prioritize these values if
> conflicting tags occur. For example, two TITLE tags with different TagString
> values within the same Targets element?

It’s implied by the 3.2.2 section on splitting. Multiple values are split with 
the same tag name (eg TITLE) which a single value in each.
Maybe it can be more explicitly said that in that case the values “add up” to 
the list of values. In other words they don’t conflict. They all apply.

> Nits:
> 
> 9- choregrapher → choreographer

OK

> 10- the values is stored → the value is stored

OK

> 11- parts that are inside or outside a given file → ambiguous. Consider
> clarifying to something like: “parts located either within or externally
> referenced by a given file” ?

I used the within word and updated the following example to clarify what we 
mean by “outside”:

The `TargetTypeValue` element allows tagging of different parts that are within 
or outside a
given file. For example, in an audio file with one song you could have 
information about
the album it comes from the CD set even if the whole CD set is not found in the 
file. 

> 12- Due to the various nature of tag sources → Due to the varied nature of tag
> sources

OK

> 13- each demand needs to balance if it makes sense… → each request needs to be
> evaluated to determine if it makes sense…

OK

> 14- an host app → a host app

OK

> 15- A Tag element has a single Targets element with a single TargetTypeValue
> element. But the Targets element… → replace “But..” with “However,...”

OK

> 16- It is RECOMMENDED to start a tag name… → It is RECOMMENDED that tag names
> start…

OK

> 17- for non official tags than are not meant to make it to the list… → for
> non-official tags that are not meant to be added to the list of official 
> tags...

OK

> 18- apply to the all lower TargetTypeValues → “…apply to all lower
> TargetTypeValues..”

OK.

All nit changes have been proposed in 
https://github.com/ietf-wg-cellar/matroska-specification/pull/1034

> Thanks for this document,

Thanks for the review!
Steve

> Ines.
> 
> 
>

_______________________________________________
Gen-art mailing list -- [email protected]
To unsubscribe send an email to [email protected]

[Gen-art] Re: draft-ietf-cellar-tags-19 ietf last call Genart review

Reply via email to