Document: draft-ietf-cellar-tags
Title: Matroska Media Container Tag Specifications
Reviewer: Ines Robles
Review result: Almost Ready

I am the assigned Gen-ART reviewer for this draft. The General Area
Review Team (Gen-ART) reviews all IETF documents being processed
by the IESG for the IETF Chair.  Please treat these comments just
like any other last call comments.

For more information, please see the FAQ at

<https://wiki.ietf.org/en/group/gen/GenArtFAQ>.

Document: draft-ietf-cellar-tags-19
Review Date: 2025-10-13
IETF LC End Date: 2025-10-13
IESG Telechat date: Not scheduled for a telechat

Summary:

This document defines the Matroska multimedia container tags, namely the tag
names and their respective semantic meaning.

I have a few comments and questions below that I would appreciate being
addressed before publication.

Comments:

1- Section 3.2.2 states "Multiple items MUST NOT be stored as a list in a
single TagString. If there is more than one tag value with the same name to be
stored, then more than one SimpleTag MUST be used."

However, several tag definitions (for example, INSTRUMENTS in Section 4.4 and
KEYWORDS in Section 4.6) explicitly describe values as being “separated by a
comma.” This wording suggests that multiple items may appear within a single
TagString, which seems to contradict the rule in Section 3.2.2.

Could you please clarify whether these tags are intended to be exceptions to
that rule, or if the text should instead indicate that each value must be
stored in a separate SimpleTag?
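
To illustrate the second reading (one value per SimpleTag), here is a minimal
Python sketch. The SimpleTag class and the split_into_simple_tags helper are
hypothetical names of mine, not anything defined by the draft, and the sketch
assumes a comma never occurs inside a single value, which is exactly the kind
of ambiguity a list-in-a-string encoding would raise.

```python
from dataclasses import dataclass


@dataclass
class SimpleTag:
    """Hypothetical stand-in for a SimpleTag: a tag name plus one TagString."""
    name: str
    value: str


def split_into_simple_tags(name: str, raw: str) -> list[SimpleTag]:
    """Expand a comma-separated string into one SimpleTag per item.

    Assumption: commas are only separators and never part of a value.
    """
    items = [item.strip() for item in raw.split(",")]
    return [SimpleTag(name, item) for item in items if item]


# A single comma-separated KEYWORDS string becomes three SimpleTag entries.
tags = split_into_simple_tags("KEYWORDS", "live, jazz, 1959")
```

With this reading, each entry carries exactly one value, so the rule in
Section 3.2.2 holds without exceptions.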

2- Section 3.3: In Table 2 (“TargetTypeValue for Video”), the draft lists MOVIE
/ EPISODE / CONCERT and describes them as “the most common grouping level of
video (e.g., an episode for a TV series).” This correctly indicates that movie
is intended as a representative example.

However, in the document, several tag descriptions (e.g., DIRECTOR, ACTOR,
LAW_RATING, etc.) refer specifically to “a movie.”

For precision and inclusivity, these occurrences should be generalized, since
the tagging system applies to any audiovisual work, including films, television
episodes, animated content, image-based sequences, podcasts, concerts, and
other recorded video content.

It is therefore suggested to replace movie with a broader term such as video
work, video content, or audiovisual work, as appropriate to the context.

What do you think?

3- Section 3.3 states: “Tags from a TargetTypeValue apply to the all lower
TargetTypeValues.”

It is not always clear whether “lower” refers to numerically smaller values or
to semantically subordinate entities. It is implicit that smaller numbers
indicate lower levels in the hierarchy; however, the current wording could
confuse newcomers.

What about adding a clarification such as:

“A tag defined for a given TargetTypeValue applies to all Targets with
numerically smaller TargetTypeValues in the same hierarchy, that is, from
higher-level groups to lower-level entities.”

What do you think?
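
To show the numeric reading I have in mind, here is a minimal Python sketch.
The effective_tags helper and the example values are hypothetical, and letting
the more specific level win when the same tag name appears at two levels is my
own assumption, not something the draft states.

```python
def effective_tags(tags_by_level: dict[int, dict[str, str]],
                   level: int) -> dict[str, str]:
    """Collect the tags that apply to a target at the given TargetTypeValue.

    A tag set at TargetTypeValue N is taken to apply to every target whose
    TargetTypeValue is numerically <= N.
    """
    merged: dict[str, str] = {}
    # Walk from the highest level down so that lower (more specific) levels
    # overwrite inherited values. This override policy is an assumption.
    for lvl in sorted(tags_by_level, reverse=True):
        if lvl >= level:
            merged.update(tags_by_level[lvl])
    return merged


# Example: 50 (ALBUM) and 30 (TRACK) from the audio table.
tags_by_level = {
    50: {"ARTIST": "Example Artist", "TITLE": "Example Album"},
    30: {"TITLE": "Example Track"},
}
track_view = effective_tags(tags_by_level, 30)
album_view = effective_tags(tags_by_level, 50)
```

Here the track inherits ARTIST from the album level, while the album does not
pick up anything from the track level.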

4- Section 3.3 defines TargetTypeValue and provides two tables: Table 1 for
audio and Table 2 for video. Both tables list the same numeric values (e.g.,
50, 40, 30, etc.) but associate them with different semantic examples. For
instance, Table 1 maps 50 to Album, while Table 2 maps 50 to Movie / Episode /
Concert.

It would be helpful to clarify whether these tables represent one shared
TargetTypeValue numbering system that applies to all media types (where the
numbers define structural hierarchy levels, and the examples simply illustrate
common use cases for each media type), or two independent numbering systems
(one for audio and one for video) that happen to reuse the same numeric values
for different purposes.

For example, how should this be interpreted in a Matroska file that contains
both audio and video streams, such as a concert film?

5- Section 3.3.1: The current description of PART_OFFSET (“... which is the
number of tracks on the first CD”) correctly implies that it represents a
cumulative or absolute offset, i.e., the number of lower-level items that
precede the current group in the overall collection. To avoid potential
misinterpretation as a relative (per-disc) offset, it might be clearer to
rephrase to something like:

“PART_OFFSET, at TargetTypeValue 30 (TRACK), represents the number of
lower-level items that precede the current group in the overall collection. For
example, if CD 1 contains 5 tracks, then the first track of CD 2 has
PART_OFFSET = 5.”

What do you think?
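
The cumulative reading can be sketched in a few lines of Python. The
part_offsets helper is a hypothetical name, and the track counts are made-up
example data.

```python
from itertools import accumulate


def part_offsets(track_counts: list[int]) -> list[int]:
    """Return, for each disc, how many tracks precede its first track.

    track_counts[i] is the number of tracks on disc i (0-based).
    """
    if not track_counts:
        return []
    # Running totals give the number of tracks up to and including each disc;
    # shifting them by one disc gives the number of tracks before each disc.
    running_totals = list(accumulate(track_counts))
    return [0] + running_totals[:-1]


# CD 1 has 5 tracks, CD 2 has 7, CD 3 has 4.
offsets = part_offsets([5, 7, 4])
```

Under this reading the first track of CD 2 has an offset of 5 and the first
track of CD 3 has an offset of 12, rather than each disc restarting from 0.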

6- Section 4.10: There appears to be inconsistent treatment of numeric tags
with respect to their encoding type.

For example: The EBU_R128_* tags (e.g., EBU_R128_LOUDNESS) are defined as
binary and store floating-point values in <TagBinary>. The REPLAYGAIN_* tags
(e.g., REPLAYGAIN_GAIN, REPLAYGAIN_PEAK) represent similar floating-point
values but are defined as UTF-8 strings in <TagString>. This means that two
groups of tags describing essentially the same kind of data (gain/loudness
values in dB or LUFS) are stored using different data types.

6.1- Could you please clarify whether this distinction is intentional (for
example, due to backward compatibility) or whether a consistent approach is
intended?

6.2- It might be helpful to include a short explanatory note in Section 4.10,
such as: "... ReplayGain tags retain textual representation for compatibility
with legacy implementations, whereas EBU R128 tags use binary floats for higher
precision ...".

6.3- Additionally, it may be useful to provide brief guidance for future tag
definitions on when to prefer binary versus textual representation for numeric
values. For example, recommending binary floats for precision-critical
engineering data, and UTF-8 strings for human-readable or legacy-compatible
values. This would help ensure consistent design choices in future extensions.
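
To make the trade-off concrete, here is a minimal Python sketch of the two
representations. The byte layout (a big-endian IEEE 754 64-bit float) and the
two-decimal text format are assumptions of mine for illustration only; the
draft is the authority on the actual encodings, and the function names are
hypothetical.

```python
import struct


def encode_binary(value: float) -> bytes:
    """Pack a float as 8 bytes (assumed layout: big-endian IEEE 754 double)."""
    return struct.pack(">d", value)


def decode_binary(payload: bytes) -> float:
    """Unpack the 8-byte payload produced by encode_binary."""
    (value,) = struct.unpack(">d", payload)
    return value


def encode_text(value: float) -> str:
    """Render a float as text with two decimals (assumed format)."""
    return f"{value:.2f}"


def decode_text(text: str) -> float:
    """Parse the textual form back into a float."""
    return float(text)


loudness = -23.0449
binary_round_trip = decode_binary(encode_binary(loudness))
text_round_trip = decode_text(encode_text(loudness))
```

The binary form round-trips the value exactly, while the fixed-precision text
form is readable but drops digits beyond the chosen precision.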

7- Section 5 states: "Most of the time strings are kept as-is and don't pose a
security issue, apart from invalid UTF-8 values."

While the mention of “invalid UTF-8 values” is helpful, this phrasing might
still understate the potential risk. Implementations that handle TagStrings
without proper UTF-8 validation or size checks could encounter parsing errors,
crashes, or buffer overruns if presented with malformed or excessively large
input data. It may be useful to add a clarifying sentence such as:

"Implementations MUST validate TagString inputs for UTF-8 correctness and
reasonable length before use, in accordance with the security considerations in
[RFC 3629]"

What do you think?
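
As a rough illustration of the kind of check I mean, here is a minimal Python
sketch. The validate_tag_string helper and the 64 KiB limit are hypothetical;
the draft does not define a maximum length, so the limit is only a placeholder
for whatever an implementation considers reasonable.

```python
# Hypothetical upper bound; an implementation would pick its own limit.
MAX_TAG_STRING_BYTES = 64 * 1024


def validate_tag_string(raw: bytes,
                        max_bytes: int = MAX_TAG_STRING_BYTES) -> str:
    """Return the decoded string, or raise ValueError if it is unsafe to use."""
    # Check the size first so oversized input is rejected before decoding.
    if len(raw) > max_bytes:
        raise ValueError(f"TagString too long: {len(raw)} bytes")
    try:
        # Strict decoding rejects malformed sequences, including overlong
        # encodings, instead of silently replacing them.
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError as exc:
        raise ValueError("TagString is not valid UTF-8") from exc


def is_valid_tag_string(raw: bytes) -> bool:
    """Convenience wrapper that reports validity without raising."""
    try:
        validate_tag_string(raw)
    except ValueError:
        return False
    return True
```

Rejecting the input outright is one possible policy; an implementation could
instead truncate or skip the offending tag, as long as it never passes
unvalidated bytes on to later processing.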

8- The draft describes how multiple SimpleTag elements may appear under the
same Tag element, allowing multiple values for the same tag name.

However, how should applications interpret or prioritize these values if
conflicting tags occur, for example two TITLE tags with different TagString
values within the same Targets element?

Nits:

9- choregrapher → choreographer

10- the values is stored → the value is stored

11- parts that are inside or outside a given file → ambiguous. Consider
clarifying to something like: “parts located either within or externally
referenced by a given file”.

12- Due to the various nature of tag sources → Due to the varied nature of tag
sources

13- each demand needs to balance if it makes sense… → each request needs to be
evaluated to determine if it makes sense…

14- an host app → a host app

15- A Tag element has a single Targets element with a single TargetTypeValue
element. But the Targets element… → replace “But…” with “However,…”

16- It is RECOMMENDED to start a tag name… → It is RECOMMENDED that tag names
start…

17- for non official tags than are not meant to make it to the list… → for
non-official tags that are not meant to be added to the list of official tags...

18- apply to the all lower TargetTypeValues → apply to all lower
TargetTypeValues

Thanks for this document,

Ines.


