Thanks for the reply. Comments in line.
On 1/22/2024 3:27 AM, François Michel wrote:
Hi Christian,
Great to hear from you, especially given you expertise in the topic!
Thank you for all your comments. See my answers below.
I am adding Rachel to the loop, that was interested in progressing in
the draft together and ensure the design can handle their use-case as well.
Le 19/01/24 à 07:29, Christian Huitema a écrit :
François, Olivier,
I just spent some time studying your draft on QUIC FEC. I like the
idea of having an FEC framework independent from the algorithm used to
actually compute the FEC data and repair packets. Your draft solves a
number of practical problems, such as how to notify peers when FEC
helps receive a frame from an otherwise lost packet, or how to
identify "symblos" independently of packet numbers using the symbol
identifier frame (SID).
The draft is obviously a work in progress.
Yes, the aim of this draft and current papers under submission is to
spark the interest on the topic again. I've been working on FEC for
years now and was part of the earlier QUIC-FEC work at NWCRG where we
already wrote interesting drafts.
My intent with this one here is to propose a short and simple
specification that people can wrap their head around. We can then make
it progress together with quicwg folks instead of proposing a first,
exhaustive but complex draft that is difficult to apprehend.
You propose two alternatives for linking frames to a SID. I wish you
picked just one, and I prefer your first alternative, in which your
SID frames brackets a list of protected frames.
I prefer the first one as well, but I was insure it fits well the design
of existing implementations. If people are okay, I'd be more than happy
to only keep alternative 1.
However, I an not quite sure
how this should be parsed. You give an example as:
| Pkt(6)[STREAM(2, "xyz"), |
| SOURCE_SYMBOL(1, { STREAM(8, "def"), |
| DATAGRAM("msg") }] |
In that example, the frame STREAM(8..) and DATAGRAM() are protected,
while the "STREAM(2)" is not. Fine, but the syntax is described as:
SOURCE_SYMBOL {
SID (i),
FEC Protected Payload (..)
}
... and I don't know how to parse that. There is no indication of the
length of the "FEC Protected Payload". Do you mean to indicate that
the SOURCE_SYMBOL frame extends to the end of the packet, and that all
frames following the SID are protected?
Yes, that's right. In the current design, the frame stops at the end of
the packet. We can add a length field or a number of protected frames.
OK. It would be simple enough to define packets as:
<non protected frames><SID><protected frames>
You define a framework in which client and server negotiate to use
FEC, and also to select a FEC scheme. The syntax of your transport
parameter seems a bit restrictive: the client proposes to use FEC and
a specific scheme, and the server accepts or refuse. Given the
experimental nature of FEC, I expect that we will try several
algorithms. It would be nice for the client to propose a list, and for
the server to pick one -- or zero, if it does not support any of the
proposed values. In fact, I think that you could merge the "enable
FEC" parameter that negotiates use of FEC with the "decoder FEC
scheme" negotiation.
Agree, we could use the transport parameter to propose a list of FEC
schemes, that's indeed how most negotiation mechanisms work. The absence
of this parameter indicates FEC is not supported and an empty list would
announce that FEC SOURCE_SYMBOL frames are parsed but not used. This
might cause problems though as the REPAIR frame format depends on the
negotiated scheme.
The classic negotiation is "client proposes many, sender pick one, and
that one defines the REPAIR format." The alternative would be to not
have a repair format, and let each scheme define its own.
Your draft does not assign identifiers to existing FEC schemes. To
facilitate interop tests, I suggest that you define at least one. In
fact, I would suggest a very simple one, in which the REPAIR frame
identifies a range of SID, and then carries the XOR of all packets in
that range.
Agree. I am not a fan of the XOR code as it performs really badly in
many scenarios when losses occur in bursts and it might lead
implementers to only implement XOR and I would like to avoid that. We
could maybe define xor (e.g. "interop_xor") in another draft especially
for interop purpose so that it is clear.
Yes, that was my intent too.
The suggestion above brings a discussion of the relative size of the
"FEC Protected Payload" and the REPAIR frames. As in the example
above, I would expect REPAIR frames to include a small header followed
by a combination of the content of several FEC Protected Payload, with
that combination being at least as long as the longest FEC Protected
Payload in the set. That longest size, by default, can be a full
packet payload (per PMTU), minus the length of the SID prefix. But
that leave very little room for encoding the prefix of the REPAIR
frame, which is likely to require at list the REPAIR frame type
(arguably same length as the SOURCE_SYMBOL frame type), and SID
identifying the range (same length as the SID parameter of the
SOURCE_SYMBOL frame), and an additional parameter indicating the
variant of he repair frame according to the selected scheme (arguably
the same length as the coding window). Is that the problem that you
are discussing in section 4.2.3?
If you refer to the end section 5.3 and not 4.2.3, yes. The REPAIR frame
may contain metadata that may increase its overhead. So I see two ways
to cope with this problem:
1) Restrict the maximum size of FEC Protected Payload (what we do now in
our implems)
2) Make it possible to "stream" REPAIR frames. This is a bit sad as you
may need more packets than repair symbols but on average that could work
well.
If you see the need to support streaming of repairs, then it would make
sense to define how to do that in the generic "framing" document.
Restricting the length of the protected frames feels much simpler, for
example not having to worry about receiving only fractions of REPAIR
frames. But you would want that to be a function of the PMTU, otherwise
you introduce overhead, which is why I was mentioning "maximum overhead
of repair"
Should there be some property associated to the FEC scheme, such as
the maximum overhead of a REPAIR frame?
If we decide 1) above, yes, that would be really helpful to have the FEC
scheme signal the max repair frame overhead, but that may not work with
schemes that e.g. explicitly list the IDs of protected symbols.
Make that a "don't care" option, but it requires defining how to split
and reassemble repair frames.
(Also, why pad the FEC-protected data at the beginning rather than at
the end? Or leave that as a property of the FEC scheme?)
We pad it at the beginning so that padding can be naturally handled at
the QUIC layer. The padding can be parsed and handled as classical QUIC
padding frames, so the decoded does not have to process the recovered
payload before handling it to QUIC.
That does not feel like a very good reason. Padding at the end is
"virtual", and does not introduce packet overhead. Yes, this means the
final "repaired" frame may contain an arbitrary number of zeroes at the
end, but these will be parsed as QUIC padding and ignored by the decoder.
I am not sure that I fully understand how to use the FEC WINDOW frame.
You allow it to change, but what if the packet containing that frame
is lost? How can the peer know when exactly the use of the new window
starts, and which window is associated with a particular SOURCE_SYMBOL
or REPAIR frame?
I think our draft is not clear enough about that. The FEC_WINDOW frame
announces that maximum amount of symbols that can be stored by the FEC
decoder. It has a purpose comparable to QUIC MAX_DATA frames. This
prevent the sender to send REPAIR frames that protect more symbols than
what the receiver is able to store.
In our view, the actual window protected by a repair symbol should be
announced inside the REPAIR frame carrying the repair symbol. Source
symbols can be associated with many different windows with different
sizes, all being defined by the repair symbols.
If we announce a reduced FEC_WINDOW and if the packet containing it is
lost, there will be a period of time where the server may send REPAIR
frames that protect more symbols than what the receiver is allowed to
store and those REPAIR frames will be useless. In our scenarios, this
FEC_WINDOW limit was rarely hit though, as if you send at a limited
bitrate (e.g. real-time video), you'll probably protect less packets
than the maximum buffer size of the FEC Decoder. Same thing is you
protect bulk transfers with mdeium/low BDPs, but this limit will likely
be hit with high BDPs or low-memory FEC Decoders.
Reed-Solomon codes are often characterized by two numbers, the length
of the coding window and the number of redundant copies -- in our
case, the number of REPAIR frames for a given coding window. It seems
that in your proposal these two numbers are set arbitrarily by the
sender. Should there me some negotiation of maximum values? Or would
those maximum values be deduced from the scheme identifier, something
like "reed solomon 32 + 8"? Or should the "repair" frame indicate the
length of the coding widow over which it operates?
I like the idea of the repair symbols being self-contained. However, I
understand that there might be good reasons to define limits, especially
on constrained devices, as complex codes (e.g. large reed solomon
blocks) might be too heavy CPU-wise.
I have used 40+8 in previous implementations. Yes, there is a CPU cost,
but it is smaller than or comparable to the cost of encryption. Some
implementations might want to do that.
I am also not sure how the update of the coding window works for a
convolutional code
(see my answer in the paragraph below)
One way to understand the coding window is "the number of frames over
which a given REPAIR may operate," but we are concerned with
correlated losses happening in trains. To protect against that, it is
nice to send the repair frames some time after the protected frames,
in which case the window would an indication of how long a copy of a
given frame has to be kept. This could be expressed as a number of
packets, but
if multipath is supported we may want to send the repairs on a
different path, and then using number of packets is not natural.
In our implem, the sender stores the maximum window size (announced by
FEC_WINDOW frames). When generating a repair symbol, the size of the
window protected by this repair symbol is set to min(max_window_size,
n_symbols_in_flight). That window size is announced as part of the
repair symbol. Concerning the receiver, it keeps track of the source
symbol with the highest received SID (highest_sid). It keeps in memory
the symbols with SID [highest_sid-max_window_size, highest_sid] and can
therefore perform decoding for repair symbols protecting windows sitting
inside this interval.
I can see a negotiation between sender and receiver. In my mind, there
are two critical numbers, both on the receiver side:
* the maximum number of SID that it will remember, because that requires
committing memory.
* the maximum number of SID+REPAIR frames that it is willing to combine,
i.e., the size of the "repair matrix", because of both CPU and memory.
Arguably, that one depends on the scheme.
If we have a negotiation, then I would expect it at the beginning of the
connection. The variable window seems like a secondary optimization, as
in "we negotiated a maximum of 16+5, but for now i am sending 8+2
because it seems sufficient".
I would rather encode that in the SID frame than in a separate frame,
because we have some room in the SID frame (see length of REPAIR frame
issue). Also because that removes the need for handling synchronization
of FEC_WINDOW, SID and REPAIR.
So using SIDs instead of packet numbers and number of frames seems more
natural to me.
Speaking of multipath: I had hopes at some point to be able to define
FEC-dedicated paths, even if the FEC-dedicated path is running on the
same network path (i.e. like a "virtual path" concept). Let me explain:
frames protected by FEC could be sent on packets over the FEC-dedicated
path only and frames not protected by FEC would be sent on another path.
Both path could actually use the same network path, but this could
naturally allow to decouple FEC-protected frames from others and remove
the need of an SID frame (and spare space in packets!) if we ensure the
packet numbers are sent contiguously, as we could use packet numbers as
SIDs. I know contiguous packet numbers is not how QUIC works, and this
idea may not be the ideal solution, but I describe it here as it might
spark clever ideas from you or other folks reading this thread.
Why specialize paths? You could spray both the SID and the REPAIR over
multiple paths. You don't need much there, but you do need a common SID
number space between all paths. And you need to handle reordering of out
of order SID and REPAIR packets, with a reordering buffer the size of
the FEC window.
OK, that's a lot of text. Some of that may be because I did not fully
understand your intent. I expect things to get clearer with your next
draft, or when we start interop testing of different implementations...
Sorry if the draft is still in a bit of a rough form! We currently
concentrate on having implems and papers published (publishing FEC work
is a *real* effort, as the ideas behind FEC are old, it is difficult to
convince reviewers of the novelty... :-) and that's fine but it takes time)
I will certainly integrate your comments in the next version(s), they
are all valuable and totally on-point ! Thank you so much for that.
Waiting to work on that!
Same, after all these years ! Maybe Brisbane is too early to host a
hackathon table (especially that I don't know how many folks would be
interested), but that might be something we could do for next IETFs.
Once I get my work published, I can also release my FEC encoder/decoder
lib that has C bindings that could really be valuable for interop as well.
Will look into that, thanks.
-- Christian Huitema