Re: [aqm] draft-ietf-aqm-pie-02 review

Rong Pan (ropan) Tue, 17 Nov 2015 15:20:07 -0800

Polina,

A new version of PIE, draft-ietf-aqm-pie-03, is uploaded which addresses your 
comments below. I also address your comments below.

Thanks,

Rong

From: aqm <[email protected]<mailto:[email protected]>> on behalf of 
Polina Goltsman 
<[email protected]<mailto:[email protected]>>
Date: Friday, August 14, 2015 at 5:25 AM
To: AQM IETF list <[email protected]<mailto:[email protected]>>
Subject: [aqm] draft-ietf-aqm-pie-02 review

Hello all,

Currently, I'm implementing various AQMs and one of them is PIE. However, one 
problem is that the implementation is made difficult due to slight 
inconsistencies between the draft section 4 and the pseudocode as well as it's 
difficulty
to understand without the corresponding (theoretical) background knowledge. 
Therefore, here are my suggestions for improvement from my review of the latest 
draft version [thanks to Roland for helping with the suggestions and text]:
General feedback:
IHMO the current version of the draft, especially Section 4, is hard to 
understand without reading the PIE paper ([HSPR-PIE]) first. Ideally I would 
prefer an introduction section with the overview of the design before the 
"Terminology" section, but it should at least be suggested to read the paper 
first. BTW is the paper publicly available?

>>>>>>>>>>>>>>>RP: I have added more explanations before each of the sessions 
>>>>>>>>>>>>>>>to make them more readable. I have also added a paper link to 
>>>>>>>>>>>>>>>the HSPR-PIE paper in the draft.

In Section 4 there is a subsection per feature added, as opposed to a 
subsection per component. As a result, requirements from the same component are 
written in pieces in different subsections. For me it is very hard to combine 
them together. One solution could be to include the pseudocode as last 
subsection of Section 4. The same applies to Section 5.

>>>>>>>>>>>>>>>RP: some functions are implemented in multiple components. I 
>>>>>>>>>>>>>>>have reorganized the sections. For each feature, I try to 
>>>>>>>>>>>>>>>explain what components are involved.

The variables in formulas are not introduced before the formulas. I would 
prefer to have a short description of what a formula does before the code and a 
long description after the code.

>>>>>>>>>>>>>>>RP: I have added descriptions in each subsection explaining the 
>>>>>>>>>>>>>>>intent of each components.

PIE is described as consisting of three components, however the "preambles" in 
the formulas (the lines that start with stars) do not referred directly to the 
components and the code for the same components are titled with different 
star-lines: e.g., the specification of "random dropping at enqueue" component 
appears under:

* upon a packet arrival MUST
* upon packet arrival
* if rand() < p

In the previous draft version there was only the link capacity estimation 
version and there was no latency component, only the capacity estimation 
component. In the current draft this is replaced with two options, which made 
the whole draft very confusing. For instance, why is delay called est-delay if 
it is either estimated or sampled? It is also not clear that estimated delay is 
calculated in drop_probability_calculation. I would guess it after reading the 
paper...  IMO it would be better to describe the whole original version with 
link capacity estimation and then suggest the alternatives: implementation with 
timestamps or reading capacity from some other (external to PIE) system 
component as in DOCSIS-PIE. Is it expected that most versions will use capacity 
estimation?

>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: I have reworded the latency, not as 
>>>>>>>>>>>>>>>>>>>>>>>>>>>estimated_delay, but as delay samples. Hopefully it 
>>>>>>>>>>>>>>>>>>>>>>>>>>>is clear now.

There are SHOULD values for parameters of PIE, but it is never explained in 
what range of conditions these parameters are [expected to be] valid. There are 
slides for PIE in Data Centers here: 
http://www.ietf.org/proceedings/86/slides/slides-86-iccrg-5.pdf. They use 
different parameters for alpha, beta, and T_update. So I assume that the 
parameters in the draft are not valid in these scenario. Also, since 
target/reference delay is what a user probably wants to configure, should it be 
SHOULD as opposed to RECOMMENDED. Finally, what minimum buffer size is required 
for these values of target and burst-allow, so that PIE doesn't revert to 
taildrop in the process?

Moreover, it is IMHO questionable whether implementers are familiar with the 
control theory and can perform calculations based on the formulas in the paper. 
Ideally, I would prefer a spreadsheet, where I can input visible network 
parameters - e.g.,  link capacity, average or max RTT, max_available_buffer, 
desired delay and get values for PIE variables.

>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: I have added rule of thumbs at the end of the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>Section 4.2. The idea is that the operators don’t 
>>>>>>>>>>>>>>>>>>>>>>>>>>>need to change those parameters. There is no 
>>>>>>>>>>>>>>>>>>>>>>>>>>>requirement of them learning the control theory. All 
>>>>>>>>>>>>>>>>>>>>>>>>>>>those are internal parameters for PIE.

As a nit, IMHO delay should always be called delay and not sometimes delay and 
sometimes latency.

>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: I have tried to change as much as possible.

per-Section feedback:
Sec4.2:
>in the formulas for autotune: if (drop_prob_ < X)
is p a value between 0 and 1 or between 0 and 100% ?

>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: 0 to 1.

Sec4.2:
>Here, the current queue length is denoted by qlen.
there is no qlen in the formulas

>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: removed.

Sec4.2:
> Variables, est_del and est_del_old represent the current and previous 
> estimation of the queueing delay.
suggestion: which are calculate by the latency component (see Section 4.3)

>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: changed to current_qdelay and qdelay_old.

Sec4.2:
IMO the rationale for PI controller belongs to Design Goals in Section 3 and 
not to algorithm specification. By this line, it should already be clear that 
PI is used.

>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: PI is not the design goal of PIE. It is a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>mechanism that can achieve our goals. I moved the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>high level discussion about PI controller up to the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>beginning of Section 4.

Sec4.4:
>* if p == 0 and est_del < del_ref and est_del_old < del_ref
>burst_allowance = max_burst;
to what component does this formula belongs?

>>>>>>>>>>>>>>>>>>>>>>>>>>RP:  initializing the burst allowance, it should be 
>>>>>>>>>>>>>>>>>>>>>>>>>>in "Random Dropping” block.

Sec5.1

* if rand() < p
This is implementation consideration that belongs to Section 6, here should be 
"upon packet arrival"

>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: done

Sec5.2
see the comment to the PIE enhanced pseudocode, denoted as [**]. it affects 
this line too.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: addressed below.

Sec5.3
Is the size of a buffer specified somewhere? It can be bufferbloated, so that 
1/3buffer is more than reference delay and burst_allowance together.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: sizing the buffer is not the goal of PIE. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Also, micro burst over 1/3 buffer is not the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>issue, and we don’t want to stop it from 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>happening. The goal is to avoid long term 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>persistent congestion that causes the delay bloat, 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>explained in Section 5.3.

Sec6
>If the implementation doesn’t rely on packet timestamps for calculating 
>latency, PIE does not require extra memory.
extra operations?
I assume that Linux's Codel does not allocate extra space in skb for 
timestamps, it is already there...

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: timestamp has to be stored 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>somewhere, all we claiming is that we 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>don’t need anything extra.

>The state requirement is only two variables per queue: est_del and 
>est_del_old. Hence the memory overhead is small.
well the state is est_del_old, p, and burst_allow, and depending on the delay 
calculation either est_delay or average drain rate and drained bytes for 
measurement.

>PIE calculates latency using the departure rate, which can be implemented 
>using a multiplication.
see the comment to the PIE enhanced pseudocode, denoted as [**]. it affects 
this line too.

>In summary, PIE is simple to implement
given the current state of the draft I really doubt this ...

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: the statement is subjective. I removed 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>it and changed it to "PIE is simple enough 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>to be implemented in both hardware and 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>software". This is a fact.

>SFQ can be combined with PIE to provide further improvement of latency for 
>various flows with different priorities.
There was a discussion started by me here 
(http://www.ietf.org/mail-archive/web/aqm/current/msg01269.html) in which it 
was concluded that for SFQ scheduler there is supposed to be a single PIE state 
and drop probability of each queue is multiplied by its qlen/max qlen 
(http://www.ietf.org/mail-archive/web/aqm/current/msg01363.html). Additionally, 
it was stated that the same approach was taken for a scheduler with small 
number of queues ( probably in this email 
http://www.ietf.org/mail-archive/web/aqm/current/msg01358.html)
Since discussing the interaction of AQM with a scheduler is a requirement from 
draft-aqm-eval-guidelines, this qlen/max_qlen recomendation should be summarized

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: Added a paragraph at the end of 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>Section 6 to discuss SFQ and PIE.

PIE BASIC pseudocode feedback:
>current_qdelay_, -qdelay_old_,
aren't they called est_delay and est_delay_old in Section 4. It would be good 
to use consistent names

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: done.

enqueue():

Burst allowance is reset in this routine. However in the next pseudocode it is 
reset in calculate_drop_prob() which does make more sense to me.

>if (PIE->burst_allowance_ < 0 && drop_early() == DROP && PIE->burst_allowance_ 
><= 0) {
the burst_allowance condition is verified two times, with < and with <=

>if ( (PIE->qdelay_old_ < QDELAY_REF/2 && PIE->drop_prob_ < 20%)
>|| (queue_.byte_length() <= 2 * MEAN_PKTSIZE) ) {
>return ENQUE;
>}

is this line explained in Section 4?

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: this is to make PIE work conserving, which is 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>not the key part of PIE.

calculate_drop_prob():

>//can be implemented using integer multiply,
>qdelay = PIE->current_qdelay_;
it probably can, but isn't simple assignment better :)

>PIE->last_timestamp_ = now;
I assume this line came from calling the timer function based on timestamps, 
which is implementation decision and should not be present in pseudocode. Linux 
e.g., calls calculate_drop_prob() by timer.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: pseudo code is simply showing how it can be 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>conceptually done, not trying to match Linux 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>code.

PIE enhanced pseudocode feedback:
enqueue:

>if (queue_.byte_length()+packet.size() > TAIL_DROP) {
>drop(packet);

this I assume is from Bob Briscoe's review 
(http://www.ietf.org/mail-archive/web/aqm/current/msg01175.html), is it 
explained somewhere in Section 5?

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: yes, but since it is regarding tail 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>dropping but PIE per se. We don’t want 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>to over explain everything in the pseudo 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>code.

calculate_drop_prob:

>if ( (now - PIE->last_timestampe_) >= T_UPDATE &&
>PIE->active_ == ACTIVE) {
1) it is not explained in Section 5 that when PIE is inactive, drop probability 
is not recalculated. Also should it really not?

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: If PIE is not active, drop 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>probability calculation is ignored. We 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>don’t want to get into a situation 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>where as soon as PIE is back to be 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>active, packets would incur massive 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>drops.

2) now - timestamp is implementation choice and should not be here

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: it is meant to illustrate the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>point, I.e. Pseudo code, not the real 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>code.

3) there is a typo in timestampe

>if (PIE->drop_prob_ >= 10% && p > 2%) {
>p = 0.02;
>}
>PIE->drop_prob_ += p;
it this line explained somewhere in Section 5? I believe this line was 
commented in this email 
(http://www.ietf.org/mail-archive/web/aqm/current/msg01216.html) and should 
have appeared as MAY requirement.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: added a section Section 5.5. 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>addressing this.

dequeue:

>weight = DQ_THRESHOLD/2^16
1) this is new and not explained in Section 5.2
2) since dq_thresh is in bytes, weight also has units - bytes. which makes 
avg_dq_time  in units bytes * sec or something unknown like (1 - bytes) * sec.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: weight does not have unit. Consider it 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>as weight = DQ_THRESHOLD/2^16Byte

in Linux code it is first checked whether a new measurement cycle can be 
started, and then departed bytes are updated. With current version if there is 
exactly dq_threshold bytes in queue and new packets will not come the code will 
not count bytes in the first packet and the condition  (queue_.byte_length() >= 
DQ_THRESHOLD ) will not happen after new packets arrive.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> RP: Sorry, Linux code is not maintained 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> at this moment.

[**]
the PIE version in [HSPR-PIE] and Linux version calculate instantaneous 
capacity (also called departure rate) as dq_count/dq_time. Then the average 
capacity was calculated. The new code proposes to calculate instantaneous 
dq_time and average dq_time and then calculate capacity as dq_thresh/dq_time in 
calculate_drop_prob. Since the queue can send packets and not bytes, dq_count 
is not dq_thresh and it may be well different and measured capacity not precise.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP:  Yes, it is not precise as you 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>mentioned. However, we tried to avoid a 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>non-trivial (from hardware point of view) 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>divide by using DQ_THRESHOLD which can be 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>simply done using a right shift. Given that 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>our DQ_THRESHOLD is 2^16, one or two packet 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>can only affect the resolution around 1/64 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>=1.5%, which we believe is a good trade off.

I think it would be better to include the correct formulas, and then suggest a 
less precise implementation, which requires less overhead in Section 6. In some 
situations there can be enough hardware power so that the optimization is not 
necessary (e.g., Linux on modern  Desktops). Plus on 10Gb interface Linux sends 
65K tso segments on 10Gb, I don't think I would want the value to be += one 
segment precise.

PIE Linux Code:

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: the person who wrote the PIE Linux code 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>has left Cisco. We don’t support the 
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>maintenance of Linux code at the moment.

This is code from 
https://github.com/torvalds/linux/blob/master/net/sched/sch_pie.c
As far as I've checked, the code in 
https://github.com/hironoriokano/fq-pie/blob/master/pie.h is the same.

enqueue:

line 125 - byte mode
byte mode is not explained anywhere in the draft

dequeue:

line 286:
do I understand correctly: old epsilon is 1/8 but new epsilon is threshold / 
2^16 which is supposedly  16 * 2^10 / 2^16 = 2^-4.

line 300 updates burst-allow-= dtime:
according to the draft, burst-allow should be updated every Tupdate together 
with drop probability and not on every measure ?
since there is no guarantee that measurement is always active these are two 
different results. Is this code then wrong?

calculate_drop_prob:

line 378:
>if (qdelay > (PSCHED_NS2TICKS(250 * NSEC_PER_MSEC)))
>delta += MAX_PROB / (100 / 2);
this is not in the draft, is it ?

line 416 and below:

> /* We restart the measurement cycle if the following conditions are met ...
why do we need to restart measurement cycle? In my understanding the 
measurement cycle is independent of p , and only update to burst-allow is 
necessary.
Nits:
just before 4.
>In the following, the design of PIE and its operation are described in deta.
Detail

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: done.

Sec4.1

>PIE optionally supports ECN and will be discussed in Section 5.1.
PIE optionally supports ECN. See Section 5.1.

or

PIE optionally supports ECN which will be discussed in Section 5.1.

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: done.

>It can b reduced
it can be reduced
>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: done.

Sec5.2

> This threshold would allow us a long enough period
us should not be there

>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>RP: done.

Best Regards,
Polina

_______________________________________________
aqm mailing list
[email protected]
https://www.ietf.org/mailman/listinfo/aqm

Re: [aqm] draft-ietf-aqm-pie-02 review

Reply via email to