On 3/25/22 12:05, Thomas Lamprecht wrote:
(re-sending, seems you forgot to hit Reply-All and only sent it to me directly).

On 25.03.22 10:41, Daniel Tschlatscher wrote:
On 3/25/22 09:29, Thomas Lamprecht wrote:
Thanks for the patch, some comments inline.
On 24.03.22 16:44, Daniel Tschlatscher wrote:
please format the commit message according to our submission rules:
https://pve.proxmox.com/wiki/Developer_Documentation#Commits_and_Commit_Messages

an output example (maybe even a before/after excerpt) would be nice to have here.
Just leave out the ":" after "fix", right? Or is there something else I should 
change?
I actually meant the per-line character length, quoting:
Make sure the line length of the commit's message is not longer than 70 
characters.
Note, HTTPS links are an exception and should not be split.
but yeah, now that you say it, it really should be `fix #3865: ...`


   +typedef struct VmaRestoreState {
   +    BlockBackend *target;
   +    bool write_zeroes;
@@ -649,13 +653,31 @@ index 0000000000..2b1d1cdab3
   +
   +        if (verbose) {
   +            time_t duration = time(NULL) - vmar->start_time;
The resulting accuracy of tracking time in seconds might not be that good, did you
give that any thought or do any comparison?
I "built" upon what was already here. This is how the elapsed seconds were measured
in the original code, and it works well enough in this case. There are APIs for tracking
milliseconds, but I don't think that much fine-grained control over the elapsed time
has any real benefit here.
sorry, but what? Say we have a byte delta of 100 MiB since the last report, and nothing
guarantees that we're currently very close to a second boundary, so let's see
how the error of second resolution compares to millisecond resolution.

Using second granularity, anything between 1s <= t < 2s gets you 100 MiB/s.
On the other hand:

                     "actual"     "1s cut off"   "error"
100 MiB / 1.000s = 100.0 MiB/s   100.0 MiB/s      0.0
100 MiB / 1.050s =  95.2 MiB/s   100.0 MiB/s      4.8
100 MiB / 1.123s =  89.0 MiB/s   100.0 MiB/s     11.0
100 MiB / 1.130s =  88.5 MiB/s   100.0 MiB/s     11.5
100 MiB / 1.500s =  66.7 MiB/s   100.0 MiB/s     33.3
100 MiB / 1.750s =  57.1 MiB/s   100.0 MiB/s     42.9
100 MiB / 1.999s =  50.0 MiB/s   100.0 MiB/s     50.0

So even just 7 ms (1.123s vs 1.130s) makes a 0.5 MiB/s difference, which is IMO
already relevant.

Wouldn't dare to call that "working well enough", or are we talking about
different things?
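
For illustration only, a minimal sketch of how the delta could be measured with
sub-second granularity (this is not the actual vma.c code; g_get_monotonic_time()
comes from GLib, which QEMU already links against, and all names here are made up):

    #include <glib.h>     /* g_get_monotonic_time(): monotonic clock in microseconds */
    #include <stdio.h>
    #include <stdint.h>

    static gint64 last_update_us;    /* timestamp of the previous report */

    static void report_throughput(uint64_t bytes_since_last)
    {
        gint64 now_us = g_get_monotonic_time();
        gint64 delta_ms = (now_us - last_update_us) / 1000;

        double mibps = 0.0;
        if (delta_ms > 0) {
            /* bytes / ms * 1000 => bytes/s, then scale to MiB/s */
            mibps = (double)bytes_since_last * 1000.0 / delta_ms
                    / (1024.0 * 1024.0);
        }
        fprintf(stderr, "progress: %.1f MiB/s\n", mibps);
        last_update_us = now_us;
    }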

++
++                if (delta != 0)
++                    bps /= delta;
currently, if delta is zero you'd print a plain size unit as the throughput unit?

This is due to the underlying code. The function wrapping this part of the code is
called multiple times every second (something like ~20 times/sec) and, in the
original code, a print statement would only occur after the percentage of bytes
transferred had increased by at least 1%.
The time elapsed between two increases in percent could be less than a second. In
this case the output simply prints how many bytes were transferred since this
second began. This means the throughput value would, at the time of the print,
always be lower than the real-world B/s, and only the last print in that second
would represent the actual B/s.
This just doesn't make sense to me, the if clause enforces that it's called at
most every 2s (or in the 100% edge case) already, and using wrong units is using
wrong units, you cannot really argue basic math/physics away.

The 100% edge case is the problem for the most part.

It comes down to the inaccuracy of keeping time in seconds rather than something more granular. While it is true that in 0 time no actual data could be transferred, 0 in this case means anywhere from 0 to 0.999 seconds. Delta could be zero while the transferred bytes, for example, amount to 20 MB.

Just printing 0 would be confusing because *some* data was in fact transferred, which is reflected in the bps variable. It could also give the user the false impression that their disk is slow or something similar if they see a lot of "0 B/s" entries (which was the actual reasoning here). With the current implementation, if delta is 0, bps = 20 MB and in reality 0.33 seconds have elapsed, the program prints "20 MB/s". In actuality, the expected rate of transfer would be "60 MB/s" though.

My thinking in not using milliseconds here was: even though I could calculate the per-second transfer rate at the time of the print, it is still an approximation. Imagine the following scenario: in the first 0.33 secs, 20 MB are transferred (=> a print goes out with 60 MB/s).
In the next 0.33 secs only 10 MB are transferred.
In the last 0.33 secs only 0.5 MB are transferred.
In this case "60 MB/s" was printed, but only 30.5 MB have actually been transferred.

The current implementation might severely understate the actual throughput, but it will never show a B/s value that is higher than the actual throughput per second. The unaccounted-for bytes would then be reflected in the following output. The millisecond approach would have both the same and the opposite problem, e.g. if delta is only a few milliseconds and the throughput in that time is a lot higher than average, then the printed B/s value could be highly exaggerated.

I left the current implementation as it was because, in practice, at the end of the backup restore there are usually a lot of sparse bytes (I hope I am using this term correctly). This means that the restore would go from 56% to 100% in less than a second. Simply writing 0 here would probably confuse the user more than anything, because the value suddenly jumped by e.g. 6 GiB with 0 B/s.

Still, I do concede that using milliseconds just makes more sense, because it is the more "proper" way to implement something like this, and especially because of the unit; in the current implementation it would be something like B/last-output. Also, it seems to me that the throughput plummeting or skyrocketing within less than a second is rather unlikely, so the millisecond method would statistically just be more correct.
I will change it in v3.

Nonetheless, I later adapted this code to behave like the backup create 
function does, which updates every 3 seconds, eliminating this problem 
entirely. This is therefore dead code anyway.

later where?

An actual improvement would be to track the duration in milliseconds or less and
still catch the edge case where delta is zero, and just set the throughput to zero
then (as no data can transfer in zero time).
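
Something along those lines, as a rough sketch (variable names are illustrative, not from the patch):

    /* delta_ms: time since the last report, in milliseconds */
    uint64_t bps = 0;
    if (delta_ms > 0) {
        bps = delta_bytes * 1000 / delta_ms;   /* bytes per second */
    }
    /* delta_ms == 0: report 0 B/s instead of a raw byte count */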

Later in the sense of time. I rewrote the if statement above to only execute if delta is greater than 2 seconds (before, it updated with every percent increase). "Eliminating this problem entirely" is, I retract, not true though, because I forgot about the check whether progress is at 100%, in which case delta could still be 0.

it would IMO be worth it to factor the above lines out into a static local helper, to
avoid crowding this function too much, e.g. (types may be adapted if reasonable):

print_restore_progress(uint64_t total_byte, uint64_t restored_byte, uint64_t duration_ms);

I would handle the last_X statics in there and also re-calculate the percentage as a
float in there, so that we can print a similar amount of digits after the decimal
place like we do in backup.
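
A rough sketch of what such a helper might look like (purely illustrative; the statics,
the output format and the ms-based duration are assumptions, not code from the patch):

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    static void print_restore_progress(uint64_t total_byte, uint64_t restored_byte,
                                       uint64_t duration_ms)
    {
        static uint64_t last_restored_byte;
        static uint64_t last_duration_ms;

        uint64_t delta_ms = duration_ms - last_duration_ms;
        uint64_t delta_byte = restored_byte - last_restored_byte;

        uint64_t bps = 0;
        if (delta_ms > 0) {
            bps = delta_byte * 1000 / delta_ms;   /* bytes per second */
        }

        /* percentage as float, so digits after the decimal place can be printed */
        float percent = total_byte
            ? (float)restored_byte * 100 / total_byte : 100.0f;

        fprintf(stderr, "progress %.1f%% (%" PRIu64 " of %" PRIu64 " bytes, %" PRIu64 " B/s)\n",
                percent, restored_byte, total_byte, bps);

        last_restored_byte = restored_byte;
        last_duration_ms = duration_ms;
    }
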
A function sounds good, though I am not quite sure what you mean by "print a similar
amount of digits after the decimal place like we do in backup".
Hmm, I thought the percentage wasn't an integer there, so I meant something like 5.1%,
but apparently it is a plain integer (that may just have been in an experiment I did
when I rewrote that whole thing a while ago), so for now you can ignore that.

_______________________________________________
pve-devel mailing list
pve-devel@lists.proxmox.com
https://lists.proxmox.com/cgi-bin/mailman/listinfo/pve-devel
