On Wednesday, December 4, 2019 at 5:19:23 AM UTC+1, Vasily Khoruzhick wrote:
> On Mon, Jan 14, 2019 at 1:25 AM Marc Zyngier <marc....@arm.com> wrote:
> >
> > Hi Samuel,
> >
> > On 13/01/2019 02:17, Samuel Holland wrote:
> > > The Allwinner A64 SoC is known[1] to have an unstable architectural
> > > timer, which manifests itself most obviously in the time jumping forward
> > > a multiple of 95 years[2][3]. This coincides with 2^56 cycles at a
> > > timer frequency of 24 MHz, implying that the time went slightly backward
> > > (and this was interpreted by the kernel as it jumping forward and
> > > wrapping around past the epoch).
> > >
> > > Investigation revealed instability in the low bits of CNTVCT at the
> > > point a high bit rolls over. This leads to power-of-two cycle forward
> > > and backward jumps. (Testing shows that forward jumps are about twice as
> > > likely as backward jumps.) Since the counter value returns to normal
> > > after an indeterminate read, each "jump" really consists of both a
> > > forward and backward jump from the software perspective.
> > >
> > > Unless the kernel is trapping CNTVCT reads, a userspace program is able
> > > to read the register in a loop faster than it changes. A test program
> > > running on all 4 CPU cores that reported jumps larger than 100 ms was
> > > run for 13.6 hours and reported the following:
> > >
> > >  Count | Event
> > > -------+---------------------------
> > >   9940 | jumped backward      699ms
> > >    268 | jumped backward     1398ms
> > >      1 | jumped backward     2097ms
> > >  16020 | jumped forward       175ms
> > >   6443 | jumped forward       699ms
> > >   2976 | jumped forward      1398ms
> > >      9 | jumped forward    356516ms
> > >      9 | jumped forward    357215ms
> > >      4 | jumped forward    714430ms
> > >      1 | jumped forward   3578440ms
> > >
> > > This works out to a jump larger than 100 ms about every 5.5 seconds on
> > > each CPU core.
> > >
> > > The largest jump (almost an hour!)
> > > was the following sequence of reads:
> > >     0x0000007fffffffff → 0x00000093feffffff → 0x0000008000000000
> > >
> > > Note that the middle bits don't necessarily all read as all zeroes or
> > > all ones during the anomalous behavior; however the low 10 bits checked
> > > by the function in this patch have never been observed with any other
> > > value.
> > >
> > > Also note that smaller jumps are much more common, with backward jumps
> > > of 2048 (2^11) cycles observed over 400 times per second on each core.
> > > (Of course, this is partially explained by lower bits rolling over more
> > > frequently.) Any one of these could have caused the 95 year time skip.
> > >
> > > Similar anomalies were observed while reading CNTPCT (after patching the
> > > kernel to allow reads from userspace). However, the CNTPCT jumps are
> > > much less frequent, and only small jumps were observed. The same program
> > > as before (except now reading CNTPCT) observed after 72 hours:
> > >
> > >  Count | Event
> > > -------+---------------------------
> > >     17 | jumped backward      699ms
> > >     52 | jumped forward       175ms
> > >   2831 | jumped forward       699ms
> > >      5 | jumped forward      1398ms
> > >
> > > Further investigation showed that the instability in CNTPCT/CNTVCT also
> > > affected the respective timer's TVAL register. The following values were
> > > observed immediately after writing CNTV_TVAL to 0x10000000:
> > >
> > >  CNTVCT             | CNTV_TVAL  | CNTV_CVAL          | CNTV_TVAL Error
> > > --------------------+------------+--------------------+-----------------
> > >  0x000000d4a2d8bfff | 0x10003fff | 0x000000d4b2d8bfff | +0x00004000
> > >  0x000000d4a2d94000 | 0x0fffffff | 0x000000d4b2d97fff | -0x00004000
> > >  0x000000d4a2d97fff | 0x10003fff | 0x000000d4b2d97fff | +0x00004000
> > >  0x000000d4a2d9c000 | 0x0fffffff | 0x000000d4b2d9ffff | -0x00004000
> > >
> > > The pattern of errors in CNTV_TVAL seemed to depend on exactly which
> > > value was written to it.
> > > For example, after writing 0x10101010:
> > >
> > >  CNTVCT             | CNTV_TVAL  | CNTV_CVAL          | CNTV_TVAL Error
> > > --------------------+------------+--------------------+-----------------
> > >  0x000001ac3effffff | 0x1110100f | 0x000001ac4f10100f | +0x1000000
> > >  0x000001ac40000000 | 0x1010100f | 0x000001ac5110100f | -0x1000000
> > >  0x000001ac58ffffff | 0x1110100f | 0x000001ac6910100f | +0x1000000
> > >  0x000001ac66000000 | 0x1010100f | 0x000001ac7710100f | -0x1000000
> > >  0x000001ac6affffff | 0x1110100f | 0x000001ac7b10100f | +0x1000000
> > >  0x000001ac6e000000 | 0x1010100f | 0x000001ac7f10100f | -0x1000000
> > >
> > > I was also twice able to reproduce the issue covered by Allwinner's
> > > workaround[4], that writing to TVAL sometimes fails, and both CVAL and
> > > TVAL are left with entirely bogus values. One was the following values:
> > >
> > >  CNTVCT             | CNTV_TVAL  | CNTV_CVAL
> > > --------------------+------------+--------------------------------------
> > >  0x000000d4a2d6014c | 0x8fbd5721 | 0x000000d132935fff (615s in the past)
> > >
> > > ========================================================================
> > >
> > > Because the CPU can read the CNTPCT/CNTVCT registers faster than they
> > > change, performing two reads of the register and comparing the high bits
> > > (like other workarounds) is not a workable solution. And because the
> > > timer can jump both forward and backward, no pair of reads can
> > > distinguish a good value from a bad one. The only way to guarantee a
> > > good value from consecutive reads would be to read _three_ times, and
> > > take the middle value only if the three values are 1) each unique and
> > > 2) increasing. This takes at minimum 3 counter cycles (125 ns), or more
> > > if an anomaly is detected.
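
The three-read check described above can be sketched in a few lines of Python. This is only a userspace model, not kernel code: `read_counter` is a hypothetical stand-in for a raw CNTVCT read, `fake_reader` is a hypothetical helper that replays a fixed sequence of values, and the retry bound is an arbitrary assumption.

```python
from typing import Callable, Iterable, Optional


def middle_of_three(read_counter: Callable[[], int]) -> Optional[int]:
    """Read three times; return the middle value only if the reads are
    strictly increasing (which also makes them unique), else None."""
    first, second, third = read_counter(), read_counter(), read_counter()
    if first < second < third:
        return second
    return None


def read_with_three_reads(read_counter: Callable[[], int],
                          max_attempts: int = 8) -> int:
    """Retry the three-read check until it passes (the bound is an assumption)."""
    for _ in range(max_attempts):
        value = middle_of_three(read_counter)
        if value is not None:
            return value
    raise RuntimeError("no consistent triple of reads")


def fake_reader(values: Iterable[int]) -> Callable[[], int]:
    """Hypothetical helper: replay a fixed sequence of counter reads."""
    iterator = iter(values)
    return lambda: next(iterator)


# Replay the largest jump from the report, followed by three sane reads.
reads = [0x7FFFFFFFFF, 0x93FEFFFFFF, 0x8000000000,
         0x8000000001, 0x8000000002, 0x8000000003]
print(hex(read_with_three_reads(fake_reader(reads))))  # 0x8000000002
```

A bad value in the middle position is always rejected: a forward jump breaks `second < third`, and a backward jump breaks `first < second`. A bad value in the first or last position is harmless because only the middle read is returned. Two reads that land in the same counter tick also force a retry, which is why a successful check needs at least three counter cycles.
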
> > >
> > > However, since there is a distinct pattern to the bad values, we can
> > > optimize the common case (1022/1024 of the time) to a single read by
> > > simply ignoring values that match the error pattern. This still takes no
> > > more than 3 cycles in the worst case, and requires much less code. As an
> > > additional safety check, we still limit the loop iteration to the number
> > > of max-frequency (1.2 GHz) CPU cycles in three 24 MHz counter periods.
> > >
> > > For the TVAL registers, the simple solution is to not use them. Instead,
> > > read or write the CVAL and calculate the TVAL value in software.
> > >
> > > Although the manufacturer is aware of at least part of the erratum[4],
> > > there is no official name for it. For now, use the kernel-internal name
> > > "UNKNOWN1".
> > >
> > > [1]: https://github.com/armbian/build/commit/a08cd6fe7ae9
> > > [2]: https://forum.armbian.com/topic/3458-a64-datetime-clock-issue/
> > > [3]: https://irclog.whitequark.org/linux-sunxi/2018-01-26
> > > [4]: https://github.com/Allwinner-Homlet/H6-BSP4.9-linux/blob/master/drivers/clocksource/arm_arch_timer.c#L272
> >
> > nit: In general, I'm not overly keen on URLs in commit messages, as they
> > may vanish without notice and the commit message becomes less useful. In
> > the future, please keep those in the cover letter (though in this
> > particular case, the commit message explains the issue pretty well, so
> > no harm done once GitHub dies a horrible death... ;-).
> >
> > The fix itself looks pretty solid, and will hopefully make the
> > "AllLoosers" HW more usable.
>
> Unfortunately this patch doesn't completely eliminate the jumps. There
> have been reports from users who still saw 95y jump even with the
> patch applied.
>
> Personally I've seen it once or twice on my Pine64-LTS.
>
> Looks like we need bigger hammer. Does anyone have any idea what it could
> be?
>
> Regards,
> Vasily
>
> > Reviewed-by: Marc Zyngier <marc....@arm.com>
> >
> > Daniel, please consider this for v5.1.
> >
> > Thanks,
> >
> > M.
> > --
> > Jazz is not dead. It just smells funny...
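
For reference, the single-read filter that the quoted commit message describes can be modelled as follows. This is a hedged Python sketch, not the kernel implementation: the function names and `fake_reader` are hypothetical, the error pattern is taken to be "low 10 bits all zeros or all ones" as the description implies, and the bound of 150 retries is derived from the stated 1.2 GHz CPU clock and three 24 MHz counter periods (3 x 50 CPU cycles).

```python
from typing import Callable, Iterable

LOW_BITS_MASK = (1 << 10) - 1  # the low 10 bits mentioned in the description
MAX_RETRIES = 3 * (1_200_000_000 // 24_000_000)  # 3 counter periods = 150 CPU cycles


def looks_unstable(value: int) -> bool:
    """True if the low 10 bits are all ones or all zeros.

    Adding 1 maps all-ones to 0 and all-zeros to 1 in the masked field,
    so one comparison covers both patterns."""
    return ((value + 1) & LOW_BITS_MASK) <= 1


def read_filtered(read_counter: Callable[[], int],
                  max_retries: int = MAX_RETRIES) -> int:
    """Re-read until the value does not match the error pattern."""
    for _ in range(max_retries):
        value = read_counter()
        if not looks_unstable(value):
            return value
    raise RuntimeError("counter kept returning suspicious values")


def fake_reader(values: Iterable[int]) -> Callable[[], int]:
    """Hypothetical helper: replay a fixed sequence of counter reads."""
    iterator = iter(values)
    return lambda: next(iterator)


# Out of every 1024 consecutive low-bit patterns, 1022 pass on the first read.
print(sum(not looks_unstable(v) for v in range(1024)))  # 1022

# The largest reported jump: all three reads match the pattern and are skipped.
reads = [0x7FFFFFFFFF, 0x93FEFFFFFF, 0x8000000000, 0x8000000001]
print(hex(read_filtered(fake_reader(reads))))  # 0x8000000001
```

The filter is deliberately conservative: it also discards the correct value 0x8000000000, since a single read cannot tell a good all-zeros value from a bad one. It only guards against anomalies that show the checked pattern, which is consistent with the reports above that some jumps still get through.
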
> Unfortunately this patch doesn't completely eliminate the jumps. There
> have been reports from users who still saw 95y jump even with the
> patch applied.
>
> Personally I've seen it once or twice on my Pine64-LTS.

I can confirm that. Our team at Prusa Research has built a printer on top of the A64, and as soon as test production reached a few dozen units, bug reports started coming in.

> Looks like we need bigger hammer. Does anyone have any idea what it could
> be?

We decided to apply the QorIQ Erratum A-008585 (FSL_ERRATUM_A008585) workaround instead, and that solved the issue for us: over a thousand units have been shipped to our customers, and so far so good.

Best regards
Roman B.

-- 
You received this message because you are subscribed to the Google Groups "linux-sunxi" group.
To unsubscribe from this group and stop receiving emails from it, send an email to linux-sunxi+unsubscr...@googlegroups.com.
To view this discussion on the web, visit https://groups.google.com/d/msgid/linux-sunxi/4005fa8b-6f72-4f2a-b8cb-669c2eb8c067%40googlegroups.com.
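
For readers who have not looked at the A-008585 workaround: a minimal sketch, assuming it boils down to re-reading the counter until two back-to-back reads return the same value, with a bounded number of retries. The Python below is a userspace model only; `read_until_stable` and `fake_reader` are hypothetical names, the retry bound of 200 is an assumption, and raising on exhaustion is a choice made for this sketch rather than a statement about what the kernel does.

```python
from typing import Callable, Iterable


def read_until_stable(read_counter: Callable[[], int],
                      max_retries: int = 200) -> int:
    """Return a value once two consecutive reads agree."""
    for _ in range(max_retries):
        old = read_counter()
        new = read_counter()
        if old == new:
            return new
    raise RuntimeError("counter never produced two matching reads")


def fake_reader(values: Iterable[int]) -> Callable[[], int]:
    """Hypothetical helper: replay a fixed sequence of counter reads."""
    iterator = iter(values)
    return lambda: next(iterator)


# First pair disagrees (the glitch), second pair agrees and is accepted.
reads = [0x7FFFFFFFFF, 0x93FEFFFFFF, 0x8000000000, 0x8000000000]
print(hex(read_until_stable(fake_reader(reads))))  # 0x8000000000
```

This relies on the CPU being able to read the counter at least twice within one counter tick, which the test results quoted earlier in the thread suggest is the case on the A64. Unlike the pattern filter, it does not depend on which bits the glitch affects.
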