> Is your firmware (sysupgrade) bigger than 16MB?

No, the sysupgrade file is currently 13MB.

> So maybe it has to do with switching to 4-address-mode...

What is this exactly?

> My guess is that the error already happens when reading the flash.

At least we know that the flash is not being written to incorrectly,
since after a reboot the flash is intact and does not produce any errors.
It's simply random whether the system boots into this "faulty state" or
not (happens approx. 1-2% of the time).

Does anyone know how I can re-read the squashfs partition and verify its
integrity while the system is booted, to see if I encounter the squashfs
errors?

I'm really at a loss here - no idea where to even start diagnosing the
issue.

On Fri, May 21, 2021 at 6:16 PM Vincent Wiemann <[email protected]> wrote:
>
> On 5/21/21 3:58 PM, Koen Vandeputte wrote:
> >
> > On 21.05.21 13:19, Ibrahim Tachijian wrote:
> >> Hello,
> >>
> >> We use approximately 10k IPQ40XX devices and we have noticed that
> >> every time we run "sysupgrade -n" we lose approximately 1% of the
> >> routers in the process.
> >> After further investigation I'm almost confident that it is not the
> >> sysupgrade process that is the culprit - so what I did was that I put
> >> one test router into a reboot loop.
> >>
> >> This is what I do;
> >>
> >> Boot the router in a fresh state after a newly installed image.
> >> The image contains a reboot loop that consists of a shell script that
> >> runs every minute.
> >>
> >> The shell script tries to run a php-script which simply echoes "Hello
> >> World". If the php-script exits normally then we reboot the router.
> >>
> >> However, if the php-script exits abnormally then the router stops and
> >> does nothing other than informing me that there was a bus-error making
> >> php not able to process the hello world script.
> >>
> >> When this process runs the router reboots approximately 50 times
> >> before it boots into a state which is faulty where I see bus-errors
> >> when I try to run php scripts for example.
> >>
> >> Looking into dmesg you can see some errors such as,
> >>
> >> [10985.209438] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
> >> [11045.218685] SQUASHFS error: xz decompression failed, data probably corrupt
> >> [11045.218731] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
> >> [11105.228157] SQUASHFS error: xz decompression failed, data probably corrupt
> >> [11105.228203] SQUASHFS error: squashfs_read_data failed to read block 0x3a803e
> >>
> >> or
> >>
> >> [26218.687905] SQUASHFS error: Unable to read page, block 1b99a, size 10234
> >> [26221.057472] SQUASHFS error: Unable to read data cache entry [1b99a]
> >> [26221.057551] SQUASHFS error: Unable to read page, block 1b99a, size 10234
> >> [26221.062926] SQUASHFS error: Unable to read data cache entry [1b99a]
> >> [26221.069742] SQUASHFS error: Unable to read page, block 1b99a, size 10234
> >> [26224.460239] SQUASHFS error: Unable to read data cache entry [1b99a]
> >> [26224.460320] SQUASHFS error: Unable to read page, block 1b99a, size 10234
> >>
> >> or
> >>
> >> [62745.801178] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
> >> [62773.347234] SQUASHFS error: xz decompression failed, data probably corrupt
> >> [62773.347281] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
> >> [62790.132661] SQUASHFS error: xz decompression failed, data probably corrupt
> >> [62790.132706] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
> >> [62790.216746] SQUASHFS error: xz decompression failed, data probably corrupt
> >> [62790.216792] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
> >> [62800.810525] SQUASHFS error: xz decompression failed, data probably corrupt
> >> [62800.810570] SQUASHFS error: squashfs_read_data failed to read block 0x732ae2
> >> [62828.336267] SQUASHFS error: xz decompression failed, data probably corrupt
> >>
> >> Now, you would assume that the squashfs-partition is broken - but if
> >> this was the case then a reboot should not help. It does.
> >> Rebooting the router after it boots in this faulty state fixes the issue.
> >>
> >> So approximately 1-2% of my reboots make the router go into this
> >> faulty state.
> >>
> >> I am clueless on how to further investigate this issue. For now my
> >> work around is restarting the router via a bash script should it
> >> notice there are bus-errors or i/o errors.
> >>
> >> Thanks
> >>
> > In the next kernel bump, following patch is also present:
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=v5.10.38&id=2ed1d90162a0c0683ecbe0c4802187fa22d641c3
> >
> > I think it's worth a shot to retry the tests once it's bumped.
> >
> > Koen
> >
>
> My guess is that the error already happens when reading the flash.
> Is your firmware (sysupgrade) bigger than 16MB?
> So maybe it has to do with switching to 4-address-mode...
>
> Best,
>
> Vincent
>
> _______________________________________________
> openwrt-devel mailing list
> [email protected]
> https://lists.openwrt.org/mailman/listinfo/openwrt-devel

--
Ibrahim Tachijian
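
For the question above about re-reading the squashfs partition on a running system, here is a rough sketch of one way it might be done. It reads a file or block device several times and compares per-chunk SHA-256 digests between passes. It assumes python3 is available on the router, or that the script is run wherever the partition can be read. The function names (chunk_digests, find_mismatches, verify, drop_page_cache, main) and the 64 KiB chunk size are made up for illustration. Which device node holds the rootfs is not something I know for these boards; /proc/mtd lists the partitions. /proc/sys/vm/drop_caches is the standard Linux knob for dropping clean page cache, and without it later passes may be answered from RAM rather than from the flash.

```python
#!/usr/bin/env python3
"""Re-read a file or block device several times and compare chunk digests."""
import hashlib
import sys

CHUNK_SIZE = 64 * 1024  # arbitrary chunk size chosen for this sketch


def chunk_digests(path, chunk_size=CHUNK_SIZE):
    """Return one SHA-256 hex digest per chunk_size-sized chunk of path."""
    digests = []
    with open(path, "rb") as f:
        while True:
            block = f.read(chunk_size)
            if not block:
                break
            digests.append(hashlib.sha256(block).hexdigest())
    return digests


def find_mismatches(reference, other):
    """Return indices of chunks that differ or exist in only one list."""
    length = max(len(reference), len(other))
    return [
        i for i in range(length)
        if i >= len(reference) or i >= len(other) or reference[i] != other[i]
    ]


def drop_page_cache():
    """Ask Linux to drop clean caches so the next pass hits the flash again."""
    try:
        with open("/proc/sys/vm/drop_caches", "w") as f:
            f.write("3\n")
    except OSError:
        pass  # not root or not Linux: later passes may come from RAM


def verify(path, passes=3, chunk_size=CHUNK_SIZE, drop_caches=False):
    """Read path `passes` times; map pass number -> mismatching byte offsets.

    The first pass is the reference, so an empty result only means that all
    passes agreed with each other, not that the data is known to be good.
    """
    reference = chunk_digests(path, chunk_size)
    report = {}
    for pass_no in range(2, passes + 1):
        if drop_caches:
            drop_page_cache()
        bad = find_mismatches(reference, chunk_digests(path, chunk_size))
        if bad:
            report[pass_no] = [i * chunk_size for i in bad]
    return report


def main(argv):
    """CLI entry point: main(["prog", "<file-or-device>"]) -> exit status."""
    if len(argv) != 2:
        print("usage: reread_check.py <file-or-device>", file=sys.stderr)
        return 2
    report = verify(argv[1], drop_caches=True)
    for pass_no, offsets in sorted(report.items()):
        print("pass %d differs at offsets: %s"
              % (pass_no, ", ".join(hex(o) for o in offsets)))
    return 1 if report else 0

# To use as a script, end the file with: raise SystemExit(main(sys.argv))
```

Since a pass-to-pass comparison only catches reads that change between passes, comparing the digests against a dump taken from a unit known to be in the good state would probably be a stronger check. A read that fails outright would surface as an OSError from read(), which this sketch does not catch.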
