Mark Cave-Ayland <mark.cave-ayl...@ilande.co.uk> writes:
> On 19/02/2019 18:25, Emilio G. Cota wrote:
>
>> On Tue, Feb 19, 2019 at 14:22:37 +0000, Alex Bennée wrote:
>>> Emilio,
>>>
>>> Any chance you could run it through your benchmark suite?
>>
>> Something isn't quite right. For instance, gcc in SPEC doesn't
>> complete; it fails after 2s with some errors about the syntax of
>> the source being compiled. Before the patches the error isn't
>> there (the test completes in ~7s, as usual), so I suspect
>> some guest memory accesses aren't going to the right place.
>>
>> I am too busy right now to try to debug this, but if you have
>> patches to test, I can run them. The patches I tested are in
>> your v3 branch on github, i.e. with 430f28154b5 at the head.
>
> I spent a few hours yesterday and today testing this patchset against my
> OpenBIOS test images and found that all of them were able to boot, except
> for one MacOS image which started exhibiting flashing blocks on a progress
> bar during boot. Tracing this back, I found the problem was still present
> when just the first patch "accel/tcg: demacro cputlb" was applied.

Yeah, in the original version of the patch I de-macroed each element one by
one, which made bisecting errors easier. This is more of a big bang approach,
which has made it harder to find which bit of the conversion broke. That
said, I did test it fairly hard on ARM, but I guess we see fewer unaligned
and page-crossing accesses there.

> First of all there were a couple of typos I found in load_helper() and
> store_helper() which can be fixed up with the following diff:
>
> diff --git a/accel/tcg/cputlb.c b/accel/tcg/cputlb.c
> index 351f579fed..f3cc2dd0d9 100644
> --- a/accel/tcg/cputlb.c
> +++ b/accel/tcg/cputlb.c
> @@ -1251,7 +1251,7 @@ static tcg_target_ulong load_helper(CPUArchState *env, target_ulong addr,
>          }
>
>          tmp = io_readx(env, iotlbentry, mmu_idx, addr, retaddr,
> -                       addr & tlb_addr & TLB_RECHECK,
> +                       tlb_addr & TLB_RECHECK,
>                         code_read ? MMU_INST_FETCH : MMU_DATA_LOAD, size);
>          return handle_bswap(tmp, size, big_endian);
>      }
> @@ -1405,14 +1405,14 @@ tcg_target_ulong
>  helper_le_ldsl_mmu(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
>                     uintptr_t retaddr)
>  {
> -    return (int32_t)helper_le_lduw_mmu(env, addr, oi, retaddr);
> +    return (int32_t)helper_le_ldul_mmu(env, addr, oi, retaddr);
>  }
>
>  tcg_target_ulong
>  helper_be_ldsl_mmu(CPUArchState *env, target_ulong addr, TCGMemOpIdx oi,
>                     uintptr_t retaddr)
>  {
> -    return (int32_t)helper_be_lduw_mmu(env, addr, oi, retaddr);
> +    return (int32_t)helper_be_ldul_mmu(env, addr, oi, retaddr);
>  }

Awesome, I shall apply those.

> @@ -1427,6 +1427,7 @@ static void store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
>      uintptr_t index = tlb_index(env, mmu_idx, addr);
>      CPUTLBEntry *entry = tlb_entry(env, mmu_idx, addr);
>      target_ulong tlb_addr = tlb_addr_write(entry);
> +    const size_t tlb_off = offsetof(CPUTLBEntry, addr_write);
>      unsigned a_bits = get_alignment_bits(get_memop(oi));
>      uintptr_t haddr;
>
> @@ -1438,7 +1439,8 @@ static void store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
>
>      /* If the TLB entry is for a different page, reload and try again.  */
>      if (!tlb_hit(tlb_addr, addr)) {
> -        if (!VICTIM_TLB_HIT(addr_write, addr)) {
> +        if (!victim_tlb_hit(env, mmu_idx, index, tlb_off,
> +                            addr & TARGET_PAGE_MASK)) {
>              tlb_fill(ENV_GET_CPU(env), addr, size, MMU_DATA_STORE,
>                       mmu_idx, retaddr);
>              index = tlb_index(env, mmu_idx, addr);
> @@ -1466,7 +1468,6 @@ static void store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
>          && unlikely((addr & ~TARGET_PAGE_MASK) + size - 1
>                      >= TARGET_PAGE_SIZE)) {
>          int i;
> -        uintptr_t index2;
>          CPUTLBEntry *entry2;
>          target_ulong page2, tlb_addr2;
>      do_unaligned_access:
> @@ -1476,15 +1477,13 @@ static void store_helper(CPUArchState *env, target_ulong addr, uint64_t val,
>           * cannot evict the first.
>           */
>          page2 = (addr + size) & TARGET_PAGE_MASK;
> -        index2 = tlb_index(env, mmu_idx, page2);
> -        entry2 = tlb_entry(env, mmu_idx, index2);
> +        entry2 = tlb_entry(env, mmu_idx, page2);
>          tlb_addr2 = tlb_addr_write(entry2);
>          if (!tlb_hit_page(tlb_addr2, page2)
> -            && !VICTIM_TLB_HIT(addr_write, page2)) {
> +            && !victim_tlb_hit(env, mmu_idx, index, tlb_off,
> +                               page2 & TARGET_PAGE_MASK)) {
>              tlb_fill(ENV_GET_CPU(env), page2, size, MMU_DATA_STORE,
>                       mmu_idx, retaddr);
> -            index2 = tlb_index(env, mmu_idx, page2);
> -            entry2 = tlb_entry(env, mmu_idx, index2);
>          }

Oops, I thought I'd applied the victim fix to both loads and stores.

> Even with this patch applied I was still seeing issues with the progress
> bar, so I ended up manually unrolling softmmu_template.h myself until I
> could see at what point things were breaking.
>
> The culprit turned out to be the change in type of res in load_helper()
> from the size-related type of uint8_t/uint16_t/uint32_t/uint64_t to
> tcg_target_ulong in the slow path, as seen from my locally unrolled
> version:
>
> WORKING:
>
> tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
>                                     TCGMemOpIdx oi, uintptr_t retaddr)
> {
>     ....
>
>     /* Handle slow unaligned access (it spans two pages or IO).  */
>     if (4 > 1
>         && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
>                     >= TARGET_PAGE_SIZE)) {
>         target_ulong addr1, addr2;
>         uint32_t res1, res2;
>         ^^^^^^^^
>         unsigned shift;
>     do_unaligned_access:
>         addr1 = addr & ~(4 - 1);
>         addr2 = addr1 + 4;
>         res1 = helper_le_ldul_mmu(env, addr1, oi, retaddr);
>         res2 = helper_le_ldul_mmu(env, addr2, oi, retaddr);
>         shift = (addr & (4 - 1)) * 8;
>
>         /* Little-endian combine.  */
>         res = (res1 >> shift) | (res2 << ((4 * 8) - shift));
>         return res;
>     }
>
> }
>
> tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
>                                     TCGMemOpIdx oi, uintptr_t retaddr)
> {
>     ....
>
>     /* Handle slow unaligned access (it spans two pages or IO).  */
>     if (4 > 1
>         && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
>                     >= TARGET_PAGE_SIZE)) {
>         target_ulong addr1, addr2;
>         uint32_t res1, res2;
>         ^^^^^^^^
>         unsigned shift;
>     do_unaligned_access:
>         addr1 = addr & ~(4 - 1);
>         addr2 = addr1 + 4;
>         res1 = helper_be_ldul_mmu(env, addr1, oi, retaddr);
>         res2 = helper_be_ldul_mmu(env, addr2, oi, retaddr);
>         shift = (addr & (4 - 1)) * 8;
>
>         /* Big-endian combine.  */
>         res = (res1 << shift) | (res2 >> ((4 * 8) - shift));
>         return res;
>     }
> }
>
> CORRUPTED:
>
> tcg_target_ulong helper_le_ldul_mmu(CPUArchState *env, target_ulong addr,
>                                     TCGMemOpIdx oi, uintptr_t retaddr)
> {
>     ....
>
>     /* Handle slow unaligned access (it spans two pages or IO).  */
>     if (4 > 1
>         && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
>                     >= TARGET_PAGE_SIZE)) {
>         target_ulong addr1, addr2;
>         tcg_target_ulong res1, res2;
>         ^^^^^^^^^^^^^^^^
>         unsigned shift;
>     do_unaligned_access:
>         addr1 = addr & ~(4 - 1);
>         addr2 = addr1 + 4;
>         res1 = helper_le_ldul_mmu(env, addr1, oi, retaddr);
>         res2 = helper_le_ldul_mmu(env, addr2, oi, retaddr);
>         shift = (addr & (4 - 1)) * 8;
>
>         /* Little-endian combine.  */
>         res = (res1 >> shift) | (res2 << ((4 * 8) - shift));
>         return res;
>     }
>
> }
>
> tcg_target_ulong helper_be_ldul_mmu(CPUArchState *env, target_ulong addr,
>                                     TCGMemOpIdx oi, uintptr_t retaddr)
> {
>     ....
>
>     /* Handle slow unaligned access (it spans two pages or IO).  */
>     if (4 > 1
>         && unlikely((addr & ~TARGET_PAGE_MASK) + 4 - 1
>                     >= TARGET_PAGE_SIZE)) {
>         target_ulong addr1, addr2;
>         tcg_target_ulong res1, res2;
>         ^^^^^^^^^^^^^^^^
>         unsigned shift;
>     do_unaligned_access:
>         addr1 = addr & ~(4 - 1);
>         addr2 = addr1 + 4;
>         res1 = helper_be_ldul_mmu(env, addr1, oi, retaddr);
>         res2 = helper_be_ldul_mmu(env, addr2, oi, retaddr);
>         shift = (addr & (4 - 1)) * 8;
>
>         /* Big-endian combine.  */
>         res = (res1 << shift) | (res2 >> ((4 * 8) - shift));
>         return res;
>     }
> }
>
> Presumably the issue here is somehow related to the compiler incorrectly
> extending/reducing the shift when the larger type is involved? Also during
> my tests the visual corruption was only present for 32-bit accesses, but
> presumably all the access sizes will need a similar fix.

So I did fix this in the third patch when I out-of-lined the unaligned
helpers, with:

    const tcg_target_ulong size_mask = MAKE_64BIT_MASK(0, size * 8);

and

    /* Big-endian combine.  */
    r2 &= size_mask;

or

    /* Little-endian combine.  */
    r1 &= size_mask;
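Putting that together, the slow-path combine ends up looking roughly like
the sketch below. Only the size_mask definition and the r1/r2 masking are
from the patch; the big_endian flag and the surrounding structure are just
for illustration here:

    const tcg_target_ulong size_mask = MAKE_64BIT_MASK(0, size * 8);
    ...
    shift = (addr & (size - 1)) * 8;
    if (big_endian) {
        /* Big-endian combine: r2 is shifted right, so any stale bits
         * above size * 8 would land in the result - mask them first.  */
        r2 &= size_mask;
        res = (r1 << shift) | (r2 >> ((size * 8) - shift));
    } else {
        /* Little-endian combine: here it is r1 that is shifted right,
         * so it is the one that gets masked.  */
        r1 &= size_mask;
        res = (r1 >> shift) | (r2 << ((size * 8) - shift));
    }

i.e. whichever half gets shifted down towards bit 0 is truncated to the
access size first, so junk in the upper bits of the wider tcg_target_ulong
can't leak into the combined value.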
I guess I should also apply that to patch 1 for bisectability.

Thanks for digging into the failures. It does make me think that we really
could do with some system mode tests to properly exercise all the softmmu
code paths:

  * each size of access, load and store
  * unaligned accesses, load and store (+ potential arch-specific handling)
  * page-striding accesses, faulting and non-faulting

I'm not sure how much work it would be to get a minimal bare-metal kernel
that would be able to do these and report success/fail. When I did the MTTCG
work I used kvm-unit-tests as a fairly complete mini-os, but maybe that's
overkill for what we need here. After all, we already have migration kernels
that just do a simple tick-tock for testing migration. It would be nice to
have the tests in C, with maybe some minimal per-arch assembly, so we can
share the tests across the various platforms.

--
Alex Bennée