On Fri, Feb 12, 2021 at 1:28 PM Catalin Marinas <catalin.mari...@arm.com> wrote:
> The only downside I think is that for some syscalls it's not that
> efficient. Those using struct iovec come to mind, qemu probably
> duplicates the user structures, having to copy them in both directions
> (well, the kernel compat layer does something similar).
>
> Anyway, I'm not in favour of this patch. Those binary translation tools
> need to explore the user-only options first and come up with some perf
> numbers to justify the proposal.

I'd like to elaborate on Tango's point of view on this problem.

Quick recap: Tango allows AArch32 programs to run on AArch64 CPUs that
don't support 32-bit mode. The primary use case is supporting 32-bit
Android apps, which means that Tango needs to be able to support the
full set of syscalls used on Android, including interfacing with many
drivers that are not in the mainline kernel.

The patch proposed by Ryan is based on the kernel patch used by Tango,
which can be found here:
https://github.com/Amanieu/linux/tree/tango-v5.4

Efficiency is not the concern here: copying/rearranging some bytes is
tiny compared to the cost of a syscall. The main concern is
correctness: there are many cases where userspace does not have the
information or the capabilities needed to ensure that the 32-bit
syscall ABI is correctly emulated.

There are two distinct parts to this: compat syscall emulation and mmap
address selection. I will address each separately.

Part 1: Compat syscall emulation

Even with this patch, Tango doesn't just pass 32-bit syscalls through
to the kernel directly. We have ~5000 lines of code dealing with
various details such as memory management, signal handling, /proc
emulation, ptrace emulation, etc. However, once this is done, Tango
will pass the syscall through to the kernel as a 32-bit compat syscall
instead of as a 64-bit syscall.

Here are several issues, off the top of my head, which are impossible
or impractical to support in user mode:

- As mentioned before, there are a huge number of ioctls which behave
  differently in 32-bit mode. It is impractical and error-prone to
  manually emulate them all in user mode. Specifically, the kernel
  already has a well-tested and reliable compatibility layer and it
  makes sense to reuse this. QEMU supports emulating some ioctls in
  userspace but this still does not cover devices like GPUs which are
  needed for accelerated rendering.
- The 64-bit set_robust_list is not compatible with the 32-bit ABI.
  The compat version of set_robust_list must be used. Emulating this in
  user mode is not reliable since SIGKILL cannot be caught.
- io_uring uses iovec structures as part of its API, which have
  different sizes on 32-bit and 64-bit. This makes io_uring unusable.
- ext4 represents positions in directories as a 64-bit hash, which
  breaks if it is truncated to 32 bits. There is special support for
  32-bit off_t in the ext4 driver but this is only used when
  in_compat_syscall is true. QEMU also suffers from this problem:
  https://bugzilla.kernel.org/show_bug.cgi?id=205957

Additionally, for Tango we want 32-bit programs to be able to use
seccomp filters, which is required by the Android CTS. Tango intercepts
seccomp filter installation and inserts a prefix which whitelists
64-bit syscalls used internally by Tango and passes the rest through to
the user seccomp filter. For this to work, the kernel must report
32-bit syscalls from 64-bit processes as AUDIT_ARCH_ARM with the compat
syscall number.

These issues are all solved by exposing compat syscalls to 64-bit
processes and ensuring is_compat_task/in_compat_syscall is true for the
duration of that syscall. There is a precedent for this: on x86,
syscalls made with int 0x80 are treated as 32-bit syscalls even if they
come from a 64-bit process.

Aside from seccomp support, this also solves FEX's concerns for
x86-to-AArch64 translation. There are of course some structures with
architecture-specific differences (e.g. epoll, stat, statfs) which have
to be translated manually in userspace, but the vast majority of the
ABI differences are simply due to 32/64-bit differences which apply to
all architectures.

Part 2: mmap address range

A binary translator such as FEX or Tango generally splits the address
space into two parts: the lower 4GB are reserved for the use of the
32-bit process and the rest of the address space is for the
translator's internal use (e.g. JIT cache).
It is important that any VM regions allocated through syscalls by the
translated application be located in the lower 4GB.

QEMU reserves 4GB of address space with PROT_NONE and maps chunks out
of it for the application with MAP_FIXED as needed. However, this
doesn't work for all cases:

- The io_setup syscall allocates a VM area for the AIO context and
  returns it. But there is no way to control where this context is
  allocated so it will almost always end up above the 4GB limit.
- Some ioctls will also perform VM allocations, with the same issues as
  io_setup. Search for "vm_mmap" in drivers/.
- Some file descriptors have alignment requirements which are not known
  to userspace. For example, a hugetlbfs file can only be mmapped at a
  huge page alignment but there is no way for userspace to know this
  when selecting an address.
- The Mali kbase out-of-tree driver outright forbids MAP_FIXED when
  mapping GPU memory and insists on selecting a properly aligned
  address itself.
- shmat and shmdt are particularly difficult to emulate since the
  length of the mapping is not passed in as a parameter. They also
  suffer from race conditions since shmdt leaves a gap in the 4GB
  reserved space which could be filled in by a concurrent mmap
  operation.

The solution proposed in this patch is to use a separate mmap_base when
a compat syscall is being executed by a 64-bit process. This mmap_base
is separately randomized on process startup so that translated
processes benefit from the additional security. All VM allocations
performed by 32-bit-under-64-bit syscalls will be done in the low 4GB
using this new mmap_base, while 64-bit syscalls used by the translator
continue to use the original mmap_base.

A possible alternative approach would be to use a prctl to restrict the
mmap range of the process and allow the translator to manually specify
its mmap_base.
Any allocations that the translator needs to perform above 4GB would
then need to be done with MAP_FIXED, which is workable albeit slightly
inconvenient. The main advantage of this alternative is that it is not
tied to compat syscalls.

An extension to mmap which allows a custom address range to be
specified does *not* solve all of the issues listed above, which
primarily come from VM allocations performed by syscalls other than
mmap.
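
For reference, the reserve-then-MAP_FIXED scheme described in Part 2
can be sketched roughly as below. This is a minimal illustration, not
QEMU's or Tango's actual code: the guest_window struct and both helper
names are made up, the window is placed wherever the kernel chooses
rather than in the low 4GB, and chunks are handed out with a simple
bump offset. It only governs mappings the translator creates itself,
which is exactly why allocations made inside the kernel (io_setup,
driver ioctls, shmat) escape it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

/* Hypothetical bookkeeping for a reserved guest window. */
struct guest_window {
    uint8_t *base; /* start of the PROT_NONE reservation */
    size_t size;   /* total reserved bytes */
    size_t next;   /* offset of the next unused byte, page-aligned */
};

/* Reserve address space without committing memory. Returns 0 on success. */
static int guest_window_reserve(struct guest_window *w, size_t size)
{
    void *p = mmap(NULL, size, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED)
        return -1;
    w->base = p;
    w->size = size;
    w->next = 0;
    return 0;
}

/* Carve len bytes (rounded up to a page) out of the window via MAP_FIXED. */
static void *guest_window_map(struct guest_window *w, size_t len)
{
    size_t page = (size_t)sysconf(_SC_PAGESIZE);

    assert(len > 0);
    if (len > SIZE_MAX - (page - 1))
        return MAP_FAILED; /* rounding would overflow */
    len = (len + page - 1) & ~(page - 1);
    if (len > w->size - w->next)
        return MAP_FAILED; /* window exhausted */

    /* MAP_FIXED replaces the PROT_NONE pages at exactly this address. */
    void *p = mmap(w->base + w->next, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);
    if (p != MAP_FAILED)
        w->next += len;
    return p;
}
```

A translator would call guest_window_reserve() once at startup and
route every guest mmap through guest_window_map() or something like it.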
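
To make the iovec size difference from Part 1 concrete, here is a
minimal sketch of the widening a user-mode translator has to do before
handing a 32-bit iovec array to a 64-bit syscall. The names iovec32 and
widen_iovecs are hypothetical and not taken from Tango, QEMU or the
kernel. This covers only the simple case where the array is passed to
the syscall directly.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <sys/uio.h>

/* struct iovec as laid out by a 32-bit process (hypothetical name). */
struct iovec32 {
    uint32_t iov_base; /* 32-bit guest pointer */
    uint32_t iov_len;  /* 32-bit length */
};

/* Copy n 32-bit entries into native struct iovec entries. */
static void widen_iovecs(const struct iovec32 *in, struct iovec *out, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        /* Zero-extend the guest pointer to the host pointer width. */
        out[i].iov_base = (void *)(uintptr_t)in[i].iov_base;
        out[i].iov_len = in[i].iov_len;
    }
}
```

Each 32-bit entry is 8 bytes, while the native struct iovec holds a
pointer and a size_t, so the array cannot be passed through unchanged
on a 64-bit host.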