I don't know whether this is worth it. I would need to spend a lot more time
with the simulator and oprofile. But once I had the idea I had to write it
down. And if somebody decides to build a non-PIC ABI, this is definitely worth
it.
I mentioned before that I've been wondering about the tiny number of TLB
entries on the low-end MIPS-based routers. Otherwise-decent ath71xx routers
like the DIR-825 have 16 JTLB entries to cover 64M of RAM. The DIR-601-A1 has
32M, and it's selling for $13[1] refurbished on newegg.com right now.
The 3.3.8 kernel on these devices seems to see ~47 cycles of latency per TLB
miss.[2] These devices have 64k icache/32k dcache; cache misses are ~42
cycles. Because of how jumpy PIC is, I think we can start missing the TLB
before missing the cache....
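To put rough numbers on how tight 16 entries is, here's the back-of-the-envelope arithmetic (each MIPS JTLB entry maps an aligned pair of pages, so reach is entries x 2 x page size):

```python
JTLB_ENTRIES = 16   # as on the ath71xx parts mentioned above

def tlb_reach(page_size):
    """Bytes addressable without a JTLB miss: each entry maps a page pair."""
    return JTLB_ENTRIES * 2 * page_size

for size in (4 << 10, 16 << 10):
    print(f"{size >> 10}kB pages: {tlb_reach(size) >> 10}kB reach")
```

At 4kB pages that's 128kB of reach against 64MB of RAM; 16kB pages only get you to 512kB.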
One no-code solution is to turn on 16kB pages. Unlike 64kB pages, 16k at
least boots on qemu. Looking at /proc/self/maps for a shell, the number of
TLB entries to cover the whole process would go from 167 pairs to 54.[3] The
reason it's not 4:1 is granularity; there are costs to separating libc.so
from libm, libgcc_s, and libcrypt. Which reminds me: it might be worth it to
do some really cheap profiling to let gcc separate hot from cold functions.
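The pair counting above can be reproduced from any maps file. A sketch (the sample excerpt is made up; real maps lines have the same shape):

```python
def count_pairs(maps_text, page_size=4096):
    """Count aligned double-page slots touched by any mapping.

    Each MIPS JTLB entry covers one aligned pair of pages, so the
    number of distinct pair-sized slots is the number of entries
    needed to map the whole process without a miss.
    """
    pair = 2 * page_size
    slots = set()
    for line in maps_text.splitlines():
        if not line.strip():
            continue
        start, end = (int(x, 16) for x in line.split()[0].split('-'))
        slots.update(range(start // pair, (end + pair - 1) // pair))
    return len(slots)

# Hypothetical two-line maps excerpt, just to show the format:
sample = """\
00400000-00410000 r-xp 00000000 1f:02 123 /bin/busybox
00450000-00452000 rw-p 00040000 1f:02 123 /bin/busybox
"""

print(count_pairs(sample, 4096), count_pairs(sample, 16384))  # → 9 3
```

Running it over `cat /proc/$$/maps` on the target gives the 167-vs-54 style numbers directly.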
Just about every process has libc and libgcc_s mapped, and many will have libm.
Busybox is a good proportion of workload too. Counting the pure read-only pages
I see 127 page pairs out of that 167 (or 36 out of 54) which could be mapped
by a single large TLB entry.[4]
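A quick way to sanity-check that claim is to sum the pure read-only mappings and compare against one big pair entry (MIPS PageMask pair sizes run 4k, 16k, 64k, ... per side, so e.g. a 1MB-per-side entry covers 2MB). Sketch, with made-up addresses:

```python
def readonly_bytes(maps_text):
    """Sum bytes of mappings that are readable but not writable."""
    total = 0
    for line in maps_text.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[1].startswith('r') and 'w' not in fields[1]:
            start, end = (int(x, 16) for x in fields[0].split('-'))
            total += end - start
    return total

def fits_one_entry(nbytes, side=1 << 20):
    """Would nbytes fit under a single JTLB pair of side-byte pages?"""
    return nbytes <= 2 * side

# Hypothetical excerpt: libc text, libc data, libm text.
sample = """\
2aaa8000-2ab48000 r-xp 00000000 1f:02 200 /lib/libc.so.0
2ab48000-2ab50000 rw-p 000a0000 1f:02 200 /lib/libc.so.0
2ab50000-2ab60000 r-xp 00000000 1f:02 201 /lib/libm.so.0
"""
```

Here `readonly_bytes(sample)` is 0xB0000 (704kB), which fits comfortably under one 2MB pair entry.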
There's existing infrastructure in the Linux kernel for manually mapping huge
pages. Unfortunately on MIPS it only works on 64-bit kernels. But it's easy
to take a power-of-two aligned chunk of physmem and just slam it into every
process's address space at some aligned virtual address. That's just adding a
WIRED TLB entry with the Global bit set, and marking that range as not
available to mmap.
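The reason one wired entry serves every process is the Global bit: a global entry matches regardless of the current ASID. A toy model of the match logic (field names and layout are illustrative, not the real cop0 encoding):

```python
class TlbEntry:
    """Toy MIPS JTLB entry: one aligned pair of pages, optional Global bit."""

    def __init__(self, vpn2, asid, global_bit, page_size):
        self.vpn2 = vpn2                # index of the double-page region
        self.asid = asid
        self.g = global_bit             # Global: ignore ASID on compare
        self.page_size = page_size      # bytes per side of the pair

    def matches(self, vaddr, current_asid):
        same_region = (vaddr // (2 * self.page_size)) == self.vpn2
        return same_region and (self.g or self.asid == current_asid)

# One WIRED entry: a 2MB global read-only segment at a hypothetical
# 0x70000000, i.e. a pair of 1MB pages.
wired = TlbEntry(vpn2=0x70000000 // (2 << 20), asid=0, global_bit=True,
                 page_size=1 << 20)
```

Whatever ASID the scheduler installs, `wired.matches()` hits for any address in the 2MB window, which is exactly the property the shared segment relies on.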
This does not necessarily interfere with the basic shared library scheme. ELF
PIC code does need per-process initialized data, and data including the GOT
needs to be at a fixed distance from the code segment. But nothing says they
have to be adjacent, just reachable via PC-relative addressing. A small
linker script[5] can push the data segment 2M away. The address space would look
like:
++++++
main program
heap
...
normal shared libraries
....
=== fixed global readonly segment ===
libc.text
...
libm.text
...
libbusybox.text
=== end global readonly segment ===
...
libc.data = libc.text + 2M
...
libm.data = libm.text + 2M
...
libbusybox.data = libbusybox.text + 2M
....
stack
++++++
The primary change required would be to teach ld.so about the global segment
and the objects present in it. When ld.so would start to load /lib/libc.so, it
would notice the hard read-only libc.text segment was already present at
address x, skip mapping libc.text but keep x as the vma offset for the rest of
/lib/libc.so's segments, which would be mmapped per process as usual.
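The ld.so-side decision could be sketched like this (all names here are hypothetical; `mmap_text`/`mmap_data` stand in for the real segment-mapping calls):

```python
DATA_OFFSET = 2 << 20   # the 2M gap inserted by the linker script

# Registry built at boot: soname -> text address inside the wired
# global read-only segment (contents are illustrative).
global_segment = {'libc.so.0': 0x70000000}

def load_library(soname, mmap_text, mmap_data):
    """Return (text_addr, data_addr) for one shared object."""
    if soname in global_segment:
        text = global_segment[soname]   # text already wired in; skip the mmap
    else:
        text = mmap_text(soname)        # normal per-process mapping
    # Data (including the GOT) is still mapped per process, at the
    # fixed distance the link script established.
    data = mmap_data(soname, text + DATA_OFFSET)
    return text, data
```

Libraries absent from the global segment fall through to the normal path, so the same ld.so handles both kinds of objects.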
Note that a squashfs containing libraries with these 2M gaps should function as
normal if the normal ld.so is used. (Well, the libraries will eat up 2M more
virtual address space each, but it's just PROT_NONE mappings.) The global
segment can be positioned randomly on boot, subject to alignment constraints,
and the position of individual text segments can be shuffled. Although you keep
per-boot ASLR, you do lose per-process ASLR.
The segment is read-only; how would you get anything in there? My guess is that
the global segment could be read/write until ld.so launches the "real" init.
The /lib filesystem is available, and if the global segment were read-write at
that point, ld.so could position the text segments normally, although it'd have
to memcpy instead of mmap them into place.
OK, I'm done. Need to get back to Real Work....
Jay
[1]: Well, the DIR-601 is $13 plus a heatsink. I think they have some serious
overheating issues--which would explain why there are so many refurbished ones.
I had one in my basement, and when the weather turned cold it stopped locking
up. Or maybe it was some 12.09 fix....
[2]: There are non-architected 4-entry micro-ITLBs and -DTLBs; a micro-TLB
miss that hits in the JTLB costs a cycle.
[3]: For mips16 busybox it's 153->148 pairs, or 54->48 pairs.
[4]: And for mips16 busybox there are 108 read-only pairs, or 30 pairs with 16k
pages.
[5]: gcc -shared -T offset.ld, where offset.ld is:
SECTIONS
{
  . = . + 0x00200000;
  .fakedata : { *(.fakedata) }
}
INSERT AFTER .exception_ranges;
_______________________________________________
openwrt-devel mailing list
[email protected]
https://lists.openwrt.org/mailman/listinfo/openwrt-devel