Consider what happens when you reverse the page and byteINpage loops:
for (byteINpage = 0; byteINpage < 4096; byteINpage++)
    for (page = 0; page < bytes; page = page + 4096)
where you touch a byte in each page before going to the second byte. The working set becomes "terrible". And so does the performance.
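For anyone who wants to try the comparison, here is a rough sketch of the two loop orders in C. The function names, the PAGE_SIZE macro, and the increment used as the "touch" are my own illustration, not anything from the original program; it assumes a 4096-byte page as in the loop above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 4096 /* assumed page size, taken from the loop above */

/* Page-major order: touch every byte of one page before moving on.
 * Only one page needs to be resident at a time. */
static size_t touch_page_major(unsigned char *buf, size_t bytes)
{
    size_t touched = 0;
    assert(buf != NULL);
    for (size_t page = 0; page < bytes; page += PAGE_SIZE)
        for (size_t byteINpage = 0; byteINpage < PAGE_SIZE; byteINpage++)
            if (page + byteINpage < bytes) { /* guard a partial last page */
                buf[page + byteINpage]++;
                touched++;
            }
    return touched;
}

/* Byte-major order (the reversed loops): touch byte 0 of every page,
 * then byte 1 of every page, and so on. Every page is revisited on
 * each pass of the outer loop, so the working set is the whole buffer. */
static size_t touch_byte_major(unsigned char *buf, size_t bytes)
{
    size_t touched = 0;
    assert(buf != NULL);
    for (size_t byteINpage = 0; byteINpage < PAGE_SIZE; byteINpage++)
        for (size_t page = 0; page < bytes; page += PAGE_SIZE)
            if (page + byteINpage < bytes) {
                buf[page + byteINpage]++;
                touched++;
            }
    return touched;
}
```

Both functions touch exactly the same bytes; only the order differs. To see the effect, call each one on a buffer larger than available real memory and time it. How bad the second one gets will depend on the system's paging policy.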
And you're surprised by this?
What happens if you run an equivalent program on VM?
It's been a loooooong time since I looked at the VM paging algorithms... and I don't really want to say what I think happens, in case they've been radically altered in the last few releases. I don't want to look too stupid :-)
Rod