With autovac off I see 8.3 as faster than 8.2 in pgbench.

Indeed. I'm seeing much better pgbench results from HEAD than 8.2 when I set the configurations up identically. I'm hoping to have a comparison set to show everyone this week.

and use -t at least 1000 or so (otherwise startup transients are significant).

I personally consider any pgbench run that lasts less than several minutes to be noise. On a system that hits 500 TPS like Pavel's, I'd want to see around 100,000 transactions before I consider the results significant. And then I'd want a set of three runs at each configuration, because even with longer runs you occasionally get really odd results. Until you have three, it can be unclear which one is the weird one.
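
To make that arithmetic concrete, here's a rough sketch in Python. The helper names are made up for illustration and aren't part of pgbench. The one thing it relies on is that -t counts transactions per client, so the total has to be divided across clients. Treating the run farthest from the median as the "weird one" is just one simple way to pick it out, not an established rule.

```python
import math
from statistics import median


def transactions_per_client(target_total, clients):
    """pgbench's -t is per client, so split the desired total across clients."""
    return math.ceil(target_total / clients)


def expected_seconds(target_total, tps):
    """Rough run length in seconds for a given total at a given throughput."""
    return target_total / tps


def odd_one_out(tps_runs):
    """Return (median, outlier), where outlier is the run farthest from the median.

    With three runs, the two that agree bracket the median and the
    third one stands out as the suspect result.
    """
    mid = median(tps_runs)
    outlier = max(tps_runs, key=lambda r: abs(r - mid))
    return mid, outlier
```

At 500 TPS, expected_seconds(100_000, 500) comes out to 200 seconds, a bit over three minutes, which is about the shortest run I'd trust. With a hypothetical 8 clients, transactions_per_client(100_000, 8) gives the value to pass as -t.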

