Oh, cleaning up the compiler flags is definitely on my to-do list, but I lack access to a sufficiently diverse set of targets to do much benchmarking (or much general testing) myself.
I can test some of the PP targets: Samsung YH820 and YH925 (both PP5020), Philips (PP5022 IIRC) and iPod Mini gen1. I can also test Creative ZEN, Creative ZEN XFi2 and Fuze+. However, time is the limiting factor ATM :) Just leave a note on IRC (I always read the logs) or PM me.
IMO we should use -Os for everything, and only use -O2 or -O3 if benchmarks show a tangible benefit (and the code still fits, obviously).
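For illustration, a minimal sketch of what "-Os by default, -O2 only where it pays off" could look like at function level with GCC. The macro name `HOT_O2` and both functions are made up for this example and are not taken from the tree. GCC's own documentation cautions that the `optimize` attribute is meant for debugging rather than production code, so per-file flags in the makefiles are probably the safer route; this only shows the idea.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper macro: on GCC, request -O2 for a single function
 * even when the file is built with -Os. Other compilers get a no-op. */
#if defined(__GNUC__) && !defined(__clang__)
#define HOT_O2 __attribute__((optimize("O2")))
#else
#define HOT_O2
#endif

/* Made-up hot loop standing in for codec/DSP code that a benchmark
 * has shown to benefit from the higher optimization level. */
HOT_O2 int32_t dot_product(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];
    return acc;
}

/* Everything else keeps whatever level the build passes (e.g. -Os). */
int16_t clamp16(int32_t v)
{
    if (v > INT16_MAX)
        return INT16_MAX;
    if (v < INT16_MIN)
        return INT16_MIN;
    return (int16_t)v;
}
```

Building the same file once with plain -Os and once with the override, then comparing binary size and benchmark numbers on the target, would show whether the exception is worth keeping.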
Would it be feasible to compile in Thumb mode (ARM targets only, of course)? This could give even more savings. However, it may be necessary to keep computationally intensive parts of the code, like codecs, in ARM mode. As a side note, I tried out different optimization settings on a PP5020 target with my WIP 2600 emulator; -O2 gave the best results, while -O3 resulted in a much bigger binary and was even slightly slower. So shrinking the code size had positive effects (even when I removed inlining etc.).
Sebastian
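A rough sketch of the mixed-mode idea, assuming a GCC version recent enough to support the ARM `target("arm")` function attribute and a build with `-mthumb` as the default. The macro `ARM_MODE` and the function `fixmul16` are hypothetical names for this example. One reason hot fixed-point code may want ARM mode on ARM7TDMI-class cores is that the 16-bit Thumb instruction set has no long multiply, so a 32x32->64 multiply like the one below cannot be done in a single instruction there. Whether this actually pays off would still have to be benchmarked.

```c
#include <stdint.h>

/* Hypothetical helper macro: force ARM-mode code generation for one
 * function when the rest of the build is Thumb. Expands to nothing on
 * non-ARM hosts and non-GCC compilers, so the file still compiles there. */
#if defined(__arm__) && defined(__GNUC__) && !defined(__clang__)
#define ARM_MODE __attribute__((target("arm")))
#else
#define ARM_MODE
#endif

/* Q16.16 fixed-point multiply, typical of codec inner loops.
 * Needs a 64-bit intermediate product. */
ARM_MODE int32_t fixmul16(int32_t a, int32_t b)
{
    return (int32_t)(((int64_t)a * b) >> 16);
}
```

The more conventional alternative is to pick the mode per source file in the makefiles (-mthumb for most files, -marm for the codec sources), which avoids depending on the attribute at all.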