Seriously, nobody will ever get peak performance from the single-element iterator/range pattern - it stalls the CPU!

Anytime you read one byte from a typical hard disk, the system reads a full sector - typically 512 bytes, no more, no less. Anytime you read one byte from memory, the CPU loads an entire cache line from RAM into the cache (64 bytes on all modern Intel CPUs). Why not exploit that with ranges?

Ranges could potentially return arrays from front() (or from frontVector/whatever), so basically they would become ranges of ranges where the innermost range is always a RandomAccessRange. This has obvious performance benefits: traversing contiguous memory is faster than iterating element by element with popFront(). On the other hand, plain memory lacks the flexibility of ranges - so let's make a hybrid range!
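To make the idea concrete, here is a minimal sketch (in C++ for concreteness, since the idea itself is language-agnostic; the names front/popFront/empty are borrowed from D's range API, and ChunkedRange is a hypothetical type, not an existing library):

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical "hybrid" range: front() yields a whole contiguous
// chunk instead of a single element, so the outer loop follows the
// range protocol while the inner loop runs over plain memory.
struct ChunkedRange {
    const int* data;
    std::size_t len;
    std::size_t chunkSize;

    bool empty() const { return len == 0; }

    // The innermost "range" is just a pointer + length: random access
    // over contiguous memory, which compilers optimize aggressively.
    std::pair<const int*, std::size_t> front() const {
        std::size_t n = len < chunkSize ? len : chunkSize;
        return {data, n};
    }

    void popFront() {
        std::size_t n = len < chunkSize ? len : chunkSize;
        data += n;
        len -= n;
    }
};

// Consumer: one popFront() per chunk instead of one per element;
// the inner loop is a plain array walk the compiler can vectorize.
long long sum(ChunkedRange r) {
    long long total = 0;
    for (; !r.empty(); r.popFront()) {
        auto [p, n] = r.front();
        for (std::size_t i = 0; i < n; ++i)
            total += p[i];
    }
    return total;
}
```

Note that D's std.range already ships a chunks() adapter that slices a range into fixed-size sub-ranges, which is close in spirit to the outer half of this design.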

Advantages:
* performance (!)
* limited lookahead/backtracking

How?

1. instruction-level parallelism: CPUs can execute several instructions in parallel as long as they operate on different data (different vector elements)
2. SIMD: load an entire vector into an SSE/AVX register, then run the operation on all elements at once
3. use of the L1 cache: traversing a vector in memory is way faster than calling popFront() for each vector element - probably even when popFront() is inlined
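Point 1 can be illustrated directly (again in C++ for concreteness; sumChunkILP is a hypothetical helper, not an existing API). Once a whole chunk is in hand, the summation can use several independent accumulators, so the CPU can run the additions in parallel instead of serializing on a single dependency chain - which is exactly what a one-element-per-popFront() loop forces:

```cpp
#include <cassert>
#include <cstddef>

// Four independent accumulators break the loop-carried dependency:
// a, b, c and d can all be updated in the same cycle on a
// superscalar CPU. A strict element-at-a-time range cannot offer
// the compiler this opportunity.
long long sumChunkILP(const int* p, std::size_t n) {
    long long a = 0, b = 0, c = 0, d = 0;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a += p[i];
        b += p[i + 1];
        c += p[i + 2];
        d += p[i + 3];
    }
    long long total = a + b + c + d;
    for (; i < n; ++i)  // scalar tail for the leftover elements
        total += p[i];
    return total;
}
```

The same contiguous layout is what lets an auto-vectorizer apply point 2 (SIMD) to this loop without any further help.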

Disadvantages:
* potential inconvenience (consumers have to deal with a range of ranges)

I know it can't be an easy drop-in addition to the current range implementation, but who knows...
