[Oiio-dev] Fwd: [oiio] ImageCache/ImageBuf read performance (#480)

Larry Gritz Sun, 30 Dec 2012 23:40:28 -0800

I don't usually send pull requests to oiio-dev these days, but this was was 
unusually interesting and newsworthy, so I thought it might be interesting to a 
wider audience than those who "watch" the GitHub repo in order to receive pull 
requests.


------

I spent some time carefully profiling image read performance of scanline 
OpenEXR images in OIIO. I was particularly interested in making sure that 
reading via an ImageBuf or ImageCache::get_pixels did not have too much 
overhead compared to a raw ImageInput::read_image (presumed to be "speed of 
light", the barest wrapping of the underlying library calls), and how 
autotile/autoscanline affected things.

First thing I noticed is a big flaw in my libOpenImageIO/imagespeed_test, I had 
neglected to fully flush the ImageCache between sub-tests, and that was 
throwing off my numbers, making IB and IC look rosier than they really were. I 
also augmented imagespeed_test to explicitly cover several more of the 
combinations I describe above. Here is example output from my Macbook Pro on a 
2336x1198 4-channel float OpenEXR file:

read_image speed: 0.12s
read_scanline (1 at a time) speed: 0.33s
read_scanlines (64 at a time) speed: 0.12s
ImageBuf read speed: 0.30s
ImageCache get_pixels speed: 0.28s
With autotile = 64:
ImageBuf read speed: 0.41s
ImageCache get_pixels speed: 0.39s
With autotile = 64, autoscanline = 1:
ImageBuf read speed: 0.48s
ImageCache get_pixels speed: 0.47s
Some interesting things we can say already:

Using read_scanlines in 64-scanline chunks is equally fast as a read_image 
(which internally uses read_scanlines in 256-scanline chunks). But 
read_scanline individually reading each line takes almost 3x longer! This is 
because when reading multiple scanlines at once, the OpenEXR library 
(libIlmImf) uses multiple threads to pipeline the reads and decompresses. When 
you read just one scanline, it doesn't have the opportunity to do that.

Reading the same file with either ImageBuf or IC::get_pixels is a high penalty 
-- about 2.5x slower than speed of light.

When autotile is turned on, it's even worse, and when autoscanline=1 (which is 
supposed to be an improvement, by making the "virtual tiles" the full width of 
the image), it's even slower, 4x SOL.

OK, so I carefully profiled those code paths and made a number of improvements. 
I'll give you the results first:

read_image speed: 0.12s
read_scanline (1 at a time) speed: 0.33s
read_scanlines (64 at a time) speed: 0.12s
ImageBuf read speed: 0.15s
ImageCache get_pixels speed: 0.14s
With autotile = 64:
ImageBuf read speed: 0.33s
ImageCache get_pixels speed: 0.32s
With autotile = 64, autoscanline = 1:
ImageBuf read speed: 0.18s
ImageCache get_pixels speed: 0.17s
Bottom line is that I've DOUBLED the image reading speed of using ImageBuf and 
ImageCache::get_pixels when autotile is off, they are now almost as fast as raw 
calls to ImageInput::read_scanlines or read_image. Autotile is still much 
slower, though much improved compared to before. And the combination of 
autotile and autoscanline has been nearly tripled in performance, and is now 
only incrementally slower than when autotile is off (that is, when the 
ImageCache reads the whole image as one tile).

So, how did I do it? Basically it broke down to three main improvements:

I found some places (including ImageBuf's local storage and the tile memory in 
ImageCacheTile) where I'd used std::vector as a simple memory-managed buffer. 
This turned out to be unwise, because vector zeroes out the memory when you 
resize() to allocate (only to be I immediately filled with the real data 
afterwards). On modern architectures, touching memory unnecessarily (especially 
every byte of a buffer that's bigger than L2 cache!) is one of the most 
expensive things you can do. I made a custom scoped_array template in 
imagebuf.h that does the simple memory management I was after originally. (N.B. 
When we can count on C++11 everywhere, we can use unique_ptr<>, but for now I 
just rolled my own).

The code that read scanlines for the autotile was using individual 
read_scanline calls, not read_scanlines. For OpenEXR in particular, this makes 
a big difference.

The ImageCache was rounding tile sizes up to powers of 2 (including the 
full-image-tile, when autotile was off, and the virtual tile size when autotile 
was on, and the scanline width, when autotile was on and autoscanline was on). 
This is a historical artifact from the time when TextureSystem required 
power-of-2 tile sizes. It no longer does, so I removed the rounding. Note that 
for my file, which coincidentally was 2336 x 1198 (slightly above a power of 2 
in each dimension), that was quite a lot of extra data and memory when rounded 
up, as well as causing various routines that has special fast cases for 
"contiguous stride" data layouts to end up on the slower paths for 
non-contiguous strides.

And there were a few other minor refactors with smaller effects, and some minor 
bug fixes for things that broke with these changes. Look at the individual 
commit comments if you care.

The final takeaway is:

After this is committed, you should expect much faster image read speeds if you 
are going through the ImageBuf or ImageCache interfaces, particularly for 
scanline (non-tiled) images, and especially for OpenEXR where there is an 
actual advantage for read_scanlines versus read_scanline.

If you are using autotile, you should also be using autoscanline.

You can merge this Pull Request by running:

  git pull https://github.com/lgritz/oiio lg-icperf
Or view, comment on, or merge it at:

  https://github.com/OpenImageIO/oiio/pull/480

Commit Summary

More diagnostics in imagespeed_test to pinpoint bottlenecks
scoped_array template for simple memory management of dynamically-all…
ImageBuf perf improvement -- hold local pixels in scoped_array rather…
ImageCache perf: for tile pixel mem, use scoped_array rather than std…
ImageCache perf: use scoped_array rather than vector for read_untiled…
ImageCache perf: read_untiled autotile case -- use read_scanlines ins…
ImageCache perf: refactor get_pixels to do fewer tile queries
ImageCache speedups -- remove pow2 roundups of autotile tile sizes.
oiiotool - fix minor bug where tiled files were output inappropriately.
File Changes

M src/include/imagebuf.h (30)
M src/libOpenImageIO/imagebuf.cpp (70)
M src/libOpenImageIO/imagespeed_test.cpp (83)
M src/libtexture/imagecache.cpp (62)
M src/libtexture/imagecache_pvt.h (6)
M src/oiiotool/oiiotool.cpp (16)
Patch Links

https://github.com/OpenImageIO/oiio/pull/480.patch
https://github.com/OpenImageIO/oiio/pull/480.diff
—
Reply to this email directly or view it on GitHub.



--
Larry Gritz
[email protected]

_______________________________________________
Oiio-dev mailing list
[email protected]
http://lists.openimageio.org/listinfo.cgi/oiio-dev-openimageio.org

[Oiio-dev] Fwd: [oiio] ImageCache/ImageBuf read performance (#480)

Reply via email to