I don't usually send pull requests to oiio-dev these days, but this one was unusually interesting and newsworthy, so I thought it might interest a wider audience than those who "watch" the GitHub repo in order to receive pull requests.
------

I spent some time carefully profiling read performance for scanline OpenEXR images in OIIO. I was particularly interested in making sure that reading via an ImageBuf or ImageCache::get_pixels did not have too much overhead compared to a raw ImageInput::read_image (presumed to be "speed of light", the barest wrapping of the underlying library calls), and in how autotile/autoscanline affected things.

The first thing I noticed was a big flaw in my libOpenImageIO/imagespeed_test: I had neglected to fully flush the ImageCache between sub-tests, and that was throwing off my numbers, making IB and IC look rosier than they really were. I also augmented imagespeed_test to explicitly cover several more of the combinations I describe above.

Here is example output from my MacBook Pro on a 2336x1198 4-channel float OpenEXR file:

    read_image speed: 0.12s
    read_scanline (1 at a time) speed: 0.33s
    read_scanlines (64 at a time) speed: 0.12s
    ImageBuf read speed: 0.30s
    ImageCache get_pixels speed: 0.28s
    With autotile = 64:
    ImageBuf read speed: 0.41s
    ImageCache get_pixels speed: 0.39s
    With autotile = 64, autoscanline = 1:
    ImageBuf read speed: 0.48s
    ImageCache get_pixels speed: 0.47s

Some interesting things we can say already:

* Using read_scanlines in 64-scanline chunks is just as fast as read_image (which internally uses read_scanlines in 256-scanline chunks). But reading each line individually with read_scanline takes almost 3x longer! This is because when reading multiple scanlines at once, the OpenEXR library (libIlmImf) uses multiple threads to pipeline the reads and decompression. When you read just one scanline, it doesn't have the opportunity to do that.

* Reading the same file with either ImageBuf or IC::get_pixels carries a high penalty -- about 2.5x slower than speed of light.

* When autotile is turned on, it's even worse, and when autoscanline=1 (which is supposed to be an improvement, by making the "virtual tiles" the full width of the image), it's slower still, 4x SOL.
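To make the chunking point in the first bullet concrete, here is a minimal sketch of walking an image in fixed-size scanline batches. It does not call OpenImageIO: read_in_chunks and the ReadRows callback are hypothetical names, with the callback standing in for a batched read such as ImageInput::read_scanlines, and the sketch assumes chunk is greater than zero.

```cpp
#include <algorithm>
#include <functional>

// Hypothetical callback standing in for a batched scanline read: it is asked
// to read rows [ybegin, yend). This is NOT an OpenImageIO function.
using ReadRows = std::function<void(int ybegin, int yend)>;

// Walk an image of `height` scanlines in batches of at most `chunk` rows
// (assumes chunk > 0) and return how many read calls were issued.
int read_in_chunks(int height, int chunk, const ReadRows& read_rows)
{
    int calls = 0;
    for (int y = 0; y < height; y += chunk) {
        // The last batch may be shorter than `chunk`.
        int yend = std::min(y + chunk, height);
        read_rows(y, yend);
        ++calls;
    }
    return calls;
}
```

For the 1198-scanline test image, a chunk of 1 issues 1198 calls while a chunk of 64 issues 19, so the underlying library gets far fewer, larger requests that it can spread across threads.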
OK, so I carefully profiled those code paths and made a number of improvements. I'll give you the results first:

    read_image speed: 0.12s
    read_scanline (1 at a time) speed: 0.33s
    read_scanlines (64 at a time) speed: 0.12s
    ImageBuf read speed: 0.15s
    ImageCache get_pixels speed: 0.14s
    With autotile = 64:
    ImageBuf read speed: 0.33s
    ImageCache get_pixels speed: 0.32s
    With autotile = 64, autoscanline = 1:
    ImageBuf read speed: 0.18s
    ImageCache get_pixels speed: 0.17s

The bottom line is that I've DOUBLED the image reading speed of ImageBuf and ImageCache::get_pixels when autotile is off; they are now almost as fast as raw calls to ImageInput::read_scanlines or read_image. Autotile alone is still much slower, though improved compared to before (0.41s down to 0.33s for ImageBuf). And the combination of autotile and autoscanline has been nearly tripled in performance, and is now only incrementally slower than when autotile is off (that is, when the ImageCache reads the whole image as one tile).

So, how did I do it? Basically it broke down to three main improvements:

* I found some places (including ImageBuf's local storage and the tile memory in ImageCacheTile) where I'd used std::vector as a simple memory-managed buffer. This turned out to be unwise, because vector zeroes out the memory when you resize() to allocate (only for it to be immediately overwritten with the real data afterwards). On modern architectures, touching memory unnecessarily (especially every byte of a buffer that's bigger than L2 cache!) is one of the most expensive things you can do. I made a custom scoped_array template in imagebuf.h that does the simple memory management I was after originally. (N.B. When we can count on C++11 everywhere, we can use unique_ptr<>, but for now I just rolled my own.)

* The code that read scanlines for the autotile case was using individual read_scanline calls, not read_scanlines. For OpenEXR in particular, this makes a big difference.

* The ImageCache was rounding tile sizes up to powers of 2 (including the full-image tile when autotile was off, the virtual tile size when autotile was on, and the scanline width when both autotile and autoscanline were on). This is a historical artifact from the time when TextureSystem required power-of-2 tile sizes. It no longer does, so I removed the rounding. Note that for my file, which coincidentally was 2336 x 1198 (slightly above a power of 2 in each dimension), that was quite a lot of extra data and memory when rounded up, and it also caused various routines that have special fast cases for "contiguous stride" data layouts to end up on the slower paths for non-contiguous strides.

And there were a few other minor refactors with smaller effects, and some minor bug fixes for things that broke with these changes. Look at the individual commit comments if you care.

The final takeaway is: After this is committed, you should expect much faster image read speeds if you are going through the ImageBuf or ImageCache interfaces, particularly for scanline (non-tiled) images, and especially for OpenEXR, where there is an actual advantage for read_scanlines versus read_scanline. If you are using autotile, you should also be using autoscanline.

You can merge this Pull Request by running:

    git pull https://github.com/lgritz/oiio lg-icperf

Or view, comment on, or merge it at:

    https://github.com/OpenImageIO/oiio/pull/480

Commit Summary

* More diagnostics in imagespeed_test to pinpoint bottlenecks
* scoped_array template for simple memory management of dynamically-all…
* ImageBuf perf improvement -- hold local pixels in scoped_array rather…
* ImageCache perf: for tile pixel mem, use scoped_array rather than std…
* ImageCache perf: use scoped_array rather than vector for read_untiled…
* ImageCache perf: read_untiled autotile case -- use read_scanlines ins…
* ImageCache perf: refactor get_pixels to do fewer tile queries
* ImageCache speedups -- remove pow2 roundups of autotile tile sizes.
* oiiotool - fix minor bug where tiled files were output inappropriately.

File Changes

* M src/include/imagebuf.h (30)
* M src/libOpenImageIO/imagebuf.cpp (70)
* M src/libOpenImageIO/imagespeed_test.cpp (83)
* M src/libtexture/imagecache.cpp (62)
* M src/libtexture/imagecache_pvt.h (6)
* M src/oiiotool/oiiotool.cpp (16)

Patch Links

* https://github.com/OpenImageIO/oiio/pull/480.patch
* https://github.com/OpenImageIO/oiio/pull/480.diff

--
Larry Gritz
[email protected]
_______________________________________________ Oiio-dev mailing list [email protected] http://lists.openimageio.org/listinfo.cgi/oiio-dev-openimageio.org
