This is, I hope, the first installment of a regular feature in which I 
explicate an OIIO scenario that seems particularly interesting or helpful for a 
wider audience. Enjoy.

[ Note: this one is much longer and more detailed than I'll generally aim for; 
but the topic was very rich to explore. ]


An OIIO user recently needed to generate three different reduced-resolution 
preview images from each source image, using oiiotool. This had to be done for 
each of a huge number of source images, so performance was important, and they 
also wanted to keep memory usage low in their server-side app.

The source images were very large greyscale TIFFs with an odd aspect ratio. 
We'll reproduce the test case like this:

    oiiotool -pattern checker  2272x152780 1 -d uint8 -o big.tif

That's an admittedly big and oddly-shaped file. Nonetheless, the goal was to 
produce three successive resizes, with the longest side being 1024, 384, and 
128 pixels, saved as JPEG files. (Ours is not to reason why, ours is but to do 
and die.)

Baseline:

So the naive approach is:

oiiotool big.tif -resize 0x1024 -o 1024.jpg
oiiotool big.tif -resize 0x384 -o 384.jpg
oiiotool big.tif -resize 0x128 -o 128.jpg

These three commands take a total of 10m40s (on my 2015 MacBook Pro, quad core) 
and use a peak of 1.4 GB of memory. Yikes!

Aside: note the -resize 0x1024 ... when one dimension of a resize is given as 
0, oiiotool chooses the value that preserves the original aspect ratio, given 
the constraint of the dimension you did specify.
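To make that concrete, here is a quick sketch of the implied arithmetic for our 
2272x152780 source (the rounding here is illustrative; oiiotool's exact 
rounding choice may differ by a pixel):

```python
# Compute the aspect-preserving dimension when one side is given as 0.
# (Rounding is illustrative; oiiotool's internal rounding may differ slightly.)
src_w, src_h = 2272, 152780

def fit_height(target_h):
    # Scale the width to preserve aspect ratio for the requested height.
    return round(src_w * target_h / src_h), target_h

for h in (1024, 384, 128):
    print(fit_height(h))   # -> (15, 1024), (6, 384), (2, 128)
```

So "longest side 1024" for this skinny image really means a 15x1024 output.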

Step 1: Use successive resizes in the same oiiotool command line

The main reason it's so expensive is the extreme resize (152k down to 1k, or 
even 152k down to 128, vertically): each output pixel needs to sample an absurd 
number of pixels in the source image. We can save the time of the latter two 
resizes by successively resizing from the lower-res images as we go (i.e., 
resize the 1024 image to 384, then the 384 to 128).
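Some back-of-the-envelope numbers illustrate why (a sketch that counts only the 
vertical footprint, assuming an idealized box-like filter whose support scales 
with the reduction factor):

```python
# Rough source rows touched per output row for a vertical downscale:
# with a box-like filter, each output row averages about
# (src_height / dst_height) source rows.
src_h = 152780

# Resizing the original directly to each target height:
direct = [src_h / h for h in (1024, 384, 128)]
# e.g. 152780/128 ~ 1194 source rows touched per output row

# Resizing successively: 152780 -> 1024 -> 384 -> 128
successive = [152780 / 1024, 1024 / 384, 384 / 128]
# The first step still costs ~149 rows per output row, but the
# later steps shrink to only ~2.7 and 3 rows each.
print(direct, successive)
```

The expensive 152k-row reduction now happens once instead of three times.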

There's also no reason to use three separate invocations of oiiotool. Because 
oiiotool processes its commands strictly from left to right, it's fine to 
include multiple -o outputs, interspersed among the other commands; each one 
writes the result as it stands at that point in the command sequence.

oiiotool big.tif -resize 0x1024 -o 1024.jpg -resize 0x384 -o 384.jpg -resize 
0x128 -o 128.jpg

Time:  1:50  Peak memory:   1.4GB
Improvement so far:   5.8x speed, 1x memory

Step 2: Speed up the expensive resize with a cheaper filter

The slow resize is exacerbated by the fact that the default downsizing filter 
is "lanczos3", which is quite wide. That's great for high-quality resizes at 
reasonable scale factors. But in this case, the resize is so extreme that I 
didn't think anybody would notice the fine points of filtering quality, and 
these are intended as low-res previews, not "final quality" delivery images. So 
I hypothesized that a box filter would be faster and look fine for this 
purpose.
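A rough sketch of why the filter choice matters so much here (assuming, as most 
resamplers do for minification, that the filter's support is scaled up by the 
reduction factor; lanczos3 has a nominal support width of 6 pixels versus 1 for 
a box):

```python
# Approximate source taps per output pixel for the vertical downscale,
# assuming filter support is scaled by the reduction factor (minification).
src_h, dst_h = 152780, 1024
scale = src_h / dst_h              # ~149x vertical reduction

support = {"box": 1.0, "lanczos3": 6.0}   # nominal filter widths in pixels
taps = {name: w * scale for name, w in support.items()}
print(taps)   # box: ~149 taps/pixel, lanczos3: ~895 taps/pixel (per axis)
```

Roughly 6x fewer taps per axis, and for a separable 2D resize the savings apply 
on each axis, before even considering that the box weights are trivial to 
compute compared to windowed-sinc values.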

oiiotool big.tif -resize:filter=box 0x1024 -o 1024.jpg -resize:filter=box 0x384 
-o 384.jpg -resize:filter=box 0x128 -o 128.jpg

(excuse the line wrap, that is a single command)

Time:  4s  Peak memory: 1.4GB
Improvement so far:   160x speed, 1.0x memory

Aside: stats show that of the 4s it now takes, 1.9s was file I/O, so we're 
already at the point where even if we could make the "resize" be infinitely 
fast, we could squeeze out no more than an additional factor of 2 in speed.
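That factor-of-2 bound is just Amdahl's law; a quick sanity check of the 
arithmetic:

```python
# Upper bound on further speedup if only the non-I/O portion improves.
total, io = 4.0, 1.9     # seconds: total runtime, file I/O portion
bound = total / io       # speedup limit even if resizing were free
print(round(bound, 2))   # -> 2.11
```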

Step 3: Use -native to prevent expansion to float internally

By default, oiiotool converts all images to float pixels internally, so that 
any math you ask it to do will be at full precision. Also, this is usually the 
fastest option, since it converts to float just once per pixel, whereas if it 
kept it in uint8, say, it might end up converting to float and back for every 
individual math op in the resize. But what the heck, let's see what the time vs 
memory tradeoff is in this case, where a bit of accuracy loss is probably 
acceptable for a thumbnail preview image. We'll use the --native option:

oiiotool --native big.tif -resize:filter=box 0x1024 -o 1024.jpg 
-resize:filter=box 0x384 -o 384.jpg -resize:filter=box 0x128 -o 128.jpg

Time: 3.3s  Peak memory: 361MB
Improvement so far:  194x speed, 4x memory
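The roughly 4x memory reduction squares with simple arithmetic (a sketch of raw 
pixel storage only; the actual peak includes overhead beyond the pixel data):

```python
# Raw pixel storage for the 2272x152780 single-channel source image.
pixels = 2272 * 152780             # ~347 million pixels
as_float = pixels * 4 / 2**20      # 4 bytes/pixel as float -> MB
as_uint8 = pixels * 1 / 2**20      # 1 byte/pixel as uint8  -> MB
print(round(as_float), round(as_uint8))   # -> 1324 331
```

Those figures line up well with the observed 1.4 GB and 361 MB peaks.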

Aside: I said that I expected maximum speed to be when the internal 
representation was float. Why is it faster now? I assume that by reducing the 
size of the image in memory, more could fit into processor cache at any given 
time, so we have probably made the performance somewhat less bottlenecked on 
RAM speed. For more reasonable images where the working set can fit into cache 
and the math itself is dominating the time, I do expect the up-front float 
conversion to be faster as well as more accurate.

Step 4: Restrict the ImageCache size

The way oiiotool works, small images are read directly into RAM whole, but big 
images (like this one) are backed by ImageCache, which tries to limit memory 
size. But oiiotool's default is to let the ImageCache use up to 4GB. Perhaps we 
can squeeze down the working size of the cache even further, without 
sacrificing much time? Let's use the --cache argument (the argument is size in 
MB, with a default of 4096):

oiiotool --cache 50 --native big.tif -resize:filter=box 0x1024 -o 1024.jpg 
-resize:filter=box 0x384 -o 384.jpg -resize:filter=box 0x128 -o 128.jpg

Time: 3.25s  Peak memory: 77MB

We got lucky again -- reducing the cache size didn't seem to hurt performance 
at all. It would if we had somewhat more random or repeated access to the big 
source image. But in this case, we are doing just one resize (of the original 
big input image), and it's reasonably coherent in its access pattern -- the 
"working set" at any given time didn't exceed the size of the ImageCache, so 
there were no redundant reads from the files.

The reason capping the memory footprint was so important is that the number of 
processes they could run simultaneously on each physical server was limited by 
memory (swapping would kill performance). This reduction in memory allowed them 
to run this same process in 28 threads on the same server without swapping, 
doubling the number of threads they could deploy, and thus gaining another 2x 
in total throughput for processing their whole database.
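To see why the footprint dominates throughput, here is a sketch with a 
hypothetical 32 GB server (the actual RAM budget of their servers wasn't 
stated):

```python
# How many concurrent jobs fit in RAM without swapping, at each peak
# footprint from the steps above? (32 GB is a hypothetical server size.)
ram_mb = 32 * 1024
for label, peak_mb in [("baseline", 1400), ("step 3", 361), ("step 4", 77)]:
    print(label, ram_mb // peak_mb)
```

Every reduction in per-job peak memory translates directly into more jobs per 
box, which is often worth more than shaving another few percent off a single 
job's runtime.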

Final results:

Time: 3.25s  Peak memory: 77MB
Total improvement:   194x speed, 20x memory

Moral: Power users, make sure you know what every obscure oiiotool command line 
option does!

--
Larry Gritz
[email protected]


_______________________________________________
Oiio-dev mailing list
[email protected]
http://lists.openimageio.org/listinfo.cgi/oiio-dev-openimageio.org
