As with any general tool that does many things, trying to make the hardest 
cases work and the average case perform well, a specialized tool that does 
exactly one thing and no more is likely to be more efficient, especially on 
straightforward use cases.

Without trying it out myself, just going by what you've written in the email, I 
can think of a few things that might be very different between oiiotool and 
your python script:

1. Your python example parallelizes across the frame range -- meaning that each 
parallel task is truly independent and should have no locking or interference 
at all (though at the expense of possibly using a lot of memory, since many 
ImageBufs will be active at once). But oiiotool handles the frames serially 
and tries to parallelize the work within each frame. Since all the threads are 
working on the same image, they will probably interfere with each other 
frequently, most severely because the threads share a single underlying 
OpenEXR file for input and also for output.

The simplest way to test this hypothesis is with a few other timing tests: (a) 
compare to python when you don't use the python thread pool (i.e., handle the 
files serially) but also don't set the "threads" attribute, so OIIO threads 
within the various operations you're doing; (b) compare python and oiiotool 
for cropping just one file; (c) try both for just one file, with just one 
thread (set "threads" to 1 in python, and use --threads 1 for oiiotool).
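
For instance, test (a) might look something like this in python -- just a 
sketch based on your script, with the thread pool removed and no cap on OIIO's 
own threads (paths and the ROI are the ones from your example):

```python
import os

def out_path(src_path, out_dir="D:/tmp/exr_crop"):
    # Same destination naming as the original script.
    return "{0}/{1}".format(out_dir, os.path.basename(src_path))

def crop_serial(paths):
    # OpenImageIO imported lazily so the sketch reads/tests without it installed.
    import OpenImageIO as oiio
    # Note: deliberately NOT calling oiio.attribute("threads", 1), so OIIO
    # is free to thread internally within each crop.
    for path in paths:  # serial: one frame at a time, no python thread pool
        im = oiio.ImageBuf(path)
        new_im = oiio.ImageBufAlgo.crop(im, oiio.ROI(1000, 7640, 0, 5760))
        new_im.write(out_path(path))
```

Timing that against your pooled version isolates the effect of frame-level 
parallelism from everything else.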

2. If using OpenEXR >= 3.1, there's also the option to turn on the use of the 
"exrcore" library, which is off by default but, when enabled, results in a lot 
less locking when multiple threads are reading from the same ImageInput. You 
can enable this in oiiotool with "--oiioattrib openexr:core 1". This will soon 
be the default, but only after the next OpenEXR release that fixes a 
limitation where exrcore was broken for certain compression types. But even 
when enabled, it won't currently help the case of multiple cores wanting to 
write to a single file.
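
From python, the equivalent switch (if I have the attribute name right -- it 
should be the same global attribute the --oiioattrib flag sets) would be:

```python
def enable_exrcore():
    # OpenImageIO imported lazily so the sketch is readable without it installed.
    import OpenImageIO as oiio
    # Global OIIO attribute; only takes effect if OIIO was built against
    # OpenEXR >= 3.1. Attribute name per my understanding of current OIIO.
    oiio.attribute("openexr:core", 1)
```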

3. Your python script by default reads the files immediately into an ImageBuf, 
whereas oiiotool does so only for small files; for big files (which I assume 
anything 6k or bigger, as in your example, certainly is), it falls back on an 
underlying ImageCache to read parts of the image on demand (which is helpful 
if the images are really huge, or if it turns out you only need part of them). 
When you need the whole image and it could fit comfortably in memory, this is 
obviously slower, since it breaks the I/O into smaller chunks, and there is 
also the additional overhead of using the cache versus having the whole image 
in one big flat buffer.

4. Also, oiiotool by default reads all images into 32 bit float buffers and 
caches for its in-memory representation, in order to maximally preserve the 
precision of whatever operations you do and also to speed up any complex math 
(doing math on a whole float buffer is faster than converting to float, doing 
a single math op, then converting back on a pixel-by-pixel basis). But... the 
thing you're doing is simply reading the buffer in, cropping (either padding 
with black or trimming pixels away), and writing it out again. The pixel 
values are just being copied; there is no "math", so the speed and precision 
advantages of float buffers are irrelevant for this particular case. You are 
just uselessly paying the overhead of converting to and from float (your 
source exr file is likely "half") and slogging 2x more data through memory. 
Needless to say, your python script is not trying to be clever in this way; it 
just accepts whatever the native data type is.
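
To put rough numbers on that 2x -- assuming 4 channels (RGBA is a guess, your 
channel count may differ), one cropped 6640x5760 frame weighs in at:

```python
# Back-of-envelope data sizes for one 6640x5760 frame, 4 channels assumed.
width, height, channels = 6640, 5760, 4
half_bytes = width * height * channels * 2    # 16-bit half, as stored in the file
float_bytes = width * height * channels * 4   # 32-bit float, oiiotool's default buffer
print("half: %d MB, float: %d MB" % (half_bytes >> 20, float_bytes >> 20))
# -> half: 291 MB, float: 583 MB
```

That's an extra ~290 MB per frame to allocate, fill, and copy, for no benefit 
when the pixels are only being moved around.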

One way to test the effects of #3 and #4 is to change the input from simply 
naming the file to using the -i command explicitly, with some modifiers:

    -i:now=1:native=1 path/to/src.3935-3954%06d.exr

The now=1 bypasses the ImageCache and eschews any lazy reading, and just reads 
the whole input image into memory (i.e., an ImageBuf) right then. And native=1 
means "don't convert to float, just read it into an ImageBuf as whatever data 
type was in the file."

I'm very curious to see a comparison between

(a) oiiotool cropping a single image when you use -i:now=1:native=1, and (b) 
your python script operating on a single image (but don't bother setting 
threads=1). I bet those two times are going to be a lot more similar.


TL;DR after rereading everything I just wrote: it would not surprise me at all 
if almost all the runtime in the oiiotool case was simply the serialization of 
writing all of the output exr frames one by one, whereas the python case is 
explicitly parallelizing this operation by handling many frames at once, 
independently, and that even all the other factors I hypothesize about are 
relatively minor in comparison.


Food for thought about oiiotool and future enhancements:

* Is reliance on ImageCache for anything but fairly small images the right 
default? Should the size thresholds for ImageCache to kick in be much larger, 
or happen only if you explicitly ask for it?

* Should it scan the command line before doing any work and, if all the 
operations involve only pixel copies (no math per se), automatically use 
native=1 for inputs? That is, should the promotion to float buffers internally 
happen only in the cases where the command line implies that there is a 
precision or speed advantage to doing so? (Down side: could we screw this up 
and miss cases where we should have promoted? Could we get results or perf 
that differ significantly just because one command in a long sequence changes 
slightly, leading to counter-intuitive behavior that is hard for users to 
reason about?)

* When file sequence wildcards are used, should oiiotool automatically try to 
parallelize across the file sequence rather than within each file operation? 
Are there oiiotool command lines people use that operate on file sequences but 
have iteration-to-iteration data dependencies that would give wrong results if 
the sequence wasn't processed serially in order? (I suppose it could be an 
option users can explicitly set for whether to serialize or parallelize file 
sequence operations.)

* Possibly simpler than the last item: What if we simply made -o asynchronous 
when it's the last operation on the command line? That is, -o puts the whole 
output task on the thread queue to do its thing, while the main thread moves on 
to the next iteration of the file sequence, thus allowing the output step 
(which is probably the most expensive part, as well as the hardest to 
parallelize) to overlap in time with the work on the next input file?
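
The scheduling idea, sketched in python with placeholder read/crop/write 
callables (none of these are real OIIO calls -- it's just the overlap pattern):

```python
from concurrent.futures import ThreadPoolExecutor

def process_sequence(frames, read, crop, write):
    # One background worker for output: frame N's write overlaps in time with
    # reading/cropping frame N+1, but never more than one write in flight.
    with ThreadPoolExecutor(max_workers=1) as writer:
        pending = None
        for frame in frames:
            img = crop(read(frame))            # main thread: input + crop
            if pending is not None:
                pending.result()               # wait for the previous write
            pending = writer.submit(write, frame, img)
        if pending is not None:
            pending.result()                   # drain the final write
```

With output being the expensive step, this alone could hide most of each 
frame's read/crop time behind the previous frame's write.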


> On May 8, 2023, at 2:41 AM, Simon Björk <bjork.si...@gmail.com> wrote:
> 
> I'm trying to crop a sequence of (8k) exr files and it seems like oiiotool is 
> quite a bit slower than using the python bindings and mulitprocessing.
> 
> Is this expected? I was under the assumption that it's always better/faster 
> to use oiiotool if possible. I've tried changing the --threads argument but 
> the results are the same. I'm on Windows.
> 
> oiiotool path/to/src.3935-3954%06d.exr --crop 6640x5760+1000+0 --runstats -o 
> path/to/dst.3935-3954%06d.exr
> --------------------------------
> Time: 59 seconds
> import sys
> import os
> import time
> from multiprocessing.dummy import Pool as ThreadPool
> 
> import OpenImageIO as oiio
> 
> oiio.attribute("threads", 1)
> 
> def crop(path):
>     im = oiio.ImageBuf(path)
>     new_im = oiio.ImageBufAlgo.crop(im, oiio.ROI(1000, 7640, 0, 5760))
>     new_filepath = "{0}/{1}".format("D:/tmp/exr_crop", os.path.basename(path))
>     new_im.write(new_filepath)
> 
> dir_path = "D:/tmp/exr_src"
> files  = ["{0}/{1}".format(dir_path, x) for x in os.listdir(dir_path)]
> 
> st = time.time()
> 
> pool = ThreadPool(48)
> results = pool.map(crop, files)
> 
> et = time.time()
> 
> print("Total time: {0}".format(et-st))
> 
> --------------------------------
> Time: 6.9 seconds
> 
> /Simon
> 
> 
> 
> -------------------------------
> Simon Björk
> Compositor/TD
> 
> +46 (0)70-2859503
> www.bjorkvisuals.com 
> <http://www.bjorkvisuals.com/>
> _______________________________________________
> Oiio-dev mailing list
> Oiio-dev@lists.openimageio.org
> http://lists.openimageio.org/listinfo.cgi/oiio-dev-openimageio.org

--
Larry Gritz
l...@larrygritz.com




