On my Mac, I noticed a major performance scaling issue when reading lots of
invalid textures. If all my threads are rapidly doing texture lookups, the
lookup time becomes several hundred times slower. This, for instance, made
the parallel displacement of a mesh on my 4-core Mac go from 1.6s, when all
the displacement textures were valid, to 8.5s when they were all invalid (I
used an invalid texture searchpath). This is due to a mutex in the
std::locale constructor, which runs every time a std::ostringstream is
constructed; std::ostringstream in turn is used by the "tinyformat" library
that OIIO uses for formatting error messages. I understand that this offers
type-safety benefits over sprintf, plus localization, but I don't think
it's worth this overhead. Modern compilers will already do the type
checking for you:
clang:

    test.cpp:84:17: warning: format specifies type 'char *' but the
        argument has type 'int' [-Wformat]
        sprintf(buf, "%s", 42);
                      ~~   ^~
                      %d

g++:

    test.cpp: In function 'int main(int, char**)':
    test.cpp:84: warning: format '%s' expects type 'char*', but argument 3
        has type 'int'
So the type checking argument seems to be a moot point. As for
localization, I don't think that's really an issue for error messages, and
even if it were, I don't think it's worth the performance overhead.
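For OIIO's own formatting entry points, which aren't literally printf, the
same compile-time checking is available via the gcc/clang format attribute.
A minimal sketch (the macro and function name are illustrative, not OIIO's
actual API):

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// Ask gcc/clang to type-check printf-style arguments at compile time.
// fmt_idx is the 1-based position of the format string; arg_idx is the
// position of the first variadic argument.
#if defined(__GNUC__)
#  define PRINTF_CHECK(fmt_idx, arg_idx) \
       __attribute__((format(printf, fmt_idx, arg_idx)))
#else
#  define PRINTF_CHECK(fmt_idx, arg_idx)
#endif

PRINTF_CHECK(1, 2)
std::string format_error(const char* fmt, ...)
{
    char buf[1024];
    va_list ap;
    va_start(ap, fmt);
    vsnprintf(buf, sizeof(buf), fmt, ap);
    va_end(ap);
    return std::string(buf);
}
```

With this, a call like format_error("%s", 42) draws the same -Wformat
warning shown above, so dropping tinyformat need not mean dropping type
checking.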
libstdc++ on Linux has partially fixed this regression (
https://gcc.gnu.org/bugzilla/show_bug.cgi?id=40088), so that instead of
being a few hundred times slower, it's only tens of times slower, but there
is still a slowdown -- and if the global locale has been modified, it
reverts to the old slow path and performs just as badly as on OS X.
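If ripping tinyformat out wholesale is too invasive, one narrower
workaround (a sketch, not OIIO code) is to stop constructing a fresh
std::ostringstream per message, since it's the construction that copies the
global locale under the mutex:

```cpp
#include <sstream>
#include <string>

// Reuse one stream per thread: the locale is copied once, at the thread's
// first call, instead of on every formatted message. This only sidesteps
// the per-construction cost; the std::locale mutex itself is untouched.
std::string format_hello(int i)
{
    thread_local std::ostringstream os;
    os.str("");   // drop previous contents
    os.clear();   // reset any error flags
    os << "hello " << i;
    return os.str();
}
```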
Reproducing this is easy. On a Mac, you just need to have all your threads
repeatedly doing texture lookups on textures that don't exist. The
slowdown should be obvious. For fun, I've attached a little benchmark that
shows this slowdown more directly. On my Mac I see:
% a.out 1 10000 (this is using 1 thread)
thread 0: delay 0.008989 (time for doing 10000 ostringstream calls)
thread 0: delay 0.001502 (time for doing 10000 snprintf calls)
% a.out 8 10000 (this is using 8 threads)
thread 0: delay 0.919378 (time for doing 10000 ostringstream calls)
thread 1: delay 0.919375
thread 2: delay 0.919330
thread 3: delay 0.919279
thread 4: delay 0.919367
thread 5: delay 0.919310
thread 6: delay 0.919242
thread 7: delay 0.919289
thread 0: delay 0.003132 (time for doing 10000 snprintf calls)
thread 1: delay 0.002838
thread 2: delay 0.003022
thread 3: delay 0.002828
thread 4: delay 0.003021
thread 5: delay 0.002877
thread 6: delay 0.003108
thread 7: delay 0.003004
Notice that going from 1 thread to 8 threads, the per-thread snprintf time
barely increases, but the ostringstream time is 100x slower than before
and 300x slower than snprintf. With more cores, the slowdown would be even
greater.
What I recommend is getting rid of tinyformat and going back to the much
faster and simpler sprintf-style formatting.
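As a sketch of what that could look like (names are illustrative, not a
proposed OIIO API), a two-pass vsnprintf wrapper keeps the convenience of
returning a std::string without any fixed-size buffer:

```cpp
#include <cstdarg>
#include <cstdio>
#include <string>

// sprintf-style formatting into a std::string. The first vsnprintf call
// measures the required length; the second writes into an exactly-sized
// buffer, so nothing can overflow.
#if defined(__GNUC__)
__attribute__((format(printf, 1, 2)))
#endif
std::string strprintf(const char* fmt, ...)
{
    va_list ap, ap2;
    va_start(ap, fmt);
    va_copy(ap2, ap);
    int n = vsnprintf(NULL, 0, fmt, ap);       // pass 1: measure
    va_end(ap);
    std::string s(n > 0 ? n : 0, '\0');
    vsnprintf(&s[0], s.size() + 1, fmt, ap2);  // pass 2: write
    va_end(ap2);
    return s;
}
```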
Thiago
#include <unistd.h>
#include <sys/time.h>
#include <pthread.h>   // needed for pthread_create/pthread_join
#include <iomanip>
#include <iostream>
#include <sstream>
#include <ostream>
#include <string>
#include <vector>
#include <stdio.h>
#include <stdlib.h>

struct timestat
{
    pthread_t pid;
    int n;
    timeval start, stop;
    std::string str;

    timestat() : pid(0), n(0) { const timeval c = { 0, 0 }; start = stop = c; }
};

// Print a timeval as seconds.microseconds.
std::ostream& operator<<(std::ostream& os, const timeval &tv)
{
    using namespace std;
    os << tv.tv_sec << '.' << setw(6) << setfill('0') << tv.tv_usec;
    return os;
}

std::ostream& operator<<(std::ostream& os, const timestat &ti)
{
    os << ti.pid << ' ' << ti.n << ' ' << ti.start << ' ' << ti.stop;
    return os;
}

// Format n messages with std::ostringstream; each iteration constructs a
// stream, which copies the global locale.
void* func(void *param)
{
    timestat* info = reinterpret_cast<timestat*>(param);
    gettimeofday(&info->start, 0);
    for (int i = 0; i < info->n; ++i)
    {
        std::ostringstream os;
        os << "hello " << i;
        info->str = os.str();
    }
    gettimeofday(&info->stop, 0);
    return param;
}

// Format n messages with snprintf.
void* func2(void *param)
{
    timestat* info = reinterpret_cast<timestat*>(param);
    gettimeofday(&info->start, 0);
    for (int i = 0; i < info->n; ++i)
    {
        char buf[255];
        snprintf(buf, sizeof(buf), "hello %d", i);
        info->str = buf;
    }
    gettimeofday(&info->stop, 0);
    return param;
}

// a - b, with carry from the microseconds field.
timeval diff(const timeval &a, const timeval &b)
{
    timeval c = { a.tv_sec - b.tv_sec, a.tv_usec - b.tv_usec };
    if (c.tv_usec < 0)
    {
        --c.tv_sec;
        c.tv_usec += 1000000;
    }
    return c;
}

int main(int argc, char *argv[])
{
    if (argc < 3)
    {
        fprintf(stderr, "usage: %s nthreads niters\n", argv[0]);
        return 1;
    }
    int nthreads = atoi(argv[1]);
    int niters = atoi(argv[2]);
    std::vector<timestat> ti(nthreads);

    // Round 1: ostringstream formatting.
    for (int i = 0; i < nthreads; ++i)
    {
        ti[i].n = niters;
        //pthread_create(&ti[i].pid, NULL, (i % 2 ? func2 : func), &ti[i]);
        pthread_create(&ti[i].pid, NULL, func, &ti[i]);
    }
    for (int i = 0; i < nthreads; ++i)
        pthread_join(ti[i].pid, NULL);
    for (int i = 0; i < nthreads; ++i)
    {
        timeval dt = diff(ti[i].stop, ti[i].start);
        std::cout << "thread " << i << ": delay " << dt << std::endl;
    }

    // Round 2: snprintf formatting.
    for (int i = 0; i < nthreads; ++i)
    {
        ti[i].n = niters;
        pthread_create(&ti[i].pid, NULL, func2, &ti[i]);
    }
    for (int i = 0; i < nthreads; ++i)
        pthread_join(ti[i].pid, NULL);
    for (int i = 0; i < nthreads; ++i)
    {
        timeval dt = diff(ti[i].stop, ti[i].start);
        std::cout << "thread " << i << ": delay " << dt << std::endl;
    }
    return 0;
}
_______________________________________________
Oiio-dev mailing list
[email protected]
http://lists.openimageio.org/listinfo.cgi/oiio-dev-openimageio.org