On 06/15/2011 03:45 PM, Rich Felker wrote: > On Wed, Jun 15, 2011 at 03:28:16PM +0300, Timo Teräs wrote: >> On 06/15/2011 03:14 PM, Rich Felker wrote: >>> On Wed, Jun 15, 2011 at 01:49:30PM +0300, Timo Teräs wrote: >>>> Hi all, >>>> >>>> As recently discussed on the 'Significant performance problems with >>>> modprobe', the config_* and xmalloc_fget* API has some performance issues: >>>> 1. It uses locking getc >>>> 2. It does malloc/realloc/free for each line read >>>> >>>> Now (1) is relatively simply fixable with getdelim(), getline() or >>>> getc_unlocked(). And even get_line_from_file.c has comments that fixing >>>> of (1) should about double the speed. >>> >>> The difference between getc_unlocked and getc should be fairly small. >>> If getc_unlocked is a macro, probably at best 2.5-3x, and if it's a >>> function, probably at best 1.5-2x. getline/getdelim would make a much >>> bigger difference. >> >> What makes you think so? >> >> E.g. uclibc getline and getdelim implementations internally use >> __GETC_UNLOCKED macro, which is what you would get with getc_unlocked() too. > > I'm assuming a non-crap implementation where getdelim performs memchr > in the buffer. Of course if you implement getdelim in terms of slow > primitives it will be slow. Both glibc and my implementation in musl > are much faster than repeatedly calling getc_unlocked.
Mmm.. yeah, uclibc needs fixing here. I'm still slightly doubtful how much the performance difference would be actually. It probably depends on the line lengths totally. On small lines, it's not much. On really long lines, it could be essential. >> Actually getc_unlocked is likely faster, because getdelim/getline still >> do regular locking once for each line read. And there just simple isn't >> _unlocked variants of them. >> >> We should be really using _unlocked here anyway if available. Since the >> parsing api is not intended to be multithreaded anyway, where as regular >> stdio is. > > Locking should not be performed unless there's more than one thread > anyway. The only cost of the locked variants is a single conditional > to determine that locking does not need to be performed, and the fact > that the single-character locked functions (getc etc.) won't be > inlined. But for larger operations like getdelim, fread, etc. the > difference is not measurable whatsoever. That's probably why the > standard doesn't bother including ugly _unlocked variants for > everything. Locking is not performed for pthread_mutex_* in this case. However, glibc and uclibc with nptl, do locking for stdio stuff. The reason is that both support late loading of libpthread with dlopen. This is why both implementations use special locking primitives for stdio. In glibc case it's the lowlevellock stuff or lll_lock. It's done unconditionally, and always. And does at least one atomic instruction per lock acquire. I believe the same is true for malloc stuff. glibc malloc uses libc_lock_lock, which resolves lll_lock (while building libc/libpthread including malloc), or _pthread_mutex_lock for the other aux. libraries. So in both cases locking cannot be avoided. >>>> However, (2) is trickier, because the whole API is designed exactly for >>> >>> I'm not convinced that (2) has any bearing whatsoever on performance >>> unless you're using some hideous malloc implementation that calls mmap >>> for each allocation. Most calls to malloc (when new brk/mmap isn't >>> required) should cost about the same as most calls to getc (when >>> buffer isn't empty). >> >> This depends very much on your malloc implementation. E.g. on uclibc >> calling malloc guarantees that you do at least one mutex lock/unlock >> operation. And yes, it will give you performance penalty similar to not >> using the _unlocked variants. > > Then this should be fixed in uclibc. There's no reason for the mutex > lock/unlock to be called at all if no threads have been created. See above. locking is done. >> If you want to optimise things, let's do it properly. > > That's what I want to see done. "Let's assume everything that > should(*) be fast is slow and everything that should be slow is fast" > is not what I'd call "properly". It's better to not do stuff at all if it can be avoided. Just assuming things that should be fast are fast, doesn't usually work. For the getdelim stuff, it really depends on the input files which is faster, and probably needs benchmarking. _______________________________________________ busybox mailing list [email protected] http://lists.busybox.net/mailman/listinfo/busybox
