Hi Padraig,

> > T=1: 5.10s
> > T=2: 2.87s
> > T=3: 2.71s
> > T=4: 1.75s
> > T=5: 1.66s
> > T=6: 1.65s
> > T=7: 1.67s
> > T=8: 1.31s
>
> Nice results!
>
> A few quick questions:
>
> Any thoughts on the interesting jump at T=8?

Say we're sorting 32 lines with 8 threads: each thread would get 4
lines to sort. If we sort with 7 threads, then 6 threads would get 4
lines each, and the last thread would get 8. That last thread becomes
the bottleneck. A way around this would be, when sorting with 7
threads, to have 6 threads sort 5 lines each and the last thread sort
2. A more "wow" example might be 1000 lines with 3 threads: we could
have 250, 250, and 500, with 500 being the bottleneck, or 333, 333,
and 334.

To divide the lines up this way, we'd need to compute, at the very
start, nlines / nthreads for every thread except one, and
nlines - (nthreads - 1) * (nlines / nthreads) for the last thread.
However, this method implies creating all the threads in a loop, which
isn't as elegant as recursion. I've used this approach for a previous
patch, but for some reason never thought of it here. I'll try it out
and see how much the results differ.

> Have you tested in conjunction with the external || patch?

I actually haven't, though I'm really interested in knowing how the
speedups will multiply. Joey and I talked about calling sortlines with
NTHREADS / N threads when sorting on N disks with a balanced workload.

> You previously mentioned a thread bug with memcoll.
> Is that worked around?

That happened when more than one instance of memcoll was called on the
same line at once, since memcoll replaces the eolchar with '\0'. Under
our approach, the same line should never be compared by two threads at
the same time, so we're fine.
On top of that, Professor Eggert suggested NUL-delimiting all lines as
they're read in, so memcoll doesn't have to. Hence the patch to gnulib,
which introduces xmemcoll_nul and memcoll_nul for input that is known
to be NUL-delimited. They never need to replace the eolchar, which
makes the comparison thread-safe.
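
For reference, here is a minimal C sketch of the chunk-size arithmetic
from the T=8 discussion above. It is not taken from the sort patch: the
names chunk_floor, chunk_balanced, and print_split are hypothetical.
chunk_floor follows the formula as written (nlines / nthreads for every
thread but the last, the rest for the last thread). With truncating
division that still leaves 8 lines on the last of 7 threads for 32
lines, so chunk_balanced shows one possible alternative that spreads the
remainder one line at a time.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: number of lines for thread I (0-based) when
   every thread but the last gets NLINES / NTHREADS lines and the last
   thread takes whatever is left over.  */
size_t
chunk_floor (size_t nlines, size_t nthreads, size_t i)
{
  assert (nthreads > 0 && i < nthreads);
  size_t base = nlines / nthreads;
  return i + 1 < nthreads ? base : nlines - (nthreads - 1) * base;
}

/* Hypothetical alternative: give one extra line to each of the first
   NLINES % NTHREADS threads, so chunk sizes differ by at most one.  */
size_t
chunk_balanced (size_t nlines, size_t nthreads, size_t i)
{
  assert (nthreads > 0 && i < nthreads);
  return nlines / nthreads + (i < nlines % nthreads ? 1 : 0);
}

/* Print both splits side by side for a given input size.  */
void
print_split (size_t nlines, size_t nthreads)
{
  printf ("%zu lines, %zu threads\n", nlines, nthreads);

  printf ("  floor:   ");
  for (size_t i = 0; i < nthreads; i++)
    printf (" %zu", chunk_floor (nlines, nthreads, i));
  printf ("\n");

  printf ("  balanced:");
  for (size_t i = 0; i < nthreads; i++)
    printf (" %zu", chunk_balanced (nlines, nthreads, i));
  printf ("\n");
}
```

Calling print_split (1000, 3) gives 333 333 334 for the first scheme and
334 333 333 for the second. Calling print_split (32, 7) gives
4 4 4 4 4 4 8 versus 5 5 5 5 4 4 4, so the second scheme avoids the
8-line straggler in that case.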