On 2020-12-13 23:19:25 [+0200], Lasse Collin wrote:
> > Threads, which finished decoding, remain idle until their output
> > buffer has been fully consumed. The output buffer once allocated
> > remains allocated until the thread is cleaned up. This saved 5 secs
> > in the example above compared to freeing the buffer once the buffer
> > was fully consumed and allocating it again once there is new data.
> > The input buffer is freshly allocated for each block since they vary
> > in size in general.
>
> Yes, reusing buffers and encoder/decoder states can be useful (fewer
> page faults). Perhaps even the input buffer could be reused if it is OK
> to waste some memory and it makes a difference in speed.
I tried two archives with 16 & 3 CPUs and the time remained the same. I
tried to only increase the in-buffer, and also to allocate the block
size for the in-buffer as well. No change.

> > I made my own output queue since the output size is known. I have no
> > idea if this is good or if it would be better to use lzma_outq
> > instead.
>
> The current lzma_outq isn't flexible enough for a decoder. It's a bit
> primitive even for encoding: it works fine but it wastes a little
> memory. However, since the LZMA encoder needs a lot of memory anyway,
> the overall difference is around (or under) 10 % which likely doesn't
> matter too much.
>
> The idea of lzma_outq is to have a pool for output buffers that is
> separate from the pool of worker threads. Different data takes a
> different amount of time to compress. The separate pools allow Blocks
> to finish out of order and worker threads to be reused immediately as
> long as there is enough extra buffer space in the output queue. This is
> an important detail for encoder performance (to prevent idle threads),
> and with a quick try it seems it might help with decoding too. The
> significance depends a lot on the data, of course.

I tried to decouple the thread and the out-buffer, but after several
failures I decided to keep it simple for the start. I do have idle
threads after a while with 16 CPUs. The alternative is to keep them busy
at the cost of more memory. With 4 CPUs I get to

| 100 %     10,2 GiB / 40,0 GiB = 0,255   396 MiB/s       1:43

and this is of course CPU bound due to the `sha1' part as consumer. I
don't know how reasonable this performance is if it means that you have
to write 400 MiB/s to disk. Of course, should it become an issue, it can
still be decoupled.

Sebastian