On 2020-12-13 23:19:25 [+0200], Lasse Collin wrote:
> > Threads, which finished decoding, remain idle until their output
> > buffer has been fully consumed. The output buffer once allocated
> > remains allocated until the thread is cleaned up. This saved 5 secs
> > in the example above compared to freeing the buffer once the buffer
> > was fully consumed and allocating it again once there is new data.
> > The input buffer is freshly allocated for each block since they vary
> > in size in general.
> 
> Yes, reusing buffers and encoder/decoder states can be useful (fewer
> page faults). Perhaps even the input buffer could be reused if it is OK
> to waste some memory and it makes a difference in speed.

I tried two archives with 16 & 3 CPUs and the time remained the same. I
tried both just enlarging the in-buffer and allocating the full Block
size for the in-buffer up front. Neither made a difference.

> > I made my own output queue since the output size is known. I have no
> > idea if this is good or if it would be better to use lzma_outq
> > instead.
> 
> The current lzma_outq isn't flexible enough for a decoder. It's a bit
> primitive even for encoding: it works fine but it wastes a little
> memory. However, since the LZMA encoder needs a lot of memory anyway,
> the overall difference is around (or under) 10 % which likely doesn't
> matter too much.
> 
> The idea of lzma_outq is to have a pool for output buffers that is
> separate from the pool of worker threads. Different data takes
> different amount of time to compress. The separate pools allow Blocks
> to finish out of order and reusing worker threads immediately as long
> as there is enough extra buffer space in the output queue. This is an
> important detail for encoder performance (to prevent idle threads) and
> with a quick try it seems it might help with decoding too. The
> significance depends a lot on the data, of course.

I tried to decouple the thread and the out-buffer, but after several
failures I kept it simple for a start.
I do get idle threads after a while with 16 CPUs. The alternative is to
keep them busy at the cost of more memory. With 4 CPUs I get to
|  100 %         10,2 GiB / 40,0 GiB = 0,255   396 MiB/s       1:43             

and this is of course CPU bound due to the `sha1' part acting as the
consumer. I don't know how meaningful this performance is if it means
you have to write 400 MiB/s to disk. Of course, should that become an
issue, it can still be decoupled.

Sebastian