Since this topic refuses to die, I would like to have a thread where
the pros and cons of memory usage limits are explained in detail. I
can then tell people to read this thread first.

People have strong and conflicting opinions, and it's not possible to
make everyone happy. I'm not proposing any changes, but I might
consider some tweaks depending on what people think.


The default memory usage limits were removed in 5.0.0. The problem with
the limits was that xz could refuse to decompress some files on a
low-memory system even if there was enough free swap space to get the
job done. There are situations where it's useful to have a limiter for
decompression, but it's clear that single-threaded decompression must
not have any limits by default.
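
For the situations where a limit for decompression is useful, it can
still be set explicitly. A minimal example (the 100 MiB value is
arbitrary and foo.xz is just a placeholder name):

    xz --memlimit-decompress=100MiB -dk foo.xz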

Unsurprisingly, the removal of the memory usage limits had the
opposite effect when compressing. With a default limit, xz used to
adjust the compression settings downwards on a low-memory system. Now
it can fail on the same system, requiring the user to adjust the
settings manually or to set a memory usage limit for compression.
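
For example, an explicit limit makes xz scale the settings down again
instead of failing; the 256 MiB value is just an illustration:

    xz --memlimit-compress=256MiB -9 big.tar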

Some people don't like that xz doesn't "just work" on low-end systems.
They think it's better that xz runs even if it means worse
compression, because xz failing in the middle of a big script isn't
much fun. This mirrors the old situation with decompression, where
hitting the default memory usage limit in the middle of a big script
wasn't fun either.

It has also been suggested that xz should check RLIMIT_DATA and
RLIMIT_AS when compressing (but not when decompressing). They would be
used as an implicit memory usage limit to prevent xz from trying to
allocate more memory than the system will allow. Since xz will fail and
exit if malloc() doesn't succeed, checking the resource limits would
keep xz working as long as they are reasonably high.
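
As a rough sketch of the idea, something similar can be done in a
wrapper script today; in shells where ulimit -v is available, it
reports RLIMIT_AS in KiB. The wrapper below is only an illustration,
not anything xz implements:

    # Derive a compression memory limit from RLIMIT_AS, if one is set.
    limit=$(ulimit -v)
    if [ "$limit" = "unlimited" ]; then
        xz -9 big.tar
    else
        xz --memlimit-compress="${limit}KiB" -9 big.tar
    fi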


There are reasons why it can be good that xz fails instead of just
working when there isn't enough RAM or the resource limits are too low.

xz prints a notice when it adjusts the compression settings. Since
compression will still succeed, that notice may get lost if xz is run
non-interactively. It can take a while before users figure out why the
same script gives different output on different systems running
identical software. Many people don't read logs unless something has
already gone wrong.
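
For example, a run like the following prints a notice on stderr that
is easy to miss in a cron job. The wording below is recalled from
memory and may differ between versions; the limit and sizes are
illustrative:

    $ XZ_DEFAULTS=--memlimit-compress=200MiB xz -9 big.tar
    xz: Adjusted LZMA2 dictionary size from 64 MiB to 16 MiB to not
    exceed the memory usage limit of 200 MiB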

If xz fails when there isn't enough RAM+swap or the resource limits are
too low, it will be obvious to the user that xz cannot do exactly what
it was asked to do. If the user has set low resource limits just in
case, even on a system with plenty of RAM, getting an error will make
the user increase the limits. The notice that xz prints when it adjusts
the settings should do that too when xz is run interactively, but in
the non-interactive case described in the previous paragraph it's not
so obvious.

Sometimes the adjusted settings will make the compression ratio much
worse than it would be without adjusting. Sometimes this is fine
because it keeps the compression speed sane. Sometimes it would be
better if xz used the original settings even if that makes things very
slow due to heavy swapping.

Sometimes repeatable compression is needed, that is, getting the same
compressed output from the same input on multiple computers. Since the
output of xz may vary between versions, this requires using the same xz
version on every system. I think this is a somewhat unusual situation,
although, if I remember correctly, it is needed e.g. by DeltaRPM to get
the package signatures to match. It can be argued that signing
compressed data isn't the best approach, but that's off-topic in this
thread.
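
As a sketch, repeatability means roughly that the following must
produce identical output on every machine, assuming the same xz
version everywhere; GNU env's -u is used here to clear any inherited
defaults:

    # Pin the settings explicitly and ignore environment defaults.
    env -u XZ_DEFAULTS -u XZ_OPT xz -9e < input > output.xz
    sha256sum output.xz    # must match on every machine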

When repeatable compression is needed, it's essential that xz doesn't
change the compression settings, even if that means heavy swapping or
compression failing completely due to a failed malloc(). Having a
limiter enabled by default is a bit risky, because the limiter would
get hit only on low-end systems, so developers could easily forget to
add an option to explicitly disable the limiter for this use case.
Documenting it doesn't help much because many people don't read docs
unless something has already gone wrong.

Some think that a UNIX program should blindly try to do whatever the
user tells it and fail only if the task simply cannot be done.
Self-aware programs that adapt to their environment (especially when
it affects the output) belong on Windows. ;-)

It can be argued that most of the above cases are not so common, and
that the common case is that people want xz to just work even if it
means suboptimal compression. It depends on what a person finds most
important, so there are conflicting wishes.


When the default limit was removed, the XZ_DEFAULTS environment
variable was added to let people set default limits. It was very
simple to add because there was already support for the XZ_OPT
variable.

Using the XZ_DEFAULTS environment variable to set default memory usage
limits (see the example after the list below) isn't liked by everyone
who wants to enable limits:

  - If you login interactively with ssh, the shell startup scripts are
    executed and XZ_DEFAULTS will be set. But if ssh is used to run a
    remote command (e.g. "ssh myhost myprogram"), the startup scripts
    aren't read and XZ_DEFAULTS won't be there.

  - /etc/profile or equivalent usually isn't executed by initscripts
    when starting daemons. Some daemons use xz.

  - People don't want to pollute the environment with variables that
    affect only one program.
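
For reference, enabling limits this way means putting something like
the following in a shell startup file; the values are arbitrary
examples:

    # e.g. in ~/.profile or /etc/profile
    export XZ_DEFAULTS="--memlimit-compress=512MiB --memlimit-decompress=256MiB"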

Having a configuration file would fix the above problems, but XZ Utils
is already an over-engineered pig, so I'm not so eager to add config
file support.

I have thought about adding configure options that would allow setting
default limits for compression and decompression. Some may think that
this would confuse things even more, but on the other hand some people
already patch xz to have a default limit for compression.
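
To make it concrete, such options might look roughly like this. The
option names below are hypothetical; nothing like this exists in the
configure script:

    # Hypothetical option names, for illustration only:
    ./configure --enable-default-memlimit-compress=512MiB \
                --enable-default-memlimit-decompress=256MiB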


I haven't thought much about memory usage limits with threading, but
below are some preliminary thoughts.

With compression, -T0 in 5.1.1alpha sets the number of threads to match
the number of CPU cores. If no memory usage limit has been set, xz may
end up using more memory than there is RAM. Pushing the system to swap
with threading is silly, because the point of threading in xz is to
make it faster. So it might make sense to have some kind of default
soft limit that caps the number of threads when an automatic number of
threads is requested.
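
A sketch of the intended interaction, with illustrative values:
combine the automatic thread count with an explicit limit, and let the
limit cap the number of threads rather than pushing the system into
swap:

    # Use all cores, but keep total memory usage under the limit by
    # reducing the number of threads if needed.
    xz -T0 --memlimit-compress=2GiB big.tar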

With threaded decompression (not implemented yet) and no memory usage
limit, the worst case is that xz will try to read the whole input file
into memory, which is silly. So it will probably need some sort of
default soft limit to keep the maximum memory usage sane. The
definition of sane is unclear though; it's not necessarily the same as
for compression.

-- 
Lasse Collin  |  IRC: Larhzu @ IRCnet & Freenode
