[xz-devel] XZ Utils 5.2.13, 5.4.7, and 5.6.2
--enable-small. (CMake build doesn't support ENABLE_SMALL in
      XZ Utils 5.2.x.)

    * xz:

        - Fix a C standard conformance issue in --block-list parsing
          (arithmetic on a null pointer).

        - Fix a warning from GNU groff when processing the man page:
          "warning: cannot select font 'CW'"

        - Windows: Handle special files such as "con" or "nul". Earlier
          the following wrote "foo" to the console and deleted the
          input file "con_xz":

              echo foo | xz > con_xz
              xz --suffix=_xz --decompress con_xz

        - Windows: Fix an issue that prevented reading from or writing
          to non-terminal character devices like NUL.

    * xzless:

        - With "less" version 451 and later, use "||-" instead of "|-"
          in the environment variable LESSOPEN. This way compressed
          files that contain no uncompressed data are shown correctly
          as empty.

        - With "less" version 632 and later, use --show-preproc-errors
          to make "less" show a warning on decompression errors.

    * Build systems:

        - Add a new line to liblzma.pc for MSYS2 (Windows):

              Cflags.private: -DLZMA_API_STATIC

          When compiling code that will link against static liblzma,
          the LZMA_API_STATIC macro needs to be defined on Windows.

        - Autotools (configure):

            * The symbol versioning variant can now be overridden with
              --enable-symbol-versions. Documentation in INSTALL was
              updated to match.

        - CMake:

            * Fix a bug that prevented other projects from including
              liblzma multiple times using find_package().

            * Fix a bug where configuring CMake multiple times resulted
              in HAVE_CLOCK_GETTIME and HAVE_CLOCK_MONOTONIC not being
              defined.

            * Fix the build with MinGW-w64-based Clang/LLVM 17.
              llvm-windres now has more accurate GNU windres emulation,
              so the GNU windres workaround from 5.4.1 is needed with
              llvm-windres version 17 too.

            * The import library on Windows is now properly named
              "liblzma.dll.a" instead of "libliblzma.dll.a".

            * Add large file support by default for platforms that need
              it to handle files larger than 2 GiB. This includes
              MinGW-w64, even 64-bit builds.

            * Linux on MicroBlaze is handled specially now.
              This matches the changes made to the Autotools-based
              build in XZ Utils 5.4.2 and 5.2.11.

            * Disable symbol versioning on non-glibc Linux to match
              what the Autotools build does. For example, symbol
              versioning isn't enabled with musl.

            * The symbol versioning variant can now be overridden by
              setting SYMBOL_VERSIONING to "OFF", "generic", or
              "linux".

    * Documentation:

        - Clarify the description of --disable-assembler in INSTALL.
          The option only affects 32-bit x86 assembly usage.

        - Don't install the TODO file as part of the documentation.
          The file is out of date.

        - Update home page URLs back to their old locations on
          tukaani.org.

        - Update maintainer info.

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-03-20 Brett Okken wrote:
> The jdk8 changes show nice improvements over head. My assumption is
> that with less math going on in the offsets of the while loop allowed
> the jvm to better optimize.

Sounds good, thanks! :-)

> I am surprised with the binary math behind your handling of long
> comparisons here:

I had to refresh my memory as I hadn't commented it in memcmplen.h.
Now it is (based on Agner Fog's microarchitecture.pdf):

  - On some x86-64 processors (Intel Sandy Bridge to Tiger Lake),
    sub+jz and sub+jnz can be fused but xor+jz or xor+jnz cannot. Thus
    using subtraction has potential to be a tiny amount faster since
    the code checks if the difference is non-zero.

  - Some processors (Intel Pentium 4) used to have more ALU resources
    for add/sub instructions than and/or/xor.

So in the C code it's not a huge thing and in Java it's probably about
nothing. But there is no real downside to using subtraction.

I understand how xor seems a more obvious choice. However, when
looking for the lowest differing bit, subtraction will make that bit 1
and the bits below it 0. Only the bits above the 1 will differ between
subtraction and xor but those bits are irrelevant here.

I created a new branch, bytearrayview, which combines the CRC64 edits
with the encoder speed changes as they share the ByteArrayView class
(formerly ArrayUtil).

> > I still need to check a few of your edits if some of them should be
> > included. :-)
>
> I think the changes to LZMAEncoderNormal as part of this PR to avoid
> the negative length comparison would be good to carry forward.

Done, I hope.

> 1. Use an interface with implementation chosen statically to separate
> out the implementation options.

I had an early version that used separate implementation classes but I
must have done something wrong as that version was *clearly* slower.
So I tried it again and it's as you say, no speed difference. :-)

> 2. Allow specifying the implementation to use with a system property.

Done.
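As a standalone illustration of the subtraction-vs-xor point above (a
sketch, not code from xz-java; the class and method names are made up):
for two unequal longs, the lowest set bit of a - b is the same as the
lowest set bit of a ^ b, so both yield the same first differing byte
index via Long.numberOfTrailingZeros.

```java
public class MatchLenDemo {
    // Index (0-7) of the lowest differing byte of two unequal longs,
    // i.e. the first differing byte when the longs were read in
    // little-endian order from a byte array.
    static int firstDiffByteXor(long a, long b) {
        return Long.numberOfTrailingZeros(a ^ b) >>> 3;
    }

    // Subtraction sets the lowest differing bit to 1 and clears the
    // bits below it; only the bits above it differ from xor, and
    // those don't affect numberOfTrailingZeros.
    static int firstDiffByteSub(long a, long b) {
        return Long.numberOfTrailingZeros(a - b) >>> 3;
    }

    public static void main(String[] args) {
        long a = 0x1122334455667788L;
        long b = 0x1122AA4455667788L; // differs only in byte index 5
        System.out.println(firstDiffByteXor(a, b)); // prints 5
        System.out.println(firstDiffByteSub(a, b)); // prints 5
    }
}
```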
I hope it's done in a sensible enough way. The Java < 9 code is
completely separate so it cannot be chosen. The property needs to be
documented somewhere too.

I suppose the ARM64 speed is still to be determined by you or someone
else.

-- Lasse Collin
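The statically chosen implementation plus system-property override
discussed in this thread could be structured roughly like this. All
names here (the interface, the property key, the classes) are
hypothetical, not the real xz-java API; the optimized pick is left as a
placeholder.

```java
// Hypothetical sketch of picking an array-comparison implementation
// once at class-load time, with an optional override such as
// -Dorg.example.matchLenImpl=basic on the command line.
interface MatchLength {
    int getMatchLen(byte[] buf, int i, int j, int lenLimit);
}

final class BasicMatchLength implements MatchLength {
    // Portable byte-by-byte version; works on any JDK.
    public int getMatchLen(byte[] buf, int i, int j, int lenLimit) {
        int len = 0;
        while (len < lenLimit && buf[i + len] == buf[j + len])
            ++len;
        return len;
    }
}

final class MatchLengthFactory {
    static final MatchLength INSTANCE = choose();

    private static MatchLength choose() {
        String pref = System.getProperty("org.example.matchLenImpl", "auto");
        if (pref.equals("basic"))
            return new BasicMatchLength();
        // "auto": here one would return an optimized variant
        // (VarHandle, Arrays.mismatch, ...) when running on a JDK
        // that supports it. Placeholder for the sketch:
        return new BasicMatchLength();
    }
}
```

Keeping the chosen instance in a static final field lets the JIT
devirtualize and inline the call, which is why the indirection need not
cost anything at runtime.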
Re: [xz-devel] xz-java and newer java
On 2024-03-12 Brett Okken wrote:
> I am still working on digesting your branch.

I still need to check a few of your edits if some of them should be
included. :-)

> The difference in method signature is subtle, but I think a key part
> of the improvements you are getting. Could you add javadoc to more
> clearly describe how the args are to be interpreted and what the
> return value means?

I pushed basic docs for getMatchLen. Once crc64_varhandle2 is merged
then array_compare should use ArrayUtil too. It doesn't make a
difference in speed.

> I am playing with manually unrolling the java 8 byte-by-byte impl
> along with tests comparing unsafe, var handle, and vector approaches.
> These tests take a long time to run, so it will be a couple days
> before I have complete results. Do you want data as I have it (and it
> is interesting), or wait for summary?

I can wait for the summary, thanks.

> I am not sure when I will get opportunity to test out arm64.

If someone has, for example, a Raspberry Pi, the compression of zeros
test is simple enough to do and at least on x86-64 has a clear enough
difference. It's an over-simplified test but it's a data point still.

> I do have some things still on jdk 8, but only decompression. Surveys
> seem to indicate quite a bit of jdk 8 still in use, but I have no
> personal need.

Thanks. I was already tilted towards not using Unsafe and now I'm even
more so. The speed benefit of Unsafe over VarHandle should be tiny
enough. It feels better that memory safety isn't ignored on any JDK
version. If a bug was found, it's nicer to not wonder if Unsafe had a
role in it. This is better for security too.

In my previous email I wondered if using Unsafe only with Java 8 would
make upgrading to a newer JDK look bad if the newer JDK used VarHandle
instead of Unsafe. Perhaps that worry was overblown. But the other
reasons and keeping the code simpler make me want to avoid Unsafe.
(C code via JNI wouldn't be memory safe but then the speed benefits
should be much more significant too.)

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-03-09 Brett Okken wrote:
> When I tested graviton2 (arm64) previously, Arrays.mismatch was
> better than comparing longs using a VarHandle.

Sounds promising. :-) However, your array_comparison_performance
handles the last 1-7 bytes byte-by-byte. My array_compare branch
reserves an extra 7 bytes at the end of the array so that one can
safely read up to 7 bytes more than one actually needs. This way no
bounds checks are needed (even with Unsafe). This might affect the
comparison between Arrays.mismatch and VarHandle if the results were
close before.

> I do like Unsafe as an option for jdk 8 users on x86 or arm64.

Unsafe seems very slightly faster than VarHandle. If Java 8 uses
Unsafe, should newer versions do too? It could be counter-productive
if Java 8 was faster, even if the difference was tiny.

Do you have use cases that are (for now) stuck on Java 8 or is your
wish a more generic one?

-- Lasse Collin
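A minimal sketch of the padding idea from the email above, with
hypothetical names (this is not the actual array_compare branch code):
the buffer is allocated with at least 7 spare bytes after the valid
data so that 8-byte VarHandle reads starting at any valid index stay
inside the array, and the 1-7 tail bytes need no separate byte-by-byte
loop.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class PaddedCompare {
    private static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class,
                                             ByteOrder.LITTLE_ENDIAN);

    // Length of the common prefix of buf[i..] and buf[j..], capped at
    // lenLimit. Assumes buf has at least 7 bytes of padding past the
    // valid data, so an 8-byte read at any valid index is in bounds.
    static int matchLen(byte[] buf, int i, int j, int lenLimit) {
        int len = 0;
        while (len < lenLimit) {
            long x = (long) LONGS.get(buf, i + len);
            long y = (long) LONGS.get(buf, j + len);
            long d = x - y;
            if (d != 0) {
                // Lowest set bit of d marks the first differing byte
                // (little-endian); clamp in case it is past lenLimit.
                int extra = Long.numberOfTrailingZeros(d) >>> 3;
                return Math.min(len + extra, lenLimit);
            }
            len += 8;
        }
        return lenLimit;
    }
}
```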
Re: [xz-devel] xz-java and newer java
I created a branch array_compare. It has a simple version for
Java <= 8 which seems very slightly faster than the current code in
master, at least when tested with OpenJDK 21. For Java >= 9 there is
Arrays.mismatch for portability and VarHandle for x86-64 and ARM64.
These are clearly faster than the basic version.

sun.misc.Unsafe would be a little faster than VarHandle but I feel
it's not enough to be worth the downsides (non-standard and not memory
safe). I didn't include 32-bit archs, for now at least, since if
people want speed I hope they don't run 32-bit Java.

Speed differences are very minor when testing with files that don't
compress extremely well. That was the problem I had with my earlier
test results. With files that have a compression ratio like 0.05 the
speed differences are clear. I cannot test on ARM64 so it would be
great if someone can, comparing the three versions. The most extreme
difference is when compressing just zeros:

    time head -c1 /dev/zero \
        | java -jar build/jar/XZEncDemo.jar > /dev/null

Internal docs should be added to the branch and perhaps there are
other related optimizations to do still. So it's not fully finished
yet but now it's ready for testing and feedback. For example, some
tweaks from your array_comp_incremental could be considered after
testing.

-- Lasse Collin
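For reference, the portable Java >= 9 approach mentioned above can be
built on java.util.Arrays.mismatch. This is a sketch with made-up
names, not the branch's actual code; the JDK can intrinsify mismatch
into vectorized code, which is why it competes with hand-written
VarHandle loops.

```java
import java.util.Arrays;

public class MismatchLen {
    // Match length of buf[i..] vs buf[j..], capped at lenLimit.
    // Arrays.mismatch returns the relative index of the first
    // differing byte, or -1 when the two ranges are fully equal.
    static int matchLen(byte[] buf, int i, int j, int lenLimit) {
        int r = Arrays.mismatch(buf, i, i + lenLimit,
                                buf, j, j + lenLimit);
        return r < 0 ? lenLimit : r;
    }
}
```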
Re: [xz-devel] [BUG] Issue with xz-java: Unknown Filter ID
On 2024-03-05 Dennis Ens wrote:
> > I hope 1.10 could be done in a month or two but I don't want to
> > make any promises or serious predictions. Historically those
> > haven't been accurate at all.
>
> I'll hope it's on the sooner side then. Is there a reason that
> xz-java is so far behind its counterpart?

These are unpaid hobby projects and the maintainers work on things
they happen to find interesting. The focus was on XZ Utils for quite a
long time; now more attention is returning to XZ for Java.

> It seems those filters have been in that version for a while, and it
> seems strange they aren't compatible with each other. Maybe this
> should be made more clear in the README?

The README file in XZ for Java 1.9 specifies that the code implements
the .xz file format specification version 1.0.4. That doesn't include
the ARM64 or RISC-V filters. The ARM64 filter was in the master branch
already. The RISC-V filter is there now too among a few other changes.
README refers to spec version 1.2.0 now.

I understand it can be cryptic to refer to a spec version but
obviously one cannot list what future things are missing. One could
list supported filters but in theory something else could be extended
too.

> I don't see anything about contributing on the xz-java github page.
> What are the best practices for contributing to this project?

I'm not sure if there is anything specific. Chatting on #tukaani can
be good to get ideas discussed quickly but it requires that people
happen to be online at the same time.

> > The encoder implementations have some minor differences which
> > affect both output and speed. Different releases can in theory
> > have different output. XZ Utils output might change in future
> > versions too.
>
> I see, that makes sense. I'm glad the difference is explainable and
> not a bug. Can you explain exactly what the differences are?

I don't remember much now. It's minor details but minor differences
affect the output already.
> Does xz-java always do a better job compressing since it resulted in
> a smaller file?

They should be very close in practice. You need to compare to XZ Utils
in single-threaded mode: xz -T1

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-02-29 Brett Okken wrote:
> > Thanks! Ideally there would be one commit to add the minimal
> > portable version, then separate commits for each optimized
> > variant.
>
> Would you like me to remove the Unsafe based impl from
> https://github.com/tukaani-project/xz-java/pull/13?

There are new commits in master now and those might slightly conflict
with your PR (@Override additions). I'm playing around a bit and
learning about the faster methods still. So right now I don't have
wishes for changes; I don't want to request anything when there's a
possibility that some other way might end up looking more preferable.

In general, I would prefer splitting to more commits. Using your PR as
an example:

  1. Adding the changes to lz/*.java and the portable *Array*.java
     code required by those changes.

  2. Adding one advanced implementation that affects only the
     *Array*.java files.

  3. Repeat step 2 until all implementations are added.

When reasonably possible, the line length should be under 80 chars.

> > So far I have given it only a quick try. array_comp_incremental
> > seems faster than xz-java.git master. Compression time was reduced
> > by about 10 %. :-) This is with OpenJDK 21.0.2, only a quick test,
> > and my computer is old so I don't doubt your higher numbers.
>
> How are you testing? I am using jmh, so it has a warm up period
> before actually measuring, giving the jvm plenty of opportunity to
> perform optimizations. If you are doing single shot executions to
> compress a file, that could provide pretty different results.

I was simply timing XZEncDemo at the default preset (6). I had hoped
that big files (binary and source packages) that take tens of seconds
to compress, repeating each test a few times, would work well enough.
But perhaps the difference is big enough only with certain types of
files.
On 2024-03-05 Brett Okken wrote:
> I have added a comment to the PR with updated benchmark results:
> https://github.com/tukaani-project/xz-java/pull/13#issuecomment-1977705691

Thanks! I'm not sure if I read the results well enough. The "Error"
column seems to have oddly high values on several lines. If the same
test set is run again, are the results in the "Score" column similar
enough between the two runs, retaining the speed order of the
implementations being tested?

If the first file is only ~66KB, I wonder if other factors like
initializing large arrays in the classes take so much time that
differences in array comparison speeds become hard to measure. When
each test is repeated by the benchmarking framework, each run has to
allocate the classes again. Perhaps it might trigger garbage
collection. Did you have ArrayCache enabled?

    ArrayCache.setDefaultCache(BasicArrayCache.getInstance());

I suppose optimizing only for new JDK version(s) would be fine if it
makes things easier. That is, it could be enough that performance
doesn't get worse on Java 8.

If the indirection adds overhead, would it make sense to have a
preprocessing step that creates .java file variants that directly use
the optimized methods? So LZMAEncoder.getInstance could choose at
runtime if it should use LZMAEncoderNormalPortable or
LZMAEncoderNormalUnsafe or some other implementation. That is, if this
cannot be done with a multi-release JAR. It's not a pretty solution
but if it is faster then it could be one option, maybe.

Negative lenLimit currently occurs in two places (at least). Perhaps
it should be handled in those places instead of requiring the array
comparison to support it (the C code in liblzma does it like that).

-- Lasse Collin
Re: [xz-devel] [BUG] Issue with xz-java: Unknown Filter ID
On 2024-03-05 Dennis Ens wrote:
> > The XZ for Java development is becoming active again but it may
> > still take a while until the next stable release is out. A few
> > other things are waiting in the queue from the past three years.
>
> Ah, I see. Thank you for the answer. Do you have a timeline of when
> the changes are expected?

I hope 1.10 could be done in a month or two but I don't want to make
any promises or serious predictions. Historically those haven't been
accurate at all.

> First, xz-java seems much slower. I tested compressing and
> decompressing a ~1.2 gigabyte file, and xz-java took 17m32.345s
> compared to xz's 7m7.615s to compress. Decompressing was 0m21.760s
> to 0m6.223s. Is there anything that can be done to improve the speed
> of the Java version, or is c just a much more efficient programming
> language?

Brett Okken's patches (originally from early 2021) should improve
compression speed. They are currently under review. Those are one of
the things to get into the next stable release.

However, Java in general is slower. Some compressors have a Java API
but the performance-critical code is native code. For example,
java.util.zip calls into native code from zlib. XZ for Java doesn't
use any native code (for now at least).

XZ for Java still lacks threading. Implementing it is among the most
important tasks in XZ for Java. It helps with big files like your test
file but makes the compressed file a little bigger.

From your numbers I'm not certain if you used xz in threaded mode or
not. The time difference looks unusually high for single-threaded mode
for both compression and decompression. The difference for a big input
file in threaded mode looks small though (unless it had lots of
trivially-compressible sections). In single-threaded mode, I would
expect compressing with xz to take around 30-40 % less time than XZ
for Java but your numbers show a 60 % time reduction.
XZ Utils 5.6.0 added x86-64 assembly (GCC & Clang only) which reduces
per-thread decompression time by 20-40 % depending on the file and the
computer. So that increases the difference between XZ Utils and XZ for
Java too: decompression time can be roughly 50 % less with XZ Utils
5.6.0 in single-threaded mode on x86-64 compared to XZ for Java. XZ
Utils 5.6.0 also enables threaded mode by default.

> Also, I noticed that the results of compressing the files were
> different sizes. They both worked, so I don't know if it's an issue,
> but it does seem strange. The xz-java one was slightly smaller than
> the xz one.

The encoder implementations have some minor differences which affect
both output and speed. Different releases can in theory have different
output. XZ Utils output might change in future versions too.

-- Lasse Collin
Re: [xz-devel] [BUG] Issue with xz-java: Unknown Filter ID
On 2024-03-05 Dennis Ens wrote:
> The files specifically were good-1-arm64-lzma2-1.xz and
> good-1-arm64-lzma2-2.xz and good-1-riscv-lzma2-1.xz and
> good-1-riscv-lzma2-2.xz. These did seem to work fine when I tried
> with xz, but not with xz-java. Do you think there might be a fix
> available for this soon?

XZ for Java 1.9 doesn't have the ARM64 or RISC-V filter. The master
branch has the ARM64 filter. The RISC-V filter will likely be there
this week.

The XZ for Java development is becoming active again but it may still
take a while until the next stable release is out. A few other things
are waiting in the queue from the past three years.

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-02-25 Brett Okken wrote:
> I created https://github.com/tukaani-project/xz-java/pull/13 with the
> bare bones changes to utilize a utility for array comparisons and an
> Unsafe implementation.
> When/if that is reviewed and approved, we can move on through the
> other implementation options.

Thanks! Ideally there would be one commit to add the minimal portable
version, then separate commits for each optimized variant.

So far I have given it only a quick try. array_comp_incremental seems
faster than xz-java.git master. Compression time was reduced by about
10 %. :-) This is with OpenJDK 21.0.2, only a quick test, and my
computer is old so I don't doubt your higher numbers.

With array_comparison_performance the improvement seems to be less,
maybe 5 %. I didn't test much yet but it still seems clear that
array_comp_incremental is faster on my computer.

However, your code produces different output compared to xz-java.git
master so the speed comparison isn't entirely fair. I assume there was
no intent to affect the encoder output with these changes so I wonder
what is going on. Both of your branches produce the same output so
it's something common between them that makes the difference. I plan
to get back to this next week.

> > One thing I wonder is if JNI could help.
>
> It would most likely make things faster, but also more complicated. I
> like the java version for the simplicity. I am not necessarily
> looking to compete with native performance, but would like to get
> improvements where they are reasonably available. Here there is some
> complexity in supporting multiple implementations for different
> versions and/or architectures, but that complexity does not intrude
> into the core of the xz code.

I think your thoughts are similar to mine here. The Java version is
clearly slower but it's nicer code to read too. A separate class for
buffer comparisons indeed doesn't hurt the readability of the core
code.
On the other hand, if the Java version happened to be used a lot then
JNI could save both time (up to 50 %) and even electricity.
java.util.zip uses native zlib for the performance-critical code. In
the long run both faster Java code and JNI might be worth doing.
There's more than enough pure Java stuff to do for now so any JNI
thoughts have to wait.

-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.
On 2024-02-28 Sebastian Andrzej Siewior wrote:
> On 2024-02-28 18:45:03 [+0200], Lasse Collin wrote:
> > V_DEBUG was committed to the master and v5.6 branches a few moments
> > ago, so yes, your plan sounds good. :-) Feel free to do it as you
> > prefer, either just making the change or picking the other simple
> > fixes from v5.6 as well.
>
> Perfect. I just took the patch.

Thanks! :-)

> > Hopefully the already-added workarounds in other packages don't
> > cause any unwanted side effects in the future.
>
> The plan was to revert it.

All good. :-)

There is a branch "memavail" on GitHub with experimental support for
MemAvailable from Linux /proc/meminfo. It needs discussion and
feedback (likely in a new thread). There is no rush as it's not for
5.6.x anyway.

-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.
On 2024-02-28 Sebastian Andrzej Siewior wrote:
> I see. In that case let me throw this to V_DEBUG Debian wise and sync
> with xz upstream once a new release is up or so. I have two packages
> that fail because of this and dpkg added a workaround. So instead of
> adding another workaround to another package I would fix this on the
> xz side. Sounds good?

V_DEBUG was committed to the master and v5.6 branches a few moments
ago, so yes, your plan sounds good. :-) Feel free to do it as you
prefer, either just making the change or picking the other simple
fixes from v5.6 as well.

Hopefully the already-added workarounds in other packages don't cause
any unwanted side effects in the future.

Thanks!

-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.
On 2024-02-27 Sebastian Andrzej Siewior wrote:
> On 2024-02-27 19:17:48 [+0200], Lasse Collin wrote:
> > - The silencing could be done with -q as well though.
>
> Wouldn't -q also shut some legitimate warnings?

Yes. When compressing from stdin to stdout, there aren't many possible
warnings but there are still a few rare ones. So -q isn't ideal to get
rid of thread count reduction messages.

> Isn't the automatic memory usage accurate?

It's simply 25 % of total RAM. The Linux-specific MemAvailable from
/proc/meminfo didn't get into 5.6.0. Perhaps it could be done in the
next development cycle, and maybe also look for similar features on a
few other OSes.

> Not sure if documenting it in the man-page would help here.

One issue is that currently the message tells about thread count
reduction and what the memlimit is but not how much memory is actually
required. One needs to use -vv to get the usage info.

Documenting it on the man page could be good if it can be explained in
an understandable way and people can find it there. The man page is
long already. The less average users *need* to understand the details
the better.

> > There are also messages that are shown when the memory limit does
> > affect compressed output (switching to single-threaded mode and
> > LZMA2 dictionary size adjustment). The verbosity requirement of
> > these messages isn't being changed now.
>
> This sounds like you accept this change in principle but are thinking
> if V_VERBOSE or V_DEBUG is the right thing.

Me and three other people on IRC think it should be changed but there
is no consensus yet on what exactly is best (your patch, -v, or -vv).
This is about the thread count messages only as (since 5.4.0) the
automatic thread count doesn't affect the compressed output.

There is some discussion also here:
https://github.com/tukaani-project/xz/issues/89

-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.
On 2024-02-26 Sebastian Andrzej Siewior wrote:
> Print the warning about reduced threads only if number is selected
> - automatically and asked to be verbose (-v)
> - explicit by the user

Thanks for the patch! We discussed a bit on IRC and everyone thinks
it's on the right track but we are pondering the implementation
details still.

The thread count messages are shown in situations which don't affect
the compressed output, and thus the importance of these messages isn't
so high. Originally they were there to reduce the chance of people
asking why xz isn't using as many threads as requested.

We are considering simply changing those two message() calls to always
use V_VERBOSE or V_DEBUG instead of the current V_WARNING. So
automatic vs. manual number of threads wouldn't affect it like it does
in your patch. Comparing your approach and this simpler one:

  + There are scripts that take a user-specified number for
    parallelization and that number is passed to multiple tools, not
    just xz. Keeping xz -T16 silent about thread count reduction can
    make sense in this case.

  - The silencing could be done with -q as well though.

There are pros and cons between V_VERBOSE and V_DEBUG. For
(de)compression, a single -v sets V_VERBOSE and activates the progress
indicator. If the thread count messages are shown at -v, on some
systems progress indicator usage would get the message about reduced
thread count as well.

  + It works as a hint that increasing the memory usage limits
    manually might allow more threads to be used.

  - If one uses the progress indicator frequently, the thread count
    reduction message might become slightly annoying as the
    information is already known by the user.

  - The progress indicator can be used in non-interactive cases (when
    stderr isn't a terminal). Then xz only prints a final summary per
    file. This likely is not a common use case but the thread count
    messages would be here as well.

V_DEBUG is set when -v is used twice (-vv).

  + Regular progress indicator uses wouldn't get extra messages.

  - A larger number of users might not become aware that they aren't
    getting as many threads as they could because the automatic memory
    usage limit is too low to allow more threads.

There are also messages that are shown when the memory limit does
affect compressed output (switching to single-threaded mode and LZMA2
dictionary size adjustment). The verbosity requirement of these
messages isn't being changed now.

-- Lasse Collin
Re: [xz-devel] Testing LZMA_RANGE_DECODER_CONFIG
On 2024-02-19 Sebastian Andrzej Siewior wrote:
> Okay, so the input matters, too. I tried 1GiB urandom (so it does not
> compress so well) but that went quicker than expected…

urandom should be incompressible. When LZMA2 cannot compress a chunk
it stores it in uncompressed form. Decompression is like "cat with
CRC".

> I found 3 idle x86 boxes and re-ran a test with linux' perf on them
> and the arm64 box. In all flavours for the two archives. On RiscV I
> did the 'xz -t' thing because perf seems not to be supported well or
> I lack access.

Great work! Thanks! On IRC one person ran a bunch of tests too. On
ARM64 the results were mixed. A variant that was better with GCC could
be worse with Clang. So those weren't as clear as your results but
they too made me think that using 0 for non-x86-64 is the way to go
for 5.6.0.

Your x86-64 asm variant results were interesting too. Seems that the
bit 0x100 isn't good with GCC although the difference is small. I
confirmed this in the tests I did on a Celeron G1620 (Ivy Bridge). So
I wonder if 0x0F0 should be the x86-64 variant to use in xz 5.6.0 with
GCC.

On another machine with Clang 16, 0x100 is 8 % faster with the Linux
kernel source. So the difference is somewhat big. It's still slightly
slower than the GCC version. This is on a Phenom II X4 920.

Since 0x100 is only a little worse with GCC, using it for both GCC and
Clang could be OK. An #ifdef __clang__ could be used too but perhaps
it's not great in the long term. Something has to be chosen for 5.6.0;
further tweaks can be made later.

By the way, the "time" command gives more precise results than
"xz -v". I use

    TIMEFORMAT=$'\nreal\t%3R\nuser\t%3U\nsys\t%3S\ncpu%%\t%P'

in bash to keep the output as seconds instead of minutes and seconds.

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-02-19 Brett Okken wrote:
> I have created a pr to the GitHub project.
>
> https://github.com/tukaani-project/xz-java/pull/12

Thanks! It could be good to split it into smaller commits to make
reviewing easier.

> It is not clear to me if that is actually seeing active dev on the
> Java project yet.

I see now that there are quite a few things on GH. I had forgotten to
turn email notifications on for the xz-java project; clearly those
aren't on by default. :-( But likely not much would have been done
even if I had noticed those issues and PRs earlier so the main problem
is that the silence has been impolite. I'm sorry.

XZ Utils 5.6.0 has to be released this month since there was a wish to
get it into the next Ubuntu LTS. I'm hoping that next month something
will finally get done around XZ for Java. We'll see.

One thing I wonder is if JNI could help. Optimizing the Java code can
help a bit but I suspect that it still won't be very fast. So far it
has been nice that the Java code is quite readable and I would like to
keep it that way in the future too.

-- Lasse Collin
Re: [xz-devel] Testing LZMA_RANGE_DECODER_CONFIG
The balance between the hottest locations in the decompressor code
varies depending on the input file. Linux kernel source compresses
very well (ratio is about 0.10). This reduces the benefit of
branchless code. On my main computer I still get about 2 % time
reduction with =3. On another x86-64 computer I don't see any
difference between =0 and =3 with the Linux kernel source.

On the same machine, decompression time of warzone2100-data[1] from
Debian is reduced by 10.5 % with =3 compared to =0. It's a package
that doesn't compress so well (ratio is about 0.75). On my main
computer the time reduction from =0 to =3 is 8.5 %. All numbers are
with GCC.

Of course, on x86-64 the =0 vs. =3 test isn't that interesting since
the asm is so much better. But this highlights how much the test file
choice can make a difference.

[1] https://packages.debian.org/bookworm/all/warzone2100-data/download

-- Lasse Collin
Re: [xz-devel] Testing LZMA_RANGE_DECODER_CONFIG
On 2024-02-17 Sebastian Andrzej Siewior wrote:
> I did some testing on !x86. I changed LZMA_RANGE_DECODER_CONFIG to
> different values, ran a test, and looked at the MiB/s value. xz_0
> means LZMA_RANGE_DECODER_CONFIG was 0, xz_1 means the define was set
> to 1. I touched src/liblzma/lzma/lzma_decoder.c and rebuilt xz. I
> pinned the shell to a single CPU and ran the test for the archive
> (-tv) for one file three times.

Great to see testing! The testing method is fine. If pinning to a
single core, I assume --threads=1 was set as well because
multithreading is the default now.

Branchless code can help when branch prediction penalties are high. So
it will depend on the processor (not just the instruction set). On
x86-64, there was a clear improvement with the branchless C code. It
was a little more with Clang than GCC. So if easily possible, also
testing with Clang could be useful.

Testing your script on x86-64 could be worth it too, to check that at
least on x86-64 you get an improvement with =1 and =3 compared to =0.
(The bit 1 makes the main difference; 2 should have a small effect,
and 4 and 8 are questionable and perhaps not worth benchmarking until
the usefulness of =1 or =3 is clear.)

If the branchless C code is not consistent outside x86-64, then 5.6.0
likely should stick to =0. From your results it seems that the other
tweaks to the code provided a minor improvement on non-x86-64 still.
(The tweaks that LZMA_RANGE_DECODER_CONFIG doesn't affect.)

Thanks!

-- Lasse Collin
[xz-devel] XZ projects license change proposal
Hello! I have made a post on GitHub about possibly moving from public domain to BSD Zero Clause License: https://github.com/tukaani-project/xz/issues/79 Feedback is welcome. Feel free to comment on GitHub, privately via email to x...@tukaani.org, or on the xz-devel mailing list. Thank you! PS. XZ for Java has been idle longer than expected but it should finally get at least some attention in the coming months. -- Lasse Collin
Re: [xz-devel] [PATCH] [xz-embedded] Fix condition that automatically define XZ_DEC_BCJ
On 2023-09-07 Jules Maselbas wrote:
> The XZ_DEC_BCJ macro was not defined when only selecting the ARM64 BCJ
> decoder, leading to no BCJ decoder being compiled.
>
> The macro that select XZ_DEC_BCJ if any of the BCJ decoder is
> selected was missing a case for the recently added ARM64 BCJ decoder.
>
> Also the macro `defined(XZ_DEC_ARM)` was used twice in the condition
> for selecting XZ_DEC_BCJ, so this patch replaces one with
> XZ_DEC_ARM64.

Thanks! I kept the ordering of the filter names the same as elsewhere in the file and in xz_dec_bcj.c. The ARM64 filter still hasn't been submitted to Linux but it's on the to-do list.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.11 and 5.4.2
XZ Utils 5.2.11 and 5.4.2 are available at <https://tukaani.org/xz/>. The Doxygen-generated liblzma API documentation is now available online at <https://tukaani.org/xz/liblzma-api/files.html>.

Please let us know if there is interest in more releases for the 5.2 branch. Jia Tan and I will plan further bug-fix releases for this branch only if people use it.

Future release tarballs might be signed by Jia Tan. Recently he has done most of the work in XZ Utils. :-)

Here is an extract from the NEWS file:

5.2.11 (2023-03-18)

    * Removed all possible cases of null pointer + 0. It is undefined
      behavior in C99 and C17. This was detected by a sanitizer and
      had not caused any known issues.

    * Build systems:

      - Added a workaround for building with GCC on MicroBlaze Linux.
        GCC 12 on MicroBlaze doesn't support the __symver__ attribute
        even though __has_attribute(__symver__) returns true. The
        build is now done without the extra RHEL/CentOS 7 symbols
        that were added in XZ Utils 5.2.7. The workaround only
        applies to the Autotools build (not CMake).

      - CMake: Ensure that the C compiler language is set to C99 or
        a newer standard.

      - CMake changes from XZ Utils 5.4.1:

        * Added a workaround for a build failure with windres from
          GNU binutils.

        * Included the Windows resource files in the xz and xzdec
          build rules.

5.4.2 (2023-03-18)

    * All fixes from 5.2.11 that were not included in 5.4.1.

    * If xz is built with support for the Capsicum sandbox but running
      in an environment that doesn't support Capsicum, xz now runs
      normally without sandboxing instead of exiting with an error.

    * liblzma:

      - Documentation was updated to improve the style, consistency,
        and completeness of the liblzma API headers.

      - The Doxygen-generated HTML documentation for the liblzma API
        header files is now included in the source release and is
        installed as part of "make install". All JavaScript is
        removed to simplify license compliance and to reduce the
        install size.
      - Fixed a minor bug in lzma_str_from_filters() that produced
        too many filters in the output string instead of reporting an
        error if the input array had more than four filters. This bug
        did not affect xz.

    * Build systems:

      - autogen.sh now invokes the doxygen tool via the new wrapper
        script doxygen/update-doxygen, unless the command line option
        --no-doxygen is used.

      - Added microlzma_encoder.c and microlzma_decoder.c to the VS
        project files for Windows and to the CMake build. These
        should have been included in 5.3.2alpha.

    * Tests:

      - Added a test to the CMake build that was forgotten in the
        previous release.

      - Added and refactored a few tests.

    * Translations:

      - Updated the Brazilian Portuguese translation.

      - Added Brazilian Portuguese man page translation.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.10 and 5.4.0
d in CMake-based builds too ("make test"). -- Lasse Collin
[xz-devel] XZ Utils 5.3.5beta
There were technical issues on the tukaani.org website in the past 24 hours. These should be fixed now. Sorry for the inconvenience.

XZ Utils 5.3.5beta is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.3.5beta (2022-12-01)

    * All fixes from 5.2.9.

    * liblzma:

      - Added new LZMA_FILTER_LZMA1EXT for raw encoder and decoder to
        handle raw LZMA1 streams that don't have an end of payload
        marker (EOPM), also known as an end of stream (EOS) marker.
        It can be used in filter chains, for example, with the x86
        BCJ filter.

      - Added lzma_str_to_filters(), lzma_str_from_filters(), and
        lzma_str_list_filters() to make it easier for applications to
        get custom compression options from a user and convert them
        to an array of lzma_filter structures.

      - Added lzma_filters_free().

      - lzma_filters_update() can now be used with the multi-threaded
        encoder (lzma_stream_encoder_mt()) to change the filter chain
        after LZMA_FULL_BARRIER or LZMA_FULL_FLUSH.

      - In lzma_options_lzma, allow nice_len = 2 and 3 with the match
        finders that require at least 3 or 4. It is now rounded up
        internally if needed.

      - The ARM64 filter was modified. It is still experimental.

      - Fixed LTO build with Clang if -fgnuc-version=10 or similar
        was used to make Clang look like GCC >= 10. Now it uses
        __has_attribute(__symver__) which should be reliable.

    * xz:

      - --threads=+1 or -T+1 is now a way to put xz into
        multi-threaded mode while using only one worker thread.

      - In --lzma2=nice=NUMBER, allow 2 and 3 with all match finders
        now that liblzma handles it.

    * Updated translations: Chinese (simplified), Korean, and Turkish.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.9
XZ Utils 5.2.9 is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.2.9 (2022-11-30)

    * liblzma:

      - Fixed an infinite loop in LZMA encoder initialization if
        dict_size >= 2 GiB. (The encoder only supports up to
        1536 MiB.)

      - Fixed two cases of invalid free() that can happen if a tiny
        allocation fails in encoder re-initialization or in
        lzma_filters_update(). These bugs had some similarities with
        the bug fixed in 5.2.7.

      - Fixed lzma_block_encoder() not allowing the use of
        LZMA_SYNC_FLUSH with lzma_code() even though it was
        documented to be supported. The sync-flush code in the Block
        encoder was already used internally via
        lzma_stream_encoder(), so this was just a missing flag in the
        lzma_block_encoder() API function.

      - GNU/Linux only: Don't put symbol versions into static liblzma
        as it breaks things in some cases (and even if it didn't
        break anything, symbol versions in static libraries are
        useless anyway). The downside of the fix is that if the
        configure options --with-pic or --without-pic are used then
        it's not possible to build both shared and static liblzma at
        the same time on GNU/Linux anymore; with those options
        --disable-static or --disable-shared must be used too.

    * New email address for bug reports is which forwards messages to
      Lasse Collin and Jia Tan.

-- Lasse Collin
Re: [xz-devel] [PATCH 1/2] Add support openssl's SHA256 implementation
On 2022-11-30 Lasse Collin wrote:
> Are there other good library options?

If the goal is to use SHA instructions on x86, then intrinsics in the C code with runtime CPU detection are an option too. It's done in crc64_fast.c in 5.3.4alpha already.

-- Lasse Collin
Re: [xz-devel] [PATCH 1/2] Add support openssl's SHA256 implementation
Hello!

This could be good as an optional feature, disabled by default so that the extra dependency doesn't get added accidentally. It's too late for 5.4.0 but perhaps in 5.4.1 or .2.

The biggest problem with the patch is that it lacks error checking:

  - EVP_MD_CTX_new() can return NULL if memory allocation fails. The man page doesn't document this but the source code makes it clear.

  - EVP_get_digestbyname() can return NULL on failure. Perhaps this could be replaced with EVP_sha256()? It seems to return a pointer to a statically-allocated structure and the man page implies that it cannot fail.

  - EVP_DigestInit_ex(), EVP_DigestUpdate(), and EVP_DigestFinal_ex() can in theory fail, perhaps not in practice, I don't know. Currently liblzma assumes that initialization cannot fail, so that would need to be changed. It could be good to check the return values from EVP_DigestUpdate() and EVP_DigestFinal_ex() too. Since it is unlikely that EVP_DigestUpdate() fails, it could perhaps be OK to store the failure code and only return it from lzma_check_finish(), but I'm not sure if that is acceptable.

The configure option perhaps should be --with instead of --enable since it adds a dependency on another package, if one wants to stick to Autoconf's guidelines. (It's less clear if --enable-external-sha256 should be --with since it only affects what to use from the OS base libraries. In any case it won't be changed as it would affect compatibility with build scripts.)

Are there other good library options? For example, Nettle's SHA-256 functions don't need any error checking but I haven't checked the performance.

Is it a mess for distributions if a dependency of liblzma gets its soname bumped and then liblzma needs to be rebuilt without changing its soname? I suppose such things happen all the time, but when a library is needed by a package manager it might perhaps have extra worries.

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 Sebastian Andrzej Siewior wrote:
> 3x to be exact:
> - 1x shared with threads
> - 1x static with threads
> - 1x non-shared, no threads, no encoders, just xzdec.
>
> There are three build folder in the end. The full gets a make install,
> the other get xzdec/liblzma.a extracted.

Thanks! I remember the details now; it's excellent.

I figured out a way to make everything just work in the common case. If --with-pic or --without-pic is used, then building both shared and static liblzma at the same time isn't possible (configure will fail). That is, --with-pic or --without-pic requires that --disable-shared or --disable-static is also used on GNU/Linux.

It's in xz.git now and will be in the next releases (5.2.9 is needed to fix other bugs), so I hope any workarounds can be removed from distros after that. Thanks to Adrian for reporting the bug!

-- Lasse Collin
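A separate-step build like Debian's, under the constraint described above, could look something like this two-pass sketch (options are illustrative, not a literal packaging recipe):

```shell
# Pass 1: shared liblzma only; symbol versions stay enabled as usual.
./configure --disable-static
make

# Pass 2: static liblzma only. Symbol versions are useless in a static
# liblzma.a, so disable them; and with --with-pic/--without-pic in
# play, one of --disable-shared/--disable-static is mandatory anyway.
./configure --disable-shared --disable-symbol-versions --without-pic
make
```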
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 John Paul Adrian Glaubitz wrote:
> Well, Debian builds both the static and dynamic libraries in separate
> steps, so I'm not sure whether the autotools build system would be
> able to detect that.

I would assume the separate steps mean running configure twice, once to disable the static build and once to disable the shared build.

> I would make --enable-static and --enable-symbol-versions mutually
> exclusive so that the configure fails if both are enabled.

I was thinking of a slightly friendlier approach, so that the combination --disable-shared --enable-static would imply --disable-symbol-versions on GNU/Linux (it doesn't matter elsewhere for now). It's good if people never need to use the *-symbol-versions options; the defaults need to be as good as reasonably possible. Using --disable-symbol-versions as a temporary workaround is fine, but if it is needed in the long term then something is broken.

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 John Paul Adrian Glaubitz wrote:
> So, for now, we should build the static library with
> "--disable-symbol-versions".

An ugly workaround in upstream could be to make configure fail on GNU/Linux if both shared and static libs are about to be built, that is, show an error message describing that only one of the two can be built at a time. It would then be mandatory to use either --disable-static or --disable-shared to make configure pass. It's not pretty, but with Autotools I don't see any other way except dropping the RHEL/CentOS 7 compat symbols completely. Static libs shouldn't have symbol versions no matter which arch; somehow it just doesn't always create problems.

Or would it be less bad to default to a shared-only build and require the use of both --disable-shared --enable-static to get a static build? I don't like any of these but I don't have better ideas. Thoughts?

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 John Paul Adrian Glaubitz wrote:
> On 11/23/22 12:31, Lasse Collin wrote:
> > (1) Does this make the problem go away?
>
> Yes, that fixes the linker problem for me. At least in the case of
> mariadb-10.6.

Why does it want static liblzma.a in the first place? It sounds weird to require rebuilding of mariadb-10.6 every time liblzma is updated. Can it build against liblzma.so if liblzma.a isn't available?

It is fine to build *static* liblzma with --disable-symbol-versions on all archs. A Debian-specific workaround is fine in the short term, but this should be fixed upstream. One method would be to disable the extra symbols on ia64, but that is not a real fix. Perhaps it's not really possible as long as the main build system is Autotools; I don't currently know.

I'm still curious why exactly one symbol (lzma_get_progress) looks special in the readelf output. For some reason no other symbols with the symver declarations are there. Does it happen because of something in XZ Utils, or is it weird behavior in the toolchain that creates the static lib?

One can wonder if it was a mistake to try to clean up the issues that started from the RHEL/CentOS 7 patch, since now it has created a new problem. On the other hand, the same could have happened if this kind of symbol versioning had been done to avoid bumping the soname (which hopefully will never happen though).

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 John Paul Adrian Glaubitz wrote:
> I guess the additional unwind section breaks your workaround, so the
> best might be to just disable this workaround on ia64 using the
> configure flag, no?

There currently is no configure option to disable only the CentOS 7 workaround symbols. They are enabled if $host_os matches linux* and --disable-symbol-versions wasn't used. Disabling symbol versions from liblzma.so.5 will cause problems as they have been used since 5.2.0 and many programs and libraries will expect to find XZ_5.0 and XZ_5.2. Having the symbol versions in a static library doesn't make much sense though. Perhaps this is a bug in XZ Utils.

As a test, the static liblzma.a could be built without symbol versions with --disable-shared --disable-symbol-versions:

(1) Does this make the problem go away?

(2) Do the failing builds even require that liblzma.a is present on the system?

I don't know how to avoid symvers in a static library as, to my understanding, GNU Libtool doesn't add any -DBUILDING_SHARED_LIBRARY kind of flag which would allow using an #ifdef to know when to use the symbol versions. Libtool does add -DDLL_EXPORT when building a shared library on Windows but that's not useful here. (Switching to another build system would avoid some other Libtool problems too, like wrong shared library versioning on some OSes. However, the Autotools-based build system is able to produce a usable xz on quite a few less-common systems that some other build systems don't support.)

A workaround to this workaround could be to disable the CentOS 7 symbols on ia64 by default. Adding an explicit configure option is possible too, if needed. But the first step should be to understand what is going on, since the same problem could appear in the future if symbol versions are used for providing compatibility with an actual ABI change (hopefully not needed but still).
> Older versions are available through Debian Snapshots:
>
> http://snapshot.debian.org/package/xz-utils/

liblzma.a in liblzma-dev_5.2.5-2.1_ia64.deb doesn't have any "@XZ" in it, which is expected. This looks normal:

    : [0x18c0-0x1990], info at +0x100

> > Many other functions are listed in those .IA_64.unwind
> > sections too but lzma_get_progress is the only one that has "@XZ"
> > as part of the function name.
>
> Hmm, that definitely seems the problem. Could it be that the symbols
> that are exported on ia64 need some additional naming?

It seems weird why only one symbol is affected. Perhaps it's a bug in the toolchain creating liblzma.a. However, perhaps the main bug is that the XZ Utils build puts symbol versions into a static liblzma. :-(

> I think we can waive for CentOS 7 compatibility on Debian unstable
> ia64 .

There is no official CentOS 7 for ia64, but that isn't the whole story as the broken patch has been used elsewhere too. Not having those extra symbols would still be fine in practice. :-)

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-22 Sebastian Andrzej Siewior wrote:
> This looks like it is staticaly linked against liblzma.

The shared libs in Debian seem to be correct, as you managed to answer right before my email. Thanks! :-)

But the above comment made me look at Debian's liblzma.a. The output of

    readelf -aW usr/lib/ia64-linux-gnu/liblzma.a

includes the following two lines in both 5.2.7 and 5.3.4alpha:

    Unwind section '.IA_64.unwind' at offset 0x2000 contains 15 entries:
    [...]
    : [0x1980-0x1a50], info at +0x108

There are no older versions on the mirror so I didn't check what pre-5.2.7 would have. But .IA_64.unwind is an ia64-specific thing. Many other functions are listed in those .IA_64.unwind sections too, but lzma_get_progress is the only one that has "@XZ" as part of the function name.

I don't understand these details but I wanted to let you know anyway in case it isn't a coincidence that lzma_get_progress appears in a special form in both liblzma.a and in the linker error messages. The error has @@XZ_5.2 (which even 5.2.0 has in shared liblzma.so.5) but here the static lib has @XZ_5.2.2 which exists solely for CentOS 7 compatibility.

lzma_cputhreads doesn't show the same special behavior in the ia64 liblzma.a even though lzma_cputhreads is handled exactly like lzma_get_progress in the liblzma C code and linker script.

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-22 John Paul Adrian Glaubitz wrote:
> Does anyone have a clue why this particular change may have broken
> the linking on ia64?

Thanks for your report. This is important to fix.

What do these commands print? Fix the path to liblzma.so.5 if needed.

    readelf --dyn-syms -W /lib/liblzma.so.5 \
        | grep lzma_get_progress

    readelf --dyn-syms -W /lib/liblzma.so.5 \
        | grep lzma_stream_encoder_mt_memusage

The first should print 2 lines and the second 3 lines. The rightmost columns should be like these:

    FUNC GLOBAL DEFAULT 11 lzma_get_progress@@XZ_5.2
    FUNC GLOBAL DEFAULT 11 lzma_get_progress@XZ_5.2.2
    FUNC GLOBAL DEFAULT 11 lzma_stream_encoder_mt_memusage@@XZ_5.2
    FUNC GLOBAL DEFAULT 11 lzma_stream_encoder_mt_memusage@XZ_5.1.2alpha
    FUNC GLOBAL DEFAULT 11 lzma_stream_encoder_mt_memusage@XZ_5.2.2

Pay close attention to @ vs. @@. The XZ_5.2 entries must be the ones with @@. If you see the same as above then I don't have a clue.

By any chance, was XZ Utils built with GCC older than 10 using link-time optimization (LTO, -flto)? As my commit message describes and NEWS warns, GCC < 10 with LTO will not produce correct results due to the symbol versions. It should work fine with GCC >= 10 or Clang.

For what it is worth, when I wrote the patch I tested it on Slackware 10.1 (32-bit x86) which has GCC 3.3.4, and it worked perfectly there. This symbol version stuff isn't a new thing so it really should work.

-- Lasse Collin
Re: [xz-devel] [PATCH] add xz arm64 bcj filter support
Hello!

On 2021-09-02 Liao Hua wrote:
> +#define LZMA_FILTER_ARM64 LZMA_VLI_C(0x0a)

Is this ID 0x0A in actual use somewhere? Can it be used in the official .xz format for something else than the filter you submitted?

On 2021-09-08 Lasse Collin wrote:
> On 2021-09-02 Liao Hua wrote:
> > We have some questions about xz bcj filters.
> > 1. Why ARM and ARM-Thumb bcj filters are little endian only?
>
> Perhaps it's an error. Long ago when I wrote the docs, I knew that the
> ARM filters worked on little endian code but didn't know how big
> endian ARM was done.

I read about this, and if I have understood correctly, in the past big endian ARM could use big endian instruction encoding too, but nowadays instructions are always in little endian order even if data access is big endian. The endianness in the docs is about instruction encoding; the filters don't care about data access. The mention of endianness has been removed in 5.3.4alpha (and thus 5.4.0) since it is more confusing than useful.

The PowerPC filter is indeed big endian only. Little endian PowerPC would need a new filter. Filtering little endian PowerPC code would give a compression improvement comparable to what the current big endian filter achieves.

> > 2. Why there is no arm64 bcj filter? Are there any technical risks?
> > Or other considerations?
>
> It just hasn't been done, no other reason.

There will probably be a new ARM64 filter in 5.4.0. The exact design is still not frozen. Different parameters work a little better or worse in different situations. It doesn't seem practical to make a tunable filter since few people would try different settings, and it would make the code slower and a little bigger (which matters in XZ Embedded).

With ARM64 it is good to use --lzma2=lc=2,lp=2 instead of the default lc=3,lp=0. This alone can give a little over 1 % smaller file.

-- Lasse Collin
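The lc/lp comparison above can be sketched like this. The input here is just whatever binary is at hand as a stand-in, so don't expect the >1 % saving, which applies to real ARM64 code; the sketch only demonstrates the option syntax.

```shell
# Compare default literal context settings (lc=3,lp=0) against the
# lc=2,lp=2 suggestion for ARM64 code. Stand-in input: the xz binary
# itself (an assumption for illustration, not from the original post).
set -e

f=$(command -v xz)

xz -c --lzma2=preset=6 "$f" > default.xz
xz -c --lzma2=preset=6,lc=2,lp=2 "$f" > tuned.xz

# Compare the compressed sizes (on ARM64 code, tuned.xz tends to win).
ls -l default.xz tuned.xz
```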
[xz-devel] XZ Utils 5.3.4alpha
XZ Utils 5.3.4alpha is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.3.4alpha (2022-11-15)

    * All fixes from 5.2.7 and 5.2.8.

    * liblzma:

      - Minor improvements to the threaded decoder.

      - Added CRC64 implementation that uses SSSE3, SSE4.1, and CLMUL
        instructions on 32/64-bit x86 and E2K. On 32-bit x86 it's not
        enabled unless --disable-assembler is used, but then the
        non-CLMUL code might be slower. Processor support is detected
        at runtime so this is built by default on x86-64 and E2K. On
        these platforms, if compiler flags indicate unconditional
        CLMUL support (-msse4.1 -mpclmul) then the generic version is
        not built, making liblzma 8-9 KiB smaller compared to having
        both versions included.

        With extremely compressible files this can make decompression
        up to twice as fast but with typical files a 5 % improvement
        is a more realistic expectation.

        The CLMUL version is slower than the generic version with
        tiny inputs (especially at 1-8 bytes per call, but up to 16
        bytes). In normal use in xz this doesn't matter at all.

      - Added an experimental ARM64 filter. This is *not* the final
        version! Files created with this experimental version won't
        be supported in the future versions! The filter design is a
        compromise where improving one use case makes some other
        cases worse.

      - Added decompression support for the .lz (lzip) file format
        version 0 and the original unextended version 1. See the API
        docs of lzma_lzip_decoder() for details. Also
        lzma_auto_decoder() supports .lz files.

      - Building with --disable-threads --enable-small is now
        thread-safe if the compiler supports
        __attribute__((__constructor__)).

    * xz:

      - Added support for OpenBSD's pledge(2) as a sandboxing method.

      - Don't mention endianness for ARM and ARM-Thumb filters in
        --long-help. The filters only work for little endian
        instruction encoding, but modern ARM processors using big
        endian data access still use little endian instruction
        encoding, so the help text was misleading.
        In contrast, the PowerPC filter is only for big endian
        32/64-bit PowerPC code. Little endian PowerPC would need a
        separate filter.

      - Added --experimental-arm64. This will be renamed once the
        filter is finished. Files created with this experimental
        filter will not be supported in the future!

      - Added new fields to the output of xz --robot --info-memory.

      - Added decompression support for the .lz (lzip) file format
        version 0 and the original unextended version 1. It is
        autodetected by default. See also the option --format on the
        xz man page.

    * Scripts now support the .lz format using xz.

    * Build systems:

      - New #defines in config.h: HAVE_ENCODER_ARM64,
        HAVE_DECODER_ARM64, HAVE_LZIP_DECODER, HAVE_CPUID_H,
        HAVE_FUNC_ATTRIBUTE_CONSTRUCTOR, HAVE_USABLE_CLMUL

      - New configure options: --disable-clmul-crc,
        --disable-microlzma, --disable-lzip-decoder, and 'pledge' is
        now an option in --enable-sandbox (but it's autodetected by
        default anyway).

      - INSTALL was updated to document the new configure options.

      - PACKAGERS now lists also --disable-microlzma and
        --disable-lzip-decoder as configure options that must not be
        used in builds for non-embedded use.

    * Tests:

      - Fix some of the tests so that they skip instead of fail if
        certain features have been disabled with configure options.
        It's still not perfect.

      - Other improvements to tests.

    * Updated translations: Croatian, Finnish, Hungarian, Polish,
      Romanian, Spanish, Swedish, and Ukrainian.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.8
XZ Utils 5.2.8 is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.2.8 (2022-11-13)

    * xz:

      - If xz cannot remove an input file when it should, this is now
        treated as a warning (exit status 2) instead of an error
        (exit status 1). This matches GNU gzip and is more logical,
        as at that point the output file has already been
        successfully closed.

      - Fix handling of .xz files with an unsupported check type.
        Previously such files printed a warning message but then xz
        behaved as if an error had occurred (didn't decompress, exit
        status 1). Now a warning is printed, decompression is done
        anyway, and exit status is 2. This used to work slightly
        before 5.0.0. In practice this bug matters only if xz has
        been built with some check types disabled. As instructed in
        PACKAGERS, such builds should be done in special situations
        only.

      - Fix "xz -dc --single-stream tests/files/good-0-empty.xz"
        which failed with "Internal error (bug)". That is,
        --single-stream was broken if the first .xz stream in the
        input file didn't contain any uncompressed data.

      - Fix displaying file sizes in the progress indicator when
        working in passthru mode and there are multiple input files.
        Just like "gzip -cdf", "xz -cdf" works like "cat" when the
        input file isn't a supported compressed file format. In this
        case the file size counters weren't reset between files, so
        with multiple input files the progress indicator displayed an
        incorrect (too large) value.

    * liblzma:

      - API docs in lzma/container.h:

        * Update the list of decoder flags in the decoder function
          docs.

        * Explain LZMA_CONCATENATED behavior with .lzma files in
          lzma_auto_decoder() docs.

      - OpenBSD: Use HW_NCPUONLINE to detect the number of available
        hardware threads in lzma_physmem().

      - Fix use of the wrong macro to detect x86 SSE2 support.
        __SSE2_MATH__ was used with GCC/Clang but the correct one is
        __SSE2__. The first one means that SSE2 is used for floating
        point math, which is irrelevant here.
        The affected SSE2 code isn't used on x86-64, so this affects
        only 32-bit x86 builds that use -msse2 without -mfpmath=sse
        (there is no runtime detection for SSE2). It improves LZMA
        compression speed (not decompression).

      - Fix the build with Intel C compiler 2021 (ICC, not ICX) on
        Linux. It defines __GNUC__ to 10 but doesn't support the
        __symver__ attribute introduced in GCC 10.

    * Scripts: Ignore warnings from xz by using --quiet --no-warn.
      This is needed if the input .xz files use an unsupported check
      type.

    * Translations:

      - Updated Croatian and Turkish translations.

      - One new translation wasn't included because it needed
        technical fixes. It will be in the upcoming 5.4.0. No new
        translations will be added to the 5.2.x branch anymore.

      - Renamed the French man page translation file from fr_FR.po
        to fr.po and thus also its install directory (like
        /usr/share/man/fr_FR -> .../fr).

      - Man page translations for the upcoming 5.4.0 are now handled
        in the Translation Project.

    * Updated doc/faq.txt a little so it's less out-of-date.

-- Lasse Collin
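The passthru behavior mentioned in the NEWS extract is easy to verify (assuming a reasonably recent xz in PATH; the file name is just an example):

```shell
# With --decompress --stdout --force (-cdf), xz copies input that
# isn't a supported compressed format as-is, like "gzip -cdf"/"cat".
printf 'not compressed\n' > plain.txt
xz -cdf plain.txt               # prints: not compressed
```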
Re: [xz-devel] XZ Utils 5.3.3alpha
On 2022-09-29 Guillem Jover wrote:
> On Wed, 2022-09-28 at 21:41:59 +0800, Jia Tan wrote:
> > […] The
> > interface for liblzma and xz for the multi threaded decoder does not
> > have any planned changes, so things could probably be developed and
> > tested using 5.3.3.
>
> Ah, thanks, that's reassuring then. It's one of the things I was
> worried about when having to decide whether to merge the patch I've
> got implementing this support into dpkg. So, once the alpha version
> has been packaged for Debian experimental, I'll test the patch and
> commit it.

There are no planned changes but that isn't a *promise* that there won't be any changes before 5.4.0. I don't track API or ABI compatibility within development releases, and thus binaries linked against shared liblzma from one alpha/beta release won't run with liblzma from the next alpha/beta *if* they depend on unstable symbols (symbol versioning stops it). This includes the xz binary itself and would include dpkg too if it uses the threaded decoder. Sometimes it can be worked around with distro-specific patches but that's extra hassle and can go wrong too.

Please don't end up with a result similar to what happened with RHEL/CentOS 7, which ended up affecting users of other distributions too (the fix is included in 5.2.7):

https://git.tukaani.org/?p=xz.git;a=commitdiff;h=913ddc5572b9455fa0cf299be2e35c708840e922

So while I encourage testing, one needs to be careful when it can affect critical tools in the operating system. :-)

-- Lasse Collin
Re: [xz-devel] XZ Utils 5.3.3alpha
On 2022-09-28 Jia Tan wrote:
> On 2022-09-27 Sebastian Andrzej Siewior wrote:
> > Okay, so that is what you are tracking. I remember that there was a
> > stall in the decoding but I don't remember how it played out.
> >
> > I do remember that I had something for memory allocation/ limit but
> > I don't remember if we settled on something or if discussion is
> > needed. Also how many decoding threads make sense, etc.
>
> We ended up changing xz to use (total_ram / 4) as the default "soft
> limit". If the soft limit is reached, xz will decode single threaded.
> The "hard limit" shares the same environment variable and xz option
> (--memlimit-decompress).

There is also the 1400 MiB cap for 32-bit executables.

The memory limiting in threaded decompression (two separate limits in parallel) is one thing where feedback would be important: after the liblzma API, ABI, and xz tool syntax are in a stable release, backward compatibility has to be maintained. Another thing needing feedback is the new behavior of -T0 when no memlimit has been specified. Now it has a default soft limit. I hope it is an improvement but quite possibly it could be improved further. Your suggestion to use MemAvailable on Linux is one thing that could be included if people think it is a good way to go as a Linux-specific behavior (having more benefits than downsides).

These are documented on the xz man page. I hope it is clear enough. It feels a bit complicated, which is a bad sign, but on the other hand I feel the underlying problem isn't as trivial as it seems on the surface. So far Jia Tan and I have received no feedback about these things at all. I would prefer to hear the complaints before 5.4.0 is out. :-)

> > This reminds me that I once posted a patch to use openssl for the
> > sha256.
> > https://www.mail-archive.com/xz-devel@tukaani.org/msg00429.html
> >
> > Some distro is using sha256 instead crc64 by default, I don't
> > remember which one… Not that I care personally ;)
>
> I am unsure if we will have time to include your sha256 patch, but if
> we finish all the tasks with extra time it may be considered.

There's more to this than available time. 5.1.2alpha added support for using SHA-256 from the OS base libraries (not OpenSSL) but starting with 5.2.3 it is disabled by default. Some OS libs use (or used to use) the same symbol names for SHA-256 functions as OpenSSL while having an incompatible ABI. This led to weird problems when an application needed both liblzma and OpenSSL, as liblzma ended up calling OpenSSL functions. Plus, some of the OS-specific implementations were slower than the C code in liblzma (OpenSSL would be faster).

OpenSSL's license has compatibility questions with the GNU GPL. If I remember correctly, some distributions consider OpenSSL to be part of the core operating system and thus avoid the compatibility problem with the GPL. I'm not up to date on how distros handle it in 2022, but perhaps it should be taken into account so that apps depending on liblzma won't get legally unacceptable OpenSSL linkage. So if OpenSSL support is added, it likely should be disabled by default in configure.ac.

> > > This is everything currently planned.

Translations need to be updated too once the strings and man pages are close to final. A development release needs to be sent to the Translation Project at some point. If people want to translate the man pages too, they will need quite a bit of time.

-- Lasse Collin
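The memory-limit behavior discussed above can be exercised from the command line. The limit values below are arbitrary illustrations, not recommendations; the separate soft-limit option (--memlimit-mt-decompress in the 5.4 series) is mentioned only in a comment since it doesn't exist in older stable releases.

```shell
# Exercise the decompression memory limits discussed above
# (arbitrary example values).
set -e

seq 1 100000 > data.tmp
xz -f -k data.tmp               # produces data.tmp.xz

# Hard limit: if decompression would need more memory than this,
# xz refuses (here the limit is generous, so the test passes).
xz -t --memlimit-decompress=200MiB data.tmp.xz

# With -T0 and the default soft limit, threaded decompression falls
# back to single-threaded mode instead of failing when memory is
# short. (The soft limit has its own option, --memlimit-mt-decompress,
# in the 5.4 series.)
xz -t -T0 data.tmp.xz
```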
[xz-devel] XZ Utils 5.2.7
XZ Utils 5.2.7 is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.2.7 (2022-09-30)

* liblzma:

    - Made lzma_filters_copy() never modify the destination array if
      an error occurs. lzma_stream_encoder() and
      lzma_stream_encoder_mt() already assumed this. Before this
      change, if a tiny memory allocation in lzma_filters_copy()
      failed, it would lead to a crash (invalid free() or invalid
      memory reads) in the cleanup paths of these two encoder
      initialization functions.

    - Added a missing integer overflow check to lzma_index_append().
      This affects xz --list and other applications that decode the
      Index field from .xz files using lzma_index_decoder(). Normal
      decompression of .xz files doesn't call this code and thus most
      applications using liblzma aren't affected by this bug.

    - Single-threaded .xz decoder (lzma_stream_decoder()): If
      lzma_code() returns LZMA_MEMLIMIT_ERROR, it is now possible to
      use lzma_memlimit_set() to increase the limit and continue
      decoding. This was supposed to work from the beginning but
      there was a bug. With other decoders (.lzma or the threaded .xz
      decoder) this already worked correctly.

    - Fixed accumulation of integrity check type statistics in
      lzma_index_cat(). This bug made lzma_index_checks() return only
      the type of the integrity check of the last Stream when
      multiple lzma_indexes were concatenated. Most applications
      don't use these APIs but in xz it made xz --list not list all
      check types from concatenated .xz files. In xz --list --verbose
      only the per-file "Check:" lines were affected and in
      xz --robot --list only the "file" line was affected.

    - Added ABI compatibility with executables that were linked
      against liblzma in RHEL/CentOS 7 or other liblzma builds that
      had copied the problematic patch from RHEL/CentOS 7
      (xz-5.2.2-compat-libs.patch). For the details, see the comment
      at the top of src/liblzma/validate_map.sh.

      WARNING: This uses the __symver__ attribute with GCC >= 10.
      In other cases the traditional __asm__(".symver ...") is used.
      Using link-time optimization (LTO, -flto) with GCC versions
      older than 10 can silently result in a broken liblzma.so.5
      (incorrect symbol versions)! If you want to use -flto with GCC,
      you must use GCC >= 10. LTO with Clang seems to work even with
      the traditional __asm__(".symver ...") method.

* xzgrep: Fixed compatibility with old shells that break if comments
  inside command substitutions have apostrophes ('). This problem was
  introduced in 5.2.6.

* Build systems:

    - New #define in config.h: HAVE_SYMBOL_VERSIONS_LINUX

    - Windows: Fixed the liblzma.dll build with Visual Studio project
      files. It broke in 5.2.6 due to a change that was made to
      improve CMake support.

    - Windows: Building liblzma with UNICODE defined should now work.

    - CMake files are now actually included in the release tarball.
      They should have been in 5.2.5 already.

    - Minor CMake fixes and improvements.

* Added a new translation: Turkish

-- Lasse Collin
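The lzma_memlimit_set() fix above concerns liblzma's C API. Python's stdlib lzma module (which wraps liblzma) has no way to raise the limit on a live decoder, but it can at least demonstrate the decoder memory limit tripping and a retry succeeding with a larger limit:

```python
import lzma

# xz preset 9 uses a 64 MiB LZMA2 dictionary, so decoding this stream
# needs far more than 1 MiB of memory.
data = lzma.compress(b"hello" * 10000, preset=9)

try:
    lzma.LZMADecompressor(memlimit=1 << 20).decompress(data)
    hit_limit = False
except lzma.LZMAError:
    hit_limit = True  # the 1 MiB memory limit was exceeded

# Retrying with a generous limit decodes normally.
restored = lzma.LZMADecompressor(memlimit=1 << 30).decompress(data)
```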
[xz-devel] XZ Utils 5.3.3alpha
age.sh to create a code coverage report of the tests.

* Build systems:

    - Automake's parallel test harness is now used to make tests
      finish faster.

    - Added the CMake files to the distribution tarball. These were
      supposed to be in 5.2.5 already.

    - Added liblzma tests to the CMake build.

    - Windows: Fix building of liblzma.dll with the included Visual
      Studio project files.

-- Lasse Collin
Re: [xz-devel] VS projects fail to build the resource file
On 2022-08-18 Olivier B. wrote:
> The cmake windows build in a 5.2.6 git clone seem to build and install
> fine for me!

Good to know, thanks!

> As a small improvement to them, I wouldn't mind if the pdbs were
> installed too in the configurations where they are generated (and
> actually also in release builds)

I see .pdb files are for debug symbols and I see CMake has some properties related to them but I don't know much more. Are the .pdb files generated by default by the CMake-generated debug targets but not by the release targets?

Does the following do something good?

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 2a88af3..ccfb217 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -499,6 +499,14 @@ install(DIRECTORY src/liblzma/api/
         DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}"
         FILES_MATCHING PATTERN "*.h")
 
+if(MSVC)
+    # Install MSVC debug symbol file if it was generated.
+    install(FILES $<TARGET_PDB_FILE:liblzma>
+            DESTINATION "${CMAKE_INSTALL_BINDIR}"
+            COMPONENT liblzma_Development
+            OPTIONAL)
+endif()
+
 # Install the CMake files that other packages can use to find liblzma.
 set(liblzma_INSTALL_CMAKEDIR
     "${CMAKE_INSTALL_LIBDIR}/cmake/liblzma"

I understood that the above can only work for DLLs. A static library would need compiler-generated debug info, which CMake supports via the COMPILE_PDB_NAME property. If .pdb files aren't created for release builds by default, there likely is a way to enable it. I cannot test MSVC builds now so I won't make many blind guesses.

-- Lasse Collin
Re: [xz-devel] VS projects fail to build the resource file
On 2022-08-18 Olivier B. wrote:
> Yes, indeed. I sent the mail after having only fixed one
> configuration, but the full solution build needs the six modifications

OK, thanks! I committed it to the vs2013, vs2017, and vs2019 files, also to the v5.2 branch.

> Is it normal that CMakeLists and other files are not in the 5.2.6 (or
> 5.3.2) tarball, only in the git?

That's not intentional. It seems that I had forgotten to add those to Automake's dist target. 5.2.5 was supposed to have the experimental CMake files already, as they were mentioned in the NEWS file. It has been fixed, also in the v5.2 branch. Thanks!

-- Lasse Collin
Re: [xz-devel] VS projects fail to build the resource file
On 2022-08-18 Olivier B. wrote:
> I am trying to build 5.2.6 on windows, but, presumably after
> 352ba2d69af2136bc814aa1df1a132559d445616, the build using the MSVC 2013
> project file fails.

Thanks! So the fix for one thing broke another situation. :-( I cannot test but it seems the same addition is needed in six places, not just in the "Debug|Win32" case, right?

diff --git a/windows/vs2013/liblzma_dll.vcxproj b/windows/vs2013/liblzma_dll.vcxproj
index 2bf3e41..f24cd6f 100644
--- a/windows/vs2013/liblzma_dll.vcxproj
+++ b/windows/vs2013/liblzma_dll.vcxproj
@@ -137,6 +137,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -154,6 +155,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -173,6 +175,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -191,6 +194,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -210,6 +214,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -228,6 +233,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H

I will commit the above to all VS project files if you think it's good.

Does it work with CMake for you? I'm hoping that the VS project files can be removed in the near future and CMake used for building with VS. That way there are fewer build files to maintain.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.6
XZ Utils 5.2.6 is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.2.6 (2022-08-12)

* xz:

    - The --keep option now accepts symlinks, hardlinks, and setuid,
      setgid, and sticky files. Previously this required using
      --force.

    - When copying metadata from the source file to the destination
      file, don't try to set the group (GID) if it is already set
      correctly. This avoids a failure on OpenBSD (and possibly on a
      few other OSes) where files may get created so that their group
      doesn't belong to the user, and fchown(2) can fail even if it
      needs to do nothing.

    - Cap --memlimit-compress to 2000 MiB instead of 4020 MiB on
      MIPS32 because on MIPS32 userspace processes are limited to
      2 GiB of address space.

* liblzma:

    - Fixed a missing error check in the threaded encoder. If a small
      memory allocation failed, a .xz file with an invalid Index
      field would be created. Decompressing such a file would produce
      the correct output but result in an error at the end. Thus this
      is a "mild" data corruption bug. Note that while a failed
      memory allocation can trigger the bug, it cannot cause invalid
      memory access.

    - The decoder for .lzma files now supports files that have the
      uncompressed size stored in the header and still use the end of
      payload marker (end of stream marker) at the end of the LZMA
      stream. Such files are rare but, according to the documentation
      in LZMA SDK, they are valid. doc/lzma-file-format.txt was
      updated too.

    - Improved 32-bit x86 assembly files:
        * Support Intel Control-flow Enforcement Technology (CET)
        * Use a non-executable stack on FreeBSD.

    - Visual Studio: Use the non-standard _MSVC_LANG to detect the
      C++ standard version in the lzma.h API header. It's used to
      detect when "noexcept" can be used.

* xzgrep:

    - Fixed arbitrary command injection via a malicious filename
      (CVE-2022-1271, ZDI-CAN-16587). A standalone patch for this was
      released to the public on 2022-04-07. A slight robustness
      improvement has been made since then and, if using GNU or *BSD
      grep, a new faster method is now used that doesn't use the old
      sed-based construct at all. This also fixes bad output with GNU
      grep >= 3.5 (2020-09-27) when xzgrepping binary files.

      This vulnerability was discovered by:
      cleemy desu wayo working with Trend Micro Zero Day Initiative

    - Fixed detection of corrupt .bz2 files.

    - Improved error handling to fix the exit status in some
      situations and to fix the handling of signals: in some
      situations a signal didn't make xzgrep exit when it clearly
      should have. It's possible that the signal handling still isn't
      quite perfect but hopefully it's good enough.

    - Documented exit statuses on the man page.

    - xzegrep and xzfgrep now use "grep -E" and "grep -F" instead of
      the deprecated egrep and fgrep commands.

    - Fixed parsing of the options -E, -F, -G, -P, and -X. The
      problem occurred when multiple options were specified in a
      single argument, for example,

          echo foo | xzgrep -Fe foo

      treated foo as a filename because -Fe wasn't correctly split
      into -F -e.

    - Added zstd support.

* xzdiff/xzcmp:

    - Fixed a wrong exit status. The exit status could be 2 when the
      correct value is 1.

    - Documented on the man page that an exit status of 2 is used for
      decompression errors.

    - Added zstd support.

* xzless:

    - Fix less(1) version detection. It failed if the version number
      from "less -V" contained a dot.

* Translations:

    - Added new translations: Catalan, Croatian, Esperanto, Korean,
      Portuguese, Romanian, Serbian, Spanish, Swedish, and Ukrainian

    - Updated the Brazilian Portuguese translation.

    - Added a French man page translation. This and the existing
      German translation aren't complete anymore because the English
      man pages got a few updates and the translators weren't reached
      so that they could update their work.

* Build systems:

    - Windows: Fix building of resource files when config.h isn't
      used. CMake + Visual Studio can now build liblzma.dll.

    - Various fixes to the CMake support. Building static or shared
      liblzma should work fine in most cases. In contrast, building
      the command line tools with CMake is still clearly incomplete
      and experimental and should be used for testing only.

-- Lasse Collin
Re: [xz-devel] [PATCH] LZMA_FINISH will now trigger LZMA_BUF_ERROR on truncated xz files right away
On 2022-04-21 Jia Tan wrote:
> The current behavior of LZMA_FINISH in the decoder is a little
> confusing because it requires calling lzma_code a few times without
> providing more input to trigger a LZMA_BUF_ERROR.

The current behavior basically ignores the use of LZMA_FINISH when determining if LZMA_BUF_ERROR should be returned. I understand that it can be confusing since after LZMA_FINISH there is nothing a new call to lzma_code() can do to avoid the problem. However, I don't think it's a problem in practice:

- An application that calls lzma_code() in a loop will just call lzma_code() again and eventually get LZMA_BUF_ERROR.

- An application that does single-shot decoding without a loop tends to check for LZMA_STREAM_END as a success condition and treats other codes, including LZMA_OK, as a problem. In the worst case, a less robust application could break if this LZMA_OK becomes LZMA_BUF_ERROR, as the existing API doc says that LZMA_BUF_ERROR won't be returned immediately. The docs don't give any indication that LZMA_FINISH could affect this behavior.

- An extra call or two to lzma_code() in an error condition doesn't matter in terms of performance.

> This patch replaces return LZMA_OK lines with:
>
> return action == LZMA_FINISH && *out_pos != out_size ? LZMA_BUF_ERROR
> : LZMA_OK;

I don't like replacing a short statement with a copy-pasted long statement since it is needed in so many places. A benefit of the current approach is that the handling of LZMA_BUF_ERROR is in lzma_code() and (most of the time) the rest of the code can ignore the problem completely.

Also, the condition *out_pos != out_size is confusing in a few places. For example, in SEQ_STREAM_HEADER:

--- a/src/liblzma/common/stream_decoder.c
+++ b/src/liblzma/common/stream_decoder.c
@@ -118,7 +118,8 @@ stream_decode(void *coder_ptr, const lzma_allocator *allocator,
 		// Return if we didn't get the whole Stream Header yet.
 		if (coder->pos < LZMA_STREAM_HEADER_SIZE)
-			return LZMA_OK;
+			return action == LZMA_FINISH && *out_pos != out_size
+					? LZMA_BUF_ERROR : LZMA_OK;
 
 		coder->pos = 0;

In SEQ_STREAM_HEADER no output can be produced, only input will be read. Still the condition checks for a full output buffer, which is not only confusing but wrong: if there was an empty Stream ahead, having no output space would be fine! In such a situation this can return LZMA_OK even when the intention was to return LZMA_BUF_ERROR due to truncated input. To make this work, only places that can produce output should check if the output buffer is full.

However, I don't think the current behavior is worth changing. As you pointed out, it is a bit weird (and I had never noticed it myself before you mentioned it). It's not actually broken though, and some applications doing single-shot decoding might even rely on the current behavior. Trying to change this could cause problems in rare cases and, if not done carefully enough, introduce new bugs. So I thank you for the patch but it won't be included.

-- Lasse Collin
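The loop behavior being debated can be observed through any liblzma wrapper. As a hedged illustration using Python's stdlib lzma module (not the C API the patch targets): a truncated .xz stream never reports end-of-stream, which is the application-visible symptom behind the eventual LZMA_BUF_ERROR, and a single-shot caller that checks for end-of-stream as its success condition still detects the truncation.

```python
import lzma

data = lzma.compress(b"abc" * 1000)

d = lzma.LZMADecompressor()
d.decompress(data[:-8])   # drop part of the stream footer

# End of stream was never reached, so the decoder is still waiting
# for more input; this mirrors checking for LZMA_STREAM_END in C.
truncated_detected = not d.eof and d.needs_input
```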
[xz-devel] Man page translations for XZ Utils 5.2.6
Hello! A bugfix release will be made around mid-August 2022. The German and French translations of the man pages need updating. There are a few small changes to the factual content but there are also style changes which increase the number of strings that have been modified. A pre-release snapshot from the v5.2 branch is available here:

    https://tukaani.org/xz/xz-5.2.5-85-g275de.tar.xz

A tiny thing: I changed the po4a --copyright-holder argument to "[See the headers in the input files.]" since the three small man pages inherited from GNU gzip are GNU GPLv2+. It affects the comment that gets put on top of xz-man.pot.

The strings in the command line tools haven't changed since 5.2.5 or even 5.2.4, apart from one string being removed completely. Jia Tan fixed all white-space bugs in the pending translations so 5.2.6 will have many new translations. :-)

With 5.2.6 I will also finally release 5.3.3alpha. The development branch has some of the difficult strings split into separate strings for easier translation. I suppose 5.3.3alpha or a later snapshot could be sent to the Translation Project somewhat soon, and perhaps creation of the xz-man domain could be reconsidered at the same time, since I got new feedback wishing for xz-man in the TP. Clearly there are people who wish to translate the man pages. :-)

I won't be at my computer for about two weeks so I won't be able to reply to emails before that. Thanks!

-- Lasse Collin
Re: [xz-devel] Question about using Java API for geospatial data
On 2022-07-09 Gary Lucas wrote:
> I am using the library to compress a public-domain data product called
> ETOPO1. ETOPO1 provides a global-scale grid of 233 million elevation
> and ocean depth samples as integer meters. My implementation
> compresses the data in separate blocks of about 20 thousand values
> each.

So that is about 12 thousand blocks?

> Previously, I used Huffman coding and Deflate to reduce the size
> of the data to about 4.39 bits per value. With your library, LZMA
> reduces that to 4.14 bits per value and XZ to 4.16.

Is the compressed size of each block about ten kilobytes?

> The original implementation requires an average of 4.8 seconds to
> decompress the full set of 233 million points. The LZMA version
> requires 15.2 seconds, and the XZ version requires 18.9 seconds.

The Deflate implementation in java.util.zip uses zlib (native code). XZ for Java is pure Java. LZMA is significantly slower than Deflate, and being pure Java makes the difference even bigger.

> My understanding is that XZ should perform better than LZMA. Since
> that is not the case, could there be something suboptimal with the way
> my code uses the API?

The core compression code is the same in both: XZ uses LZMA2, which is LZMA with framing. XZ adds a few features like filters, integrity checking, and block-based random access reading.

> And here are the Code Snippets:

The XZ examples don't use XZ for Java directly. This is clear due to the "Xz" vs. "XZ" difference in the class names and because XZOutputStream has no constructor that takes the input size as an argument.

Non-performance notes:

- The section "When uncompressed size is known beforehand" in XZInputStream is worth reading. Basically, add a check that "xzIn.read() == -1" is true at the end to verify the integrity check. This at least used to be true (I haven't tested recently) for GZIPInputStream too.

- When compressing, .finish() is redundant. .close() will do it anyway.

- If XZ data is embedded inside another file format, you may want to use SingleXZInputStream instead of XZInputStream. XZInputStream supports concatenated streams, which are possible in standalone .xz files but probably shouldn't occur when embedded inside another format. In your case this likely makes no difference in practice.

Might affect performance:

- The default LZMA2 dictionary size is 8 MiB. If the uncompressed size is known to be much smaller than this, it's a waste of memory to use so big a dictionary. In that case pick a value that is at least as big as the largest uncompressed size, possibly rounded up to a power of two.

- Compressing or decompressing multiple streams that use identical settings means creating many compressor or decompressor instances. To reduce garbage collector pressure there is ArrayCache, which reuses large array allocations. You can enable this globally with:

      ArrayCache.setDefaultCache(BasicArrayCache.getInstance());

  However, setting the default like this might not be desired if multiple unrelated things in the application might use XZ for Java. Note that ArrayCache can help both the LZMA and XZ classes.

Likely will affect performance:

- Since the compression ratio is high, integrity checking starts to become more significant for performance. To test how much integrity checking slows XZ down, use the SingleXZInputStream or XZInputStream constructor that takes "boolean verifyCheck" and set it to false. You can also compress to XZ without integrity checking at all (using XZ.CHECK_NONE as the third argument in the XZOutputStream constructor). Using XZ.CHECK_CRC32 is likely much faster than the default XZ.CHECK_CRC64 because CRC32 comes from java.util.zip, which uses native code from zlib.

It's quite possible that XZ provides no value over raw LZMA in this application, especially if you don't need integrity checking. Raw LZMA instead of .lzma will even avoid the 13-byte .lzma header, saving 150 kilobytes with 12 thousand blocks. If the uncompressed size is stored in the container headers, then a further 4-5 bytes per block can be saved by telling the size to the raw LZMA encoder and decoder. Note that LZMAOutputStream and LZMAInputStream support both .lzma and raw LZMA: the choice between them is made by picking the right constructors.

Finally, it might be worth playing with the lc/lp/pb parameters in LZMA/LZMA2. Usually those make only a tiny difference but with some data types they have a bigger effect. These won't affect performance other than that the smaller the compressed file, the faster it tends to decompress in the case of LZMA/LZMA2.

Other compressors might be worth trying too. Zstandard typically compresses only slightly worse than XZ/LZMA but it is *a lot* faster to decompress.

-- Lasse Collin
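Most of the knobs above have equivalents in any liblzma wrapper. As an illustration (using Python's stdlib lzma module rather than XZ for Java, so the class names differ): a smaller dictionary, a cheaper CRC32 check, and a raw LZMA1 stream that drops the 13-byte .lzma container header:

```python
import lzma

data = b"elevation sample " * 2000

# Smaller dictionary than the 8 MiB default, CRC32 instead of CRC64:
xz_small = lzma.compress(
    data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC32,
    filters=[{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 16}])

# .lzma container vs. raw LZMA1 with identical settings; the raw
# stream has no header, so the decoder must get the same filter
# settings out of band.
lzma1 = [{"id": lzma.FILTER_LZMA1, "preset": 6, "dict_size": 1 << 16}]
alone = lzma.compress(data, format=lzma.FORMAT_ALONE, filters=lzma1)
raw = lzma.compress(data, format=lzma.FORMAT_RAW, filters=lzma1)

restored = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=lzma1)
```

The lc/lp/pb parameters mentioned at the end can be set the same way, as "lc", "lp", and "pb" keys in the filter specification.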
Re: [xz-devel] XZ for Java
On 2022-06-21 Dennis Ens wrote:
> Why not pass on maintainership for XZ for C so you can give XZ for
> Java more attention? Or pass on XZ for Java to someone else to focus
> on XZ for C? Trying to maintain both means that neither are
> maintained well.

Finding a co-maintainer or passing the projects completely to someone else has been on my mind for a long time but it's not a trivial thing to do. For example, someone would need to have the skills, time, and enough long-term interest specifically for this. There are many other projects needing more maintainers too.

As I have hinted in earlier emails, Jia Tan may have a bigger role in the project in the future. He has been helping a lot off-list and is practically a co-maintainer already. :-) I know that not much has happened in the git repository yet but things happen in small steps. In any case some change in maintainership is already in progress at least for XZ Utils.

-- Lasse Collin
Re: [xz-devel] XZ for Java
On 2022-06-07 Jigar Kumar wrote:
> Progress will not happen until there is new maintainer. XZ for C has
> sparse commit log too. Dennis you are better off waiting until new
> maintainer happens or fork yourself. Submitting patches here has no
> purpose these days. The current maintainer lost interest or doesn't
> care to maintain anymore. It is sad to see for a repo like this.

I haven't lost interest but my ability to care has been fairly limited, mostly due to long-term mental health issues but also due to some other things. Recently I've worked off-list a bit with Jia Tan on XZ Utils and perhaps he will have a bigger role in the future, we'll see. It's also good to keep in mind that this is an unpaid hobby project.

Anyway, I assure you that I know far too well about the problem that not much progress has been made. The thought of finding new maintainers has existed for a long time too, as the current situation is obviously bad and sad for the project.

A new XZ Utils stable branch should get released this year with the threaded decoder etc. and a few alpha/beta releases before that. Perhaps the moment after the 5.4.0 release would be a convenient moment to make changes in the list of project maintainer(s).

Forks are obviously another possibility and I cannot control that. If those happen, I hope that file format changes are done so that no silly problems occur (like using the same ID for different things in two projects). 7-Zip supports .xz, and keeping its developer Igor Pavlov informed about format changes (including new filters) is important too.

-- Lasse Collin
Re: [xz-devel] XZ for Java
On 2022-05-19 Dennis Ens wrote:
> Is XZ for Java still maintained?

Yes, by some definition at least, like if someone reports a bug it will get fixed. Development of new features definitely isn't very active. :-(

> I asked a question here a week ago and have not heard back.

I saw. I have lots of unanswered emails at the moment and obviously that isn't a good thing. After the latest XZ for Java release I've tried to focus on XZ Utils (and ignored XZ for Java), although obviously that hasn't worked so well either, even if some progress has happened with XZ Utils.

> When I view the git log I can see it has not updated in over a year.
> I am looking for things like multithreaded encoding / decoding and a
> few updates that Brett Okken had submitted (but are still waiting for
> merge). Should I add these things to only my local version, or is
> there a plan for these things in the future?

I haven't reviewed Brett Okken's patches so I cannot give definite answers about whether you should include them in your local version, sorry.

The match finder optimizations are more advanced as they are somewhat arch-specific, so it could be good to have broader testing of how much they help on different systems (not just x86-64 but 32-bit x86, ARM64, ...) and whether they behave well on Android too. The benefits have to be clear enough (and cause no problems) to make the extra code worth it.

The Delta coder patch is small and the relative improvement is big, so that likely should get included. The Delta filter is used rarely though, and even a slow version isn't *that* slow in the big picture (there will also be LZMA2 and CRC32/CRC64).

Threading would be nice in the Java version. Threaded decompression only recently got committed to the XZ Utils repository.

Jia Tan has helped me off-list with XZ Utils and he might have a bigger role in the future, at least with XZ Utils. It's clear that my resources are too limited (thus the many emails waiting for replies) so something has to change in the long term.
-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Fix setting memory limit on 32-bit systems
On 2021-01-20 Sebastian Andrzej Siewior wrote:
> On 2021-01-18 23:52:50 [+0200], Lasse Collin wrote:
> > I have understood that *in practice* the problem with the xz command
> > line tool is limited to "xz -T0" usage so fixing this use case is
> > enough for most people. Please correct me if I missed something.
>
> Correct.

There is some code for special behavior with -T0 now for both compression and decompression. I haven't updated the man page yet but the commit messages should be helpful. I hope it can be documented so that it sounds simple enough. :-)

> In the parallel decompress I added code on Linux to query the
> available memory. I would prefer that as an upper limit on 64bit if no
> limit is given. The reason is that *this* amount of memory is safe to
> use without over-committing / involving swap.

This may be the way to go on Linux but I didn't add it yet. The committed code uses total_ram / 4. Since MemAvailable is Linux-specific, something more broadly available needs to exist for better portability, and total_ram / 4 could perhaps be it. It can be tweaked if needed; it's just a starting point.

> For 32bit applications I would cap that limit to 2.5 GiB or so. The
> reason is that the *normal* case is to run 32bit application on a
> 32bit kernel and so likely only 3GiB can be addressed at most (minus
> a few details like linked in libs, NULL page, guard pages and so on).
> The 32bit application on 64bit kernel is probably a shortcut where
> something is done a 32bit chroot - like building a package.
>
> I'm not sure what a sane upper limit is on other OSes. Limitting it on
> 32bit does probably more good than bad if there is no -M parameter.

I think a generic cap needs to be below 2 GiB because, for example, 32-bit MIPS can do only 2 GiB. There could be OS+arch-specific exceptions though. The code currently in xz.git uses 1400 MiB. There needs to be some extra room in case repeated mallocs and frees fragment the address space a little. Perhaps it's too conservative, but it allows eight compression threads at the default xz -6, and one thread at -9 in threaded mode (so it can create a file that can be decompressed in threaded mode).

> > An alternative "fix" for the liblzma case could be adding a simple
> > API function that would scale down the number of threads in a
> > lzma_mt structure based on a memory usage limit and if the
> > application is 32 bits. Currently the thread count and LZMA2
> > settings adjusting code is in xz, not in liblzma.
>
> It might help. dpkg checks the memlimit with
> lzma_stream_encoder_mt_memusage() and decreases the memory limit until
> it fits. It looks simpler compared to rpm's attempt and various
> exceptions.

Now that the lzma_mt structure contains memlimit_threading already, a flag could be added to use it to reduce the number of threads at encoder initialization. I suppose reducing the thread count would go a long way. It doesn't affect the compressed output so it can be done when people wish for reproducible output.

> > The idea for the current 4020 MiB special limit is based on a patch
> > that was in use in FreeBSD to solve the problem of 32-bit xz on
> > 64-bit kernel. So at least FreeBSD should be supported to not make
> > 32-bit xz worse under 64-bit FreeBSD kernel.
>
> Is this a common case?

I don't *know* but I guess some build 32-bit packages on a 64-bit kernel, so it may be a common enough use case.

> While poking around, Linux has this personality() syscall/function.
> There is a flag called PER_LINUX32_3GB and PER_LINUX_32BIT which are
> set if the command is invoked with `linux32' say
>
>     linux32 xz
>
> then it would set that flag set and could act. It is not set by
> starting a 32bit application on a 64bit kernel on its own or on a
> 32bit kernel. I don't know if this is common practise but I use this
> in my chroots. So commands like `uname -m' return `i686' instead of
> `x86_64'. If other chroot environments do it as well then it could be
> used as a hack to assume that it is run on 64bit kernel. That is if
> we want that ofcourse :)

I haven't looked at this but it sounds like it could be useful. If xz knows that it has 4 GiB of address space, the default limit could be much higher.

-- Lasse Collin
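The dpkg-style adjustment mentioned above can be sketched like this; per_thread_usage stands in for what lzma_stream_encoder_mt_memusage() would report for one thread, and the MiB figures in the assertions are illustrative rather than exact xz numbers.

```python
MIB = 1 << 20

def threads_for_limit(requested_threads, per_thread_usage, memlimit):
    # Drop the thread count until the estimated encoder memory usage
    # fits under the limit; never go below one thread.
    threads = requested_threads
    while threads > 1 and threads * per_thread_usage > memlimit:
        threads -= 1
    return threads
```

Reducing the thread count this way doesn't change the compressed output, which is why it's compatible with reproducible output, as noted above.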
[xz-devel] xzgrep security fix for XZ Utils <= 5.2.5, 5.3.2alpha (ZDI-CAN-16587)
Malicious filenames can make xzgrep write to arbitrary files or (with a GNU sed extension) lead to arbitrary code execution. xzgrep from XZ Utils versions up to and including 5.2.5 is affected. 5.3.1alpha and 5.3.2alpha are affected as well. This patch works for all of them. This bug was inherited from gzip's zgrep. gzip 1.12 includes a fix for zgrep.

This vulnerability was discovered by:
cleemy desu wayo working with Trend Micro Zero Day Initiative

The patch and signature are available here:

    https://tukaani.org/xz/xzgrep-ZDI-CAN-16587.patch
    https://tukaani.org/xz/xzgrep-ZDI-CAN-16587.patch.sig

It is also linked from the XZ Utils home page <https://tukaani.org/xz/>.

-- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
On 2022-03-17 Jia Tan wrote:
> I attached two patches to this message. The first should fix a bug
> with the timeouts.

Thanks! This and the deadlock are now fixed (I committed them a few days ago).

> The second patch is for the memlimit_threading update. I added a new
> API function that will fail for anything that is not the multithreaded
> decoder.

I need to consider this a little later.

Some of the things I will do next (some already have a patch on this list):

- Add a fail-fast flag to lzma_stream_decoder_mt().

- Possibly fix a corner case in the threaded coder if lzma_code() is called in a similar way as in zpipe.c in <https://zlib.net/zlib_how.html>. That is, currently it doesn't work but it can be made to work, I think. Supporting it makes the threaded decoder a little easier to adapt to existing apps if they use that kind of decoding loop.

- --memlimit-threading: I wrote this weeks ago except for a few details that need to be decided. For example, I guess -M should set --memlimit-threading just like it sets --memlimit-compress and --memlimit-decompress.

- An initial version of the automatic memlimit with --threads=0. The first version can be based on lzma_physmem() but other methods can be added. Sebastian's patch uses MemAvailable on Linux; your patch uses freemem from sysinfo(), which equals MemFree in /proc/meminfo. I suppose MemAvailable is a better starting point.

- Support for forcing single/multi-threaded mode with --threads for cases when xz decides to use only one thread.

- Fix changing the memlimit after LZMA_MEMLIMIT_ERROR in the old single-threaded decoder. (I knew it's a rare use case but clearly it's not a use case at all since I haven't seen bug reports.)

- Your test framework patches

I suppose then the next alpha release is close to ready.

-- Lasse Collin
Re: [xz-devel] Re: improve java delta performance
> On Thu, May 6, 2021 at 4:18 PM Brett Okken > wrote: > > > These changes reduce the time of DeltaEncoder by ~65% and > > DeltaDecoder by ~40%, assuming using arrays that are several KB in > > size. On 2022-02-12 Brett Okken wrote: > Can this be reviewed? It looks reasonable but I try to focus on XZ Utils at the moment. The Delta code in XZ Utils is also very simple and could be optimized the same way. But since Delta isn't used alone (it's used together with LZMA2) I suspect the overall improvement isn't big. It could still be done as it is simple but I won't look at it now. For the ArrayUtil patch, it's a complex one and I'm not able to look at it for now. -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
On 2022-03-15 Jia Tan wrote: > As promised, I have attached a patch to solve the problem. Instead of > doing as I had originally proposed, I simply added a wake up signal > to a sleeping thread if partial updates are enabled. When the worker > wakes up, it checks if no more input > is available and signals to the main thread if it has output ready > before going back > to sleep. This prevents the deadlock on my liblzma tests and testing > xz with/without timeout. Thanks to both of you for debugging this. I see now that I had completely missed this corner case. The patch looks correct except that the mutex locking order is wrong which can cause a new deadlock. If both thr->mutex and coder->mutex are locked at the same time, coder->mutex must be locked first. About memlimit updates, that may indeed need some work but I don't know yet how much is worth the trouble. stream_decoder_mt_memconfig() has a few FIXMEs too, maybe they don't need to be changed but it needs to be decided. I'm in a hurry now but I should have time for xz next week. :-) -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
Hello! Once again, sorry for the delay. I will be busy the rest of the week. I will get back to xz early next week. On 2022-03-07 Sebastian Andrzej Siewior wrote: > 32 cores: > > | $ time ./src/xz/xz -tv tars.tar.xz -T0 > | tars.tar.xz (1/1) > | 100 % 2.276,2 MiB / 18,2 GiB = 0,122 1,6 GiB/s 0:11 > | > | real 0m11,162s > | user 5m44,108s > | sys 0m1,988s > > 256 cores: > | $ time ./src/xz/xz -tv tars.tar.xz -T0 > | tars.tar.xz (1/1) > | 100 % 2.276,2 MiB / 18,2 GiB = 0,122 3,4 GiB/s 0:05 > | > | real 0m5,403s > | user 4m0,298s > | sys 0m24,315s > > it appears to work :) If I see this right, then the file is too small > or xz too fast but it does not appear that xz manages to create more > than 100 threads. Thanks! The scaling is definitely good enough. :-) Even if there was room for improvement I won't think about it much for now. A curious thing above is the ratio of user-to-sys time. With more threads a lot more is spent in syscalls. > and decompression to disk > | $ time ~bigeasy/xz/src/xz/xz -dvk tars.tar.xz -T0 > | tars.tar.xz (1/1) > | 100 % 2.276,2 MiB / 18,2 GiB = 0,122 746 MiB/s 0:24 > | > | real 0m25,064s > | user 3m49,175s > | sys 0m29,748s > > appears to block at around 10 to 14 threads or so and then it hangs > at the end until disk I/O finishes. Decent. > Assuming disk I/O is slow, say 10MiB/s, and we would have 388 CPUs > (blocks/2), then it would decompress the whole file into memory and > get stuck on disk I/O? Yes. I wonder if the way xz does I/O might affect performance. Every time the 8192-byte input buffer is empty (that is, liblzma has consumed it), xz will block reading more input until another 8192 bytes have been read. As long as threads can consume more input, each call to lzma_code() will use all 8192 bytes. Each call might pass up to 8192 bytes of output from liblzma to xz too. If the compression ratio is high and reading input isn't very fast, then perhaps performance might go down because blocking on input prevents xz from producing more output.
Only when liblzma cannot consume more input will xz produce output at full speed. That is, I wonder if with slow input the output speed will be limited until the input buffers inside liblzma have been filled. My explanation isn't very good, sorry. Ideally input and output would be in different threads but the liblzma API doesn't really allow that. Based on your benchmarks the current method is likely good enough in practice. > In terms of scaling, xz -tv of that same file with -T1…64: [...] > time of 1 CPU / 64 = (3 * 60 + 38) / 64 = 3.40625 > > Looks okay. Yes, thanks! > > If the input is broken, it should produce as much output as the > > single-threaded stable version does. That is, if one thread detects > > an error, the data before that point is first flushed out before > > the error is reported. This has pros and cons. It would be easy to > > add a flag to allow switching to fast error reporting for > > applications that don't care about partial output from broken > > files. > > I guess most of them don't care because an error is usually an abort, > the sooner, the better. It is probably the exception that you want > to decompress it despite the error and maybe go on with the next block > and see what is left. I agree. Over 99 % of the time any error means that the whole output will be discarded. However, I would like the threaded decoder to (optionally) have very similar external behavior to the single-threaded version for cases where it might matter. It's not perfect at the moment but I think it's decent enough (bugs excluded). Truncated files are a special case of corrupt input because, unless LZMA_FINISH is used, liblzma cannot know if the input is truncated or if there is merely a pause in the input for some application-specific reason. That can result in LZMA_BUF_ERROR but if the application knows that such pauses are possible then it can handle LZMA_BUF_ERROR specially and continue decoding when more input is available. -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
Hello! I committed something. The liblzma part shouldn't need any big changes, I hope. There are a few FIXMEs but some of them might actually be fine as is. The xz side is just an initial commit, there isn't even --memlimit-threading option yet (I will add it). Testing is welcome. It would be nice if someone who has 12-24 hardware threads could test if it scales well. One needs a file with like a hundred blocks, so with the default xz -6 that means a 2.5 gigabyte uncompressed file, smaller if one uses, for example, --block-size=8MiB when compressing. If the input is broken, it should produce as much output as the single-threaded stable version does. That is, if one thread detects an error, the data before that point is first flushed out before the error is reported. This has pros and cons. It would be easy to add a flag to allow switching to fast error reporting for applications that don't care about partial output from broken files. -- Lasse Collin
Re: [xz-devel] [PATCH] liblzma: Use non-executable stack on FreeBSD as on Linux
On 2022-02-11 Ed Maste wrote: > src/liblzma/check/crc32_x86.S | 4 ++-- > src/liblzma/check/crc64_x86.S | 4 ++-- > 2 files changed, 4 insertions(+), 4 deletions(-) I have committed (but not tested) this. Thanks! -- Lasse Collin
Re: [xz-devel] xz-utils-man.po, French translation
On 2022-02-10 Mario Blättermann wrote: > The file is broken; due to some markup errors it produces only one of > the manpages. See the attached patch. Sorry to all, this time I had skipped testing and checking the translation before committing it and it broke the build (po4a failure). I have committed your patch. Now it works. :-) Thanks! > Lasse, besides the markup issues, both French and German translations > are meanwhile incomplete and partially outdated. Yes, although in the context of the v5.2 branch they should be slightly less outdated. The master branch is still in alpha stage and not meant for any distribution like Debian. This is a problem with translations as it's not clear if v5.2 or master should be translated. They don't differ much but still. If 5.2.6 will be needed, then translating v5.2 might make more sense, maybe. > Please update po4a/xz-man.pot, and then consider creating a kind of > "intermediate" tarball and sending it to the TP robot, requesting a new > TP domain for "xz-man". I tried requesting an xz-man domain a year ago and that didn't go well for a few reasons. Maybe I will dare to retry when the master branch is getting close to becoming a stable release. Or it might be easier to handle the man pages outside the Translation Project, we'll see. There are many open issues in the project that have accumulated over the years; translations are unfortunately just one thing. I have many xz-related emails that I haven't answered yet. So the situation is a bit chaotic. My life situation is now a little different and I'm hoping I can focus on xz more now. So I'm trying to sort this out, we'll see how it goes in the next 2-4 months. I'm hoping to commit a version of the threaded decoder in a few days. All big FIXMEs are solved, only a few small ones to do. :-) Gitweb is working again. -- Lasse Collin
Re: [xz-devel] xz-utils-man.po, French translation
On 2022-01-08 Jean-Pierre Giraud wrote: > Package xz-utils > version 5.2.5-2 > > Hi, > Please find attached the french translation of the xz-utils manpage > done by "bubu" and proofread by the debian-l10n-french mailing list > contributors. Thanks! Committed. -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
On 2021-12-31 Sebastian Andrzej Siewior wrote: > On 2021-12-15 23:33:58 [+0200], Lasse Collin wrote: > > Yes. It's fairly simple from implementation point of view but is it > > clear enough for the users, I'm not sure. > > > > I suppose the alternative is having just one limit value and a flag > > to tell if it is a soft limit (so no limit for single-threaded > > case) or a hard limit (return LZMA_MEM_ERROR if too low for even > > single thread). Having separate soft and hard limits instead can > > achieve the same and a little more, so I think I'll choose the > > two-value approach and hope it's clear enough for users. > > The two-value approach might work. I'm not sure if the terms `soft' and > `hard' are good here. Using `memlimit' and `memlimit_threaded' (or so) > might make it more obvious and easier to understand. > But then this is just some documentation that needs to be read and > understood so maybe `softlimit' and `hardlimit' will work just fine. I now plan to use memlimit_threading and memlimit_stop in the lzma_mt structure. Documentation is still needed but hopefully those are a bit more obvious. > > I was hoping to get this finished by Christmas but due to a recent > > sad event, late January is my target for the next alpha release > > now. And I'm late again. :-( This is more work than I had expected because there unfortunately are a few problems in the code and fixing them all requires quite significant changes (and I'm slow). As a bonus, working on this made me notice a few small bugs in the old liblzma code too (not yet committed). The following tries to explain some of the problems and what I have done locally. I don't have code to show yet because it still contains too many small FIXMEs but, as unbelievable as it might sound, this will get done. I need a few more days; I have other things I must do too. The biggest issue is handling of memory usage and threaded vs. direct mode.
The memory usage limiting code makes assumptions that are true with the most common files but there are situations where these assumptions fail: (1) If a non-first Block requires a lot more memory than the first Block and so the memory limit would be exceeded in threaded mode, the decoder will not switch to direct mode even with LZMA_MEMLIMIT_COMPLETE. Instead the decoder proceeds with one thread and uses as much memory as that needs. (2) If a non-first Block lacks size info in its Block Header, the decoder won't switch to direct mode. It returns LZMA_PROG_ERROR instead. (3) The per-thread input buffers can grow as bigger Blocks are seen but the buffers cannot shrink. This has pros and cons. It's a problem if a single Block is very big and others are not. I thought it's better to first decode the Block Header to coder->block_options and then, based on the facts from that Block Header, determine memory usage and how to proceed (including switching to/from direct mode). This way there is no need to assume or expect anything. (coder->block_options need to be copied to a thread-specific structure before initializing the decoder.) For direct mode, I added separate SEQ states for it. This also helps making the code more similar to the single-threaded decoder in both looks and behavior. I hope that with memlimit_threading = 0 the threaded version can have identical externally-visible behavior as the original single-threaded version. This way xz doesn't need both functions (the single-threaded function is still needed if built with --disable-threads). Corner cases of the buffer-to-buffer API: (4) In some use cases there might be long pauses where no new input is available (for example, sending a live log file over network with compression). It is essential that the decoder will still provide all output that is easily possible from the input so far. That is, if the decoder was called without providing any new input, it might need to be handled specially. 
SEQ_BLOCK_HEADER and SEQ_INDEX return immediately if the application isn't providing any new input data, and so eventually lzma_code() will return LZMA_BUF_ERROR even when there would be output available from the worker threads. try_copy_decoded() could be called earlier but there is more to fix (see (5) and (6)). (Also remember my comment above that I changed the code so that Block Header is decoded first before getting a thread. That adds one more SEQ point where waiting for output is needed.) (5) The decoder must work when the application provides an output buffer whose size is exactly the uncompressed size of the file. This means that one cannot simply use *out_pos == out_size to determine when to return LZMA_OK. Perhaps the decoder hasn't marked its lzma_outbuf as finished but no more output will be coming, or there is an empty Block (empty Blocks perh
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
On 2021-12-04 Sebastian Andrzej Siewior wrote: > On 2021-11-30 00:25:11 [+0200], Lasse Collin wrote: > > Separate soft and hard limits might be convenient from > > implementation point of view though. xz would need --memlimit-soft > > (or some better name) which would always have some default value > > (like MemAvailable). The threaded decoder in liblzma would need to > > take two memlimit values. Then there would be no need for an enum > > (or a flag) to specify the memlimit mode (assuming that > > LZMA_MEMLIMIT_THREAD is removed). > > Ah I see. So one would say soft-limit 80MiB, hard-limit 2^60 bytes and > would get no threading at all / LZMA_MEMLIMIT_NO_THREAD. And with soft > 1GiB, hard 2^60 bytes one would get the threading mode. (2^60 is a made-up > "no limit".) Yes. It's fairly simple from implementation point of view but is it clear enough for the users, I'm not sure. I suppose the alternative is having just one limit value and a flag to tell if it is a soft limit (so no limit for single-threaded case) or a hard limit (return LZMA_MEM_ERROR if too low for even single thread). Having separate soft and hard limits instead can achieve the same and a little more, so I think I'll choose the two-value approach and hope it's clear enough for users. > > I wonder if relying on the lzma_mt struct is useful for the decoder. > > Perhaps the options could be passed directly as arguments as there > > are still 2-3 fewer than needed for the encoder. > > There is > - num threads > - flags > - memlimit > - timeout > > One struct to rule them all and you could extend it without the need > to change the ABI. > I took one of the reserved ones for the memlimit. If you put the two > memory limits and number of threads in one init/configure function > then only flags and timeout are left. Maybe that would be enough then. You have a valid point. Either approach works, new functions can be added if needed for extending the ABI, but having just one can be nice in the long term.
I was hoping to get this finished by Christmas but due to a recent sad event, late January is my target for the next alpha release now. I hope to include a few other things too, including some of Jia Tan's patches (we've chatted outside the xz-devel list). Thank you for understanding. -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
Hello! On 2021-02-05 Sebastian Andrzej Siewior wrote: > - Added enum `lzma_memlimit_opt' to lzma_stream_decoder_mt() as an > init parameter. The idea is to specify how to obey the memory limit > so the user can keep using one API and not worry about failing due to the > memory limit. Let's assume the archive has a 9MiB dictionary, 24MiB > block of uncompressed data. The archive contains two compressed > blocks of 10 MiB each. Using two threads, the memory requirement is > roughly (9 + 24 + 10) * 2 = 86 MiB > > On a system with 64 MiB of memory with additional 128MiB of swap it > likely leads to the use of (say 30 MiB) swap memory during > decompression which will slow down the whole operation. > The synchronous API would do just fine with only 9 MiB of memory. > > So, to keep things simple, when invoking lzma_stream_decoder_mt() with > a memory limit of 32 MiB, three scenarios are possible: > - LZMA_MEMLIMIT_THREAD > One thread requires 43MiB of memory and would exceed the memory > limit. However, continue with one thread instead of possibly two. > > - LZMA_MEMLIMIT_NO_THREAD > One thread requires 43MiB of memory and would exceed the memory > limit. Fall back to the synchronous API without buffered input / > output memory. > > - LZMA_MEMLIMIT_COMPLETE > In this scenario it would behave like LZMA_MEMLIMIT_NO_THREAD. > However, with a dictionary size > 32MiB it would abort. In the old single-threaded code, if no memory usage limit is specified the worst case memory usage with LZMA2 is about 4 GiB (the format allows a 4 GiB dict although the current encoder only supports 1536 MiB). With the threaded decoder it's the same with LZMA_MEMLIMIT_NO_THREAD. However, LZMA_MEMLIMIT_THREAD sounds a bit scary. There are no practical limits to the block size so there can be a .xz file that makes the decoder allocate a huge amount of memory. It doesn't even need to be an intentionally malicious file, it just needs to have the size fields present.
Thus, I think LZMA_MEMLIMIT_THREAD should be removed. One-thread multi-threaded mode will still be used with LZMA_MEMLIMIT_NO_THREAD if the limit is high enough. LZMA_MEMLIMIT_NO_THREAD should be the default in xz when no memory usage limit has been explicitly specified. There needs to be a default "soft limit" (the MemAvailable method is such) that will drop xz to single-threaded mode if the soft limit is too low for threaded mode (even with just one thread). LZMA_MEMLIMIT_COMPLETE could be the mode to use when a memlimit is explicitly specified (a "hard limit") on the xz command line. This would match the existing behavior of the old single-threaded decoder. It would be good to have a way to specify a soft limit on the xz command line too. It could make sense to have both soft and hard limit at the same time but perhaps it gets too confusing: a soft limit that would be used to restrict the number of threads (and even drop to single-threaded mode) and a hard limit which can return LZMA_MEMLIMIT_ERROR. If one is fine with using 300 MiB in threaded mode but still wants to allow up to 600 MiB in case the file *really* requires that much even in single-threaded mode, then this would be useful. Separate soft and hard limits might be convenient from implementation point of view though. xz would need --memlimit-soft (or some better name) which would always have some default value (like MemAvailable). The threaded decoder in liblzma would need to take two memlimit values. Then there would be no need for an enum (or a flag) to specify the memlimit mode (assuming that LZMA_MEMLIMIT_THREAD is removed). Extra idea, maybe useless: The --no-adjust option could be used to specify that if the specified number of threads isn't possible due to a memlimit then xz will abort. This is slightly weird as it doesn't provide real performance guarantees anyway (block sizes could vary a lot) but it's easy to implement if it is wanted.
I wonder if relying on the lzma_mt struct is useful for the decoder. Perhaps the options could be passed directly as arguments as there are still 2-3 fewer than needed for the encoder. I've made some other minor edits locally already so I would prefer to *not* get new patch revisions until I have committed something. Comments are very welcome. :-) Thanks! -- Lasse Collin
Re: [xz-devel] [PATCH] xz: Multithreaded mode now always uses stream_encoder_mt to ensure reproducible builds
On 2021-11-29 Jia Tan wrote: > This patch addresses the issues with reproducible builds when using > multithreaded xz. Previously, specifying --threads=1 instead of > --threads=[n>1] created different output. Now, setting any number of > threads forces multithreading mode, even if there is only 1 worker > thread. This is an old problem that should have been fixed long ago. Unfortunately I think the fix needs to be a little more complex due to backward compatibility. With this patch, if threading has been enabled, no further option on the command line (except --flush-timeout) will disable threading. Sometimes there are default options (for example, XZ_DEFAULTS) that enable threading and one wants to disable it in a specific situation (like running multiple xz commands in parallel via xargs). If --threads=1 always enables threading, memory usage will be quite a bit higher than in non-threaded mode (94 MiB vs. 166 MiB for the default compression level -6; 674 MiB vs. 1250 MiB for -9). To be backward compatible, maybe it needs extra syntax within the --threads option or a new command line option. Both are a bit annoying and ugly but I don't have a better idea. Currently one-thread multi-threading is done if one specifies two or more threads but the memory limit is so low that only one thread can be used. In that case xz will never switch to non-threaded mode. This ensures that the output file is always the same even if the number of threads gets reduced. When -T0 is used, that is broken in the sense that the threading mode (and thus the encoded output) depends on how many hardware threads are supported. So perhaps -T0 should mean that multi-threaded mode must be used even for a single thread (your patch would do this too). A way to explicitly specify one-thread multi-threaded mode is still needed but I guess it wouldn't need to be used so often if -T0 handles it already.
-T0 needs improvements in default memory usage limiting too, and both changes could make the default behavior better. The opposite functionality could be made available too: if the number of threads becomes one for whatever reason, an option could tell xz to always use single-threaded mode to get better compression and to save RAM. > +#include "common.h" [...] > // The max is from src/liblzma/common/common.h. > hardware_threads_set(str_to_uint64("threads", > - optarg, 0, 16384)); > + optarg, 0, LZMA_THREADS_MAX)); common.h is internal to liblzma and must not be used from xz. Maybe LZMA_THREADS_MAX could be moved to the public API, I don't know right now. -- Lasse Collin
Re: [xz-devel] [PATCH] xz: Added .editorconfig file for simple style guide encouragement
Hello! On 2021-10-30 Jia Tan wrote: > This patch adds a .editorconfig to the root directory. Thanks! I hadn't heard about this before but it sounds nice. > +[*] > +insert_final_newline = true > +trim_trailing_whitespace = true I think it should be fine to add these: charset = utf-8 end_of_line = lf The exceptions are some files under windows/vs*. Those files will hopefully be gone in the future though. They use LF, not CR+LF, but have a BOM: [*.vcxproj,xz_win.sln] charset = utf-8-bom > +[src/,tests/] If the syntax is similar to gitignore, then src/ will match also foo/bar/src/. It doesn't really matter here but I suppose /src/ is a tiny bit more correct. > +indent_style = tab I guess it makes sense to set also indent_size = 8 because viewing the files with any other setting will look weird when long lines are wrapped, and editing can result in wrong word wrapping. There are multiple indentation styles even under src. Instead of specifying directories, how about specifying file suffixes like *.c so it won't matter where the files are. There are .sh files with different styles but maybe it's not that important. I ended up with this: --- # To use this config on your editor, follow the instructions at: # https://editorconfig.org/ root = true [*] charset = utf-8 end_of_line = lf insert_final_newline = true trim_trailing_whitespace = true [*.c,*.h,*.S,*.map,*.sh,*.bash,Makefile*,/configure.ac,/po4a/update-po,/src/scripts/{xzless,xzmore}.in] indent_style = tab indent_size = 8 [/src/scripts/{xzdiff,xzgrep}.in] indent_style = space indent_size = 2 [CMakeLists.txt,*.cmake] indent_style = space indent_size = 4 [*.vcxproj,xz_win.sln] charset = utf-8-bom --- Is it good enough or did I add bad bugs? :-) -- Lasse Collin
Re: [xz-devel] Multithreaded decompression for XZ Utils.
On 2021-11-06 Sebastian Andrzej Siewior wrote: > just spotted that Christmas is around the corner. I *think* that I've > been a good boy over the year. I plan to keep it that way just to be > sure. Not trying to push my luck here but what are my chances to find > parallel decompression in xz-utils under the christmas tree? You have been a very good boy indeed and I have been the opposite, still not having gotten this done. I don't want to give any odds, although there are reasons why the odds should be better than a month or two ago, but I will really try so that Santa can deliver a new alpha package. -- Lasse Collin
[xz-devel] XZ Utils 5.3.2alpha
XZ Utils 5.3.2alpha is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file: This release was made on short notice so that recent erofs-utils can be built with LZMA support without needing a snapshot from xz.git. Thus many pending things were not included, not even updated translations (which would need to be updated for the new --list strings anyway). * All fixes from 5.2.5. * xz: - When copying metadata from the source file to the destination file, don't try to set the group (GID) if it is already set correctly. This avoids a failure on OpenBSD (and possibly on a few other OSes) where files may get created so that their group doesn't belong to the user, and fchown(2) can fail even if it needs to do nothing. - The --keep option now accepts symlinks, hardlinks, and setuid, setgid, and sticky files. Previously this required using --force. - Split the long strings used in --list and --info-memory modes to make them much easier for translators. - If built with sandbox support and enabling the sandbox fails, xz will now immediately exit with exit status of 1. Previously it would only display a warning if -vv was used. - Cap --memlimit-compress to 2000 MiB on MIPS32 because on MIPS32 userspace processes are limited to 2 GiB of address space. * liblzma: - Added lzma_microlzma_encoder() and lzma_microlzma_decoder(). The API is in lzma/container.h. The MicroLZMA format is a raw LZMA stream (without end marker) whose first byte (always 0x00) has been replaced with bitwise-negation of the LZMA properties (lc/lp/pb). It was created for use in EROFS but may be used in other contexts as well where it is important to avoid wasting bytes for stream headers or footers. The format is also supported by XZ Embedded. The MicroLZMA encoder API in liblzma can compress into a fixed-sized output buffer so that as much data is compressed as can be fit into the buffer while still creating a valid MicroLZMA stream. This is needed for EROFS. - Added fuzzing support. 
- Support Intel Control-flow Enforcement Technology (CET) in 32-bit x86 assembly files. - Visual Studio: Use non-standard _MSVC_LANG to detect C++ standard version in the lzma.h API header. It's used to detect when "noexcept" can be used. * Scripts: - Fix exit status of xzdiff/xzcmp. Exit status could be 2 when the correct value is 1. - Fix exit status of xzgrep. - Detect corrupt .bz2 files in xzgrep. - Add zstd support to xzgrep and xzdiff/xzcmp. - Fix less(1) version detection in xzless. It failed if the version number from "less -V" contained a dot. * Fix typos and technical issues in man pages. * Build systems: - Windows: Fix building of resource files when config.h isn't used. CMake + Visual Studio can now build liblzma.dll. - Various fixes to the CMake support. It might still need a few more fixes even for liblzma-only builds. -- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid fchown(2) failure.
On 2021-10-05 Alexander Bluhm wrote: > OpenBSD does not allow to change the group of a file if the user > does not belong to this group. In contrast to Linux, OpenBSD also > fails if the new group is the same as the old one. Do not call > fchown(2) in this case, it would change nothing anyway. Thanks! Committed. -- Lasse Collin
Re: [xz-devel] [PATCH] add xz arm64 bcj filter support
On 2021-09-02 Liao Hua wrote: > We have some questions about xz BCJ filters. > 1. Why are the ARM and ARM-Thumb BCJ filters little endian only? Perhaps it's an error. Long ago when I wrote the docs, I knew that the ARM filters worked on little endian code but didn't know how big endian ARM was done. If it always uses the same encoding for instructions, then the docs should be fixed. The same is likely true about PowerPC. > 2. Why is there no arm64 BCJ filter? Are there any technical risks? > Or other considerations? It just hasn't been done, no other reason. In general I haven't gotten much done in years and there even are a few patches (unrelated to BCJ) that have been waiting for my feedback for a very long time. :-( > We added arm64 BCJ filter support in our local xz code and it works OK. > We modified the Linux kernel code accordingly and used the new xz to > compress the kernel, and the kernel is decompressed successfully during > startup. > > The following is the patch for arm64 BCJ filter support, which is > based on xz 5.2.5. Thanks! > + // arm64 bl instruction: 0x94 and 0x97; > + if (buffer[i + 3] == 0x94 || buffer[i + 3] == 0x97) { The "bl" instruction takes a signed 26-bit immediate value that encodes the offset as a multiple of four bytes. The above matches only when the two highest bits are either 00 or 11. Is it intentional that it ignores immediate values with the highest bits 01 and 10? Ignoring 01 (offset > 64 MiB) and 10 (offset < -64 MiB) results in fewer false matches when the filter is applied to non-code data. Also, perhaps such offsets aren't so common in actual code (they can appear in big binaries only). If false matches are an issue, it might even make sense to reduce the range further (+/-32 MiB would be the same as on 32-bit ARM): for (i = 0; i + 4 <= size; i += 4) { const uint32_t instr = read32le(buffer + i); const uint32_t x = instr & 0xFF800000; if (x == 0x94000000 || x == 0x97800000) { ...
It's not obvious what is better so it would be good to test with a few types of files (kernel image, and a few GNU/Linux distro packages containing both executable and data files). Also, the way the two highest bits are ignored means that the sign bit isn't taken into account when doing the conversion. The calculation of "dest" will never flip the sign bit(s) (0x94 to 0x97 or vice versa) when the addition/subtraction wraps around. Maybe it doesn't matter much in practice. Have you tested if instructions other than "bl" could be worth converting too? The unconditional branch instruction "b" is the most obvious candidate to try (0x14 instead of 0x94). I don't expect much but at this point it is easy to test. It's possible that it depends too much on what kind of code the input file has (it might help with some files and be harmful with many others). Since this is a new filter, I would like to avoid a problem that other BCJ filters have: Linux kernel modules, static libraries and such files have the address part in the instructions filled with zeroes (correct values will be set when the file is linked). For example, if you run "objdump -d" on a x86-64 Linux module, there are lots of "call" instructions encoded as "e8 00 00 00 00". I haven't checked if this is similar on ARM64 but it sounds likely. The existing BCJ filters make compression worse with these files. The correct action would be to do nothing with zeroed addresses: if (src == 0) continue; However, the encoder has to avoid conversions that would result in a zero that the decoder would ignore. On the other hand, the decoder will never need to decode a non-zero input value to a zero. These special cases can be used together. Untested code: if (src == 0) continue; src <<= 2; const uint32_t pc = now_pos + (uint32_t)(i); uint32_t dest = is_encoder ? src + pc : src - pc; // The mask assumes that only 24 bits of the 26-bit immediate // are used. if ((dest & 0x3FFFFFC) == 0) { assert((pc & 0x3FFFFFC) != 0); dest = is_encoder ?
pc : 0U - pc; } dest >>= 2; The "start=offset" option probably could be omitted. It's quite useless inside .xz. XZ Embedded doesn't support it anyway. Once a filter is ready, I will need to discuss it with Igor Pavlov (the 7-Zip's developer) too, and add the new filter ID to the official .xz specification. -- Lasse Collin
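The interaction between the zero-skip rule and the wrap-around special case above can be illustrated with a small round-trip model. This is hypothetical code, not the actual filter: it masks the full 26-bit immediate instead of the partial mask discussed in the email, and the names `Arm64Sketch`, `convert` and `IMM_MASK` are invented for the example. The property being demonstrated is that encoding never produces 0 (which the decoder would skip) and decoding inverts encoding for every non-zero immediate, provided (pc & IMM_MASK) != 0.

```java
// Hypothetical round-trip model of the zero-skip + wrap-around rules.
// Unlike the email's snippet, the full 26-bit immediate is checked
// instead of a partial mask, to keep the invariant easy to verify.
final class Arm64Sketch {
    static final int IMM_MASK = 0x03FFFFFF; // 26-bit branch immediate

    static int convert(int src, int pc, boolean isEncoder) {
        if (src == 0)
            return 0; // zeroed (unlinked) address: leave untouched

        int dest = (isEncoder ? src + pc : src - pc) & IMM_MASK;
        if (dest == 0) {
            // The encoder must not emit 0 because the decoder would
            // skip it. A zero here can only come from src == -pc, so
            // that one value is given a substitute encoding and the
            // decoder mirrors the substitution.
            dest = (isEncoder ? pc : -pc) & IMM_MASK;
        }
        return dest;
    }

    public static void main(String[] args) {
        int pc = 0x1234; // (pc & IMM_MASK) != 0 is required
        int[] samples = { 1, 0x100, (-pc) & IMM_MASK, IMM_MASK };
        for (int src : samples) {
            int enc = convert(src, pc, true);
            if (enc == 0 || convert(enc, pc, false) != src)
                throw new AssertionError("round trip failed for " + src);
        }
        System.out.println("round trip ok");
    }
}
```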
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
Hello!

On 2021-07-20 Guillem Jover wrote:
> I've only skimmed very quickly over the patch, but I've been running
> it on my system in addition to a locally modified dpkg that uses this
> new support, and it seems to be working great. :)

Great to hear, thanks! :-) Unfortunately I don't have any news. :-(

-- Lasse Collin
Re: [xz-devel] Go/Golang bindings for xz
Hello! On 2021-04-12 James Fennell wrote: > Over the last couple of weeks I've been working on a project to add > Go bindings for the xz format: https://github.com/jamespfennell/xz :-) > The project uses the Go technology cgo to compile the relevant > liblzma C files automatically and link them in with the Go binary. That made me wonder about config.h and the #defines. With a really quick look I found https://github.com/jamespfennell/xz/blob/main/lzma/lzma.go which sets a few #defines but it's quite limited, for example, a comment tells that only 64-bit systems are supported. I also don't see TUKLIB_FAST_UNALIGNED_ACCESS which is good on 32/64-bit x86 and some ARMs to get a little better encoder performance. Also #define TUKLIB_SYMBOL_PREFIX lzma_ could be good to have to ensure that all symbols begin with "lzma_". Of course these don't matter if the system liblzma is used instead. I understood that it's an option too. > Lasse, would you be interested in adding a link under the bindings > section of the xz website? I can. Since there are other bindings to use liblzma, I wonder if some of those should be listed too. What do you think? I have no Go experience so I have no idea which are good or already popular. Thanks! -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Status of man page translations?
On 2021-04-15 Mario Blättermann wrote: > Am So., 11. Apr. 2021 um 20:48 Uhr schrieb Lasse Collin > : > > I suppose I can just submit a snapshot from the master branch. I have done this. > I am curious to see when the first new translations will arrive :) Me too. It's a lot of work to translate them all. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Status of man page translations?
On 2021-04-04 Mario Blättermann wrote:
> But what is the blocker which still prevents you from creating an
> intermediate tarball, sending it to the TP coordinator and telling
> him to create a new domain named "xz-man"?

I suppose I had forgotten it. If there were other reasons, I have forgotten them too. Sorry. I suppose I can just submit a snapshot from the master branch.

xz-man.pot is compatible with v5.2 for now. xz.pot isn't compatible between the branches though, but if 5.2.6 is needed (impossible to know now) maybe it's not that bad: the command line tool translations in v5.2 have strings that are difficult to get right. The master branch has such strings too but not as many. For the 5.2.5 release, many translations didn't pass basic quality control due to these strings. Some translators (individuals or teams) replied to my emails about suggested white-space corrections, some didn't. Thus multiple translations were omitted from 5.2.5. With this background I feel that if 5.2.6 is needed I won't consider any *new* xz.po files for it anyway; new xz-man.po languages would be fine.

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] [PATCH] Reduce maximum possible memory limit on MIPS32
On 2021-04-09 Vitaly Chikunov wrote: > From: "Ivan A. Melnikov" > > Due to architectural limitations, address space available to a single > userspace process on MIPS32 is limited to 2 GiB, not 4, even on > systems that have more physical RAM -- e.g. 64-bit systems with 32-bit > userspace, or systems that use XPA (an extension similar to x86's > PAE). > > So, for MIPS32, we have to impose stronger memory limits. I've chosen > 2000MiB to give the process some headroom. Thanks! Committed. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
[xz-devel] XZ for Java 1.9
XZ for Java 1.9 is available at <https://tukaani.org/xz/java.html> and in the Maven Central (groupId = org.tukaani, artifactId = xz). Here is an extract from the NEWS file: * Add LZMAInputStream.enableRelaxedEndCondition(). It allows decompression of LZMA streams whose uncompressed size is known but it is unknown if the end of stream marker is present. This method is meant to be useful in Apache Commons Compress to support .7z files created by certain very old 7-Zip versions. Such files have the end of stream marker in the LZMA data even though the uncompressed size is known. 7-Zip supports such files and thus other implementations of the .7z format should support them too. * Make LZMA/LZMA2 decompression faster. With files that compress extremely well the performance can be a lot better but with more typical files the improvement is minor. * Make the CRC64 code faster. * Add module-info.java as multi-release JAR. The attribute Automatic-Module-Name was removed. * The binaries for XZ for Java 1.9 in the Maven Central now require Java 7. Building the package requires at least Java 9 for module-info support but otherwise the code should still be Java 5 compatible (see README and comments in build.properties). -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-03-01 Brett Okken wrote:
> > One thing that confuses me in your version is the special handling
> > of the first byte:
> >
> >     buf[pos++] = buf[back++];
> >     --left;
> >
> > If there are two bytes to copy, then one will be copied above and
> > the other with arraycopy later. If there are more bytes to copy and
> > distance is very small, incrementing "back" above can mean that an
> > extra arraycopy call might be needed in the loop because the first
> > copy will be one byte smaller.
> >
> > I understand that it might help when there is just one byte to
> > repeat because then the while-loop will be skipped. In all other
> > situations it sounds like the special handling of the first byte
> > would in theory be harmful. Note that I don't doubt your test
> > results; I already saw with the CRC64 code that some changes in the
> > code can affect performance in weird ways.
>
> The image1.dcm is the most impacted by this optimization. Again, this
> file is basically a large greyscale bmp. This results in a significant
> number of single byte repeats. Optimizing for the single byte improves
> performance in that file by 3-5%, while having smaller effects on the
> other 2 files (ihe_ovly_pr.dcm slightly slower, large.xml slightly
> faster)

OK, that is an interesting test case.

> I agree your approach is more readable. From your version of it, I was
> expecting that simplicity in reading to translate into better
> performance. This latest version actually does appear to do that. The
> image1.dcm performance matches my version and the other 2 are a bit
> faster. Adding the single byte optimization still speeds up image1.dcm
> (~8ms, ~2%) and large.xml (~3ms, 2%), while slowing ihe_ovly_pr.dcm
> (~.008ms, ~1%).

[...]

> Version 3 is better for all 3 files.

With these results I now plan to include version 3 in the next release. It sounds like the single-byte optimization has a fairly small effect. Omitting it keeps the code a tiny bit simpler.
I have committed the change. I think xz-java.git should now be almost ready for a release. I just need to add NEWS and bump the version number. Thanks for your help! -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-13 Brett Okken wrote:
> On Thu, Feb 11, 2021 at 12:51 PM Lasse Collin wrote:
> > I still worry about short copies. If the file is full of tiny
> > matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such
> > files aren't typical at all but I don't want to add a corner case
> > where the performance drops too much.
>
> Do you have examples of such files, or code on how to generate one?

Use the patch below and compress with this:

    java -jar build/jar/XZEncDemo.jar 2 < infile > outfile.xz

Adjust LIMIT to get longer matches.

diff --git a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
index f8230ee..cd92ca6 100644
--- a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
+++ b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
@@ -44,6 +44,8 @@ final class LZMAEncoderFast extends LZMAEncoder {
         return smallDist < (bigDist >>> 7);
     }
 
+    private static final int LIMIT = 2;
+
     int getNextSymbol() {
         // Get the matches for the next byte unless readAhead indicates
         // that we already got the new matches during the previous call
@@ -66,11 +68,13 @@
         int bestRepIndex = 0;
         for (int rep = 0; rep < REPS; ++rep) {
             int len = lz.getMatchLen(reps[rep], avail);
+            if (len > LIMIT)
+                len = LIMIT;
             if (len < MATCH_LEN_MIN)
                 continue;
 
             // If it is long enough, return it.
-            if (len >= niceLen) {
+            if (len >= LIMIT) {
                 back = rep;
                 skip(len - 1);
                 return len;
@@ -88,9 +92,11 @@
         if (matches.count > 0) {
             mainLen = matches.len[matches.count - 1];
+            if (mainLen > LIMIT)
+                mainLen = LIMIT;
             mainDist = matches.dist[matches.count - 1];
 
-            if (mainLen >= niceLen) {
+            if (mainLen >= LIMIT) {
                 back = mainDist + REPS;
                 skip(mainLen - 1);
                 return mainLen;

With a quick try I got a feeling that my worry about short repeats was wrong. It doesn't matter because decoding each LZMA symbol is much more expensive.
What matters is avoiding multiple tiny arraycopy calls within a single run of the repeat method, and that problem was already solved.

> > I came up with the following. I haven't decided yet if I like it.
>
> On the 3 files I have been testing with, this change is a mixed bag.
> Compared to trunk 1 regresses by ~8%. While the other 2 do improve,
> neither are better than my last patch.

OK, thanks. So it isn't great. I wonder which details make the difference.

One thing that confuses me in your version is the special handling of the first byte:

    buf[pos++] = buf[back++];
    --left;

If there are two bytes to copy, then one will be copied above and the other with arraycopy later. If there are more bytes to copy and distance is very small, incrementing "back" above can mean that an extra arraycopy call might be needed in the loop because the first copy will be one byte smaller.

I understand that it might help when there is just one byte to repeat because then the while-loop will be skipped. In all other situations it sounds like the special handling of the first byte would in theory be harmful. Note that I don't doubt your test results; I already saw with the CRC64 code that some changes in the code can affect performance in weird ways.

Your code needs

    if (back == bufSize)
        back = 0;

in the beginning of the while-loop and later checking for tmp > 0. My version avoids these branches by handling those cases under "if (back < 0)" (which is equivalent to "if (dist >= pos)"). On the other hand, under "if (back < 0)" all copies, including tiny copies, are done with arraycopy. Another tiny difference is that your code uses a left shift to double the copy size in the loop while I used Math.min(pos - back, left).

> I was able to improve this a bit by pulling the handling of small
> copies outside of the while loop. This eliminates the regressions
> compared to trunk, but still does not feel like an improvement over my
> last patch.

Yeah, the switch isn't worth it.
If I understand it correctly now, trying to avoid arraycopy for the tiny copies wasn't a useful idea in the first place. So the code can be simplified ("version 3"):

    int back = pos - dist - 1;
    if (back < 0) {
        // The distance wraps around to the end of the cyclic dictionary
        // buffer. We cannot get here if the dictionary isn't full.
        assert full == bufSize;
        back += bufSize;

        // Here we
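The snippet above is cut off in the archive. For reference, the technique it describes can be sketched as a self-contained class. This is illustrative code, not the committed XZ for Java implementation: the class name and the putByte helper are invented for the example, and the real LZDecoder's per-call output limit and pending-copy handling are omitted. The point is the copy-size doubling: because "back" is never advanced, each arraycopy can copy up to twice as many bytes as the previous one.

```java
// Sketch of "version 3": cyclic-dictionary repeat using arraycopy
// with doubling copy sizes. Names are illustrative; the real
// LZDecoder also tracks a per-call output limit, omitted here.
final class CyclicDictSketch {
    final byte[] buf;
    int pos = 0;  // next write position
    int full = 0; // number of valid bytes in the dictionary

    CyclicDictSketch(int size) { buf = new byte[size]; }

    void putByte(byte b) {
        buf[pos++] = b;
        if (full < pos)
            full = pos;
    }

    void repeat(int dist, int len) {
        if (dist < 0 || dist >= full)
            throw new IllegalArgumentException("invalid distance");

        int left = len;
        int back = pos - dist - 1;
        if (back < 0) {
            // The distance wraps around to the end of the cyclic
            // buffer. At most dist + 1 bytes are copied here, so the
            // copy never overlaps its own output and arraycopy is
            // always safe.
            back += buf.length;
            int copySize = Math.min(buf.length - back, left);
            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            back = 0;
            left -= copySize;
        }

        while (left > 0) {
            // The source range [back, back + copySize) ends exactly
            // where the destination starts, so the ranges never
            // overlap. Because "back" is not advanced, the available
            // source, and thus the copy size, doubles each iteration.
            int copySize = Math.min(left, pos - back);
            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            left -= copySize;
        }

        if (full < pos)
            full = pos;
    }
}
```

For example, writing "ab" and then calling repeat(1, 10) extends the buffer to "abababababab" in three arraycopy calls of 2, 4, and 4 bytes.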
Re: [xz-devel] xz-java and newer java
I quickly tried these with "XZEncDemo 2". I used the preset 2 because that uses LZMAEncoderFast instead of LZMAEncoderNormal where the negative lengths result in a crash. The performance was about the same or worse than the original code. I don't know why. I didn't spend much time on this and it's possible that I messed up something. One thing that may be worth checking out is how in HC4.java (and BT4.java too) the patch doesn't try to quickly skip too short matches like the original code does. I suppose the first set of patches should be such that they only replace the byte-by-byte loops with a function call to make comparison as fair as possible. These patches won't get into XZ for Java 1.9 but might be in a later version if I see them being/becoming good. The only remaining patch that might get into 1.9 is LZDecoder.repeat improvements. When you post a patch or other code, please make sure that word-wrapping is disabled in the email client or use attachments. Thanks! -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java array cache fill
On 2021-02-16 Brett Okken wrote:
> We found in LZDecoder that using System.arraycopy with doubling size
> is faster than Arrays.fill (especially for larger arrays). We can
> apply that knowledge in the BasicArrayCache, where there are some use
> cases which require clearing out the array prior to returning it.

A simple micro-benchmark gives me a very different result. The alternative method is roughly 70 % slower than Arrays.fill on my system with a big array. If Arrays.fill were so terrible, it should be improved instead. Even if the alternative method were faster, it would need to be a lot faster to be worth the extra complexity.

If the Arrays.fill version (uncomment/comment the code) is slower for you, it must depend on the Java runtime or operating system or such things.

    import java.util.Arrays;

    public class Foo {
        public static void main(String[] args) throws Exception {
            byte[] buf = new byte[10 << 20];

            for (int i = 0; i < 4000; ++i) {
                //Arrays.fill(buf, (byte)0);
                buf[0] = (byte)0;
                buf[1] = (byte)0;
                buf[2] = (byte)0;
                buf[3] = (byte)0;
                int toCopy = 4;
                int remaining = buf.length - toCopy;
                do {
                    System.arraycopy(buf, 0, buf, toCopy, toCopy);
                    remaining -= toCopy;
                    toCopy <<= 1;
                } while (remaining >= toCopy);
                if (remaining != 0) {
                    System.arraycopy(buf, 0, buf, toCopy, remaining);
                }
            }
        }
    }

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] jdk9+ CRC64
On 2021-02-13 Brett Okken wrote: > We can make it look even more like liblzma :) It can be done but I'm not sure yet if it should be done. Your implementation looks very neat though. :-) > In my benchmark I observe no negative impact of using the functions. > Which is to say that this is still 5-7% faster than the byte-by-byte > approach. With a dumb test with XZDecDemo, it seems faster than the current code (8.5 s vs. 7.9 s). However, if I misalign the buffer in XZDecDemo.java like this int size; while ((size = in.read(buf, 1, 8191)) != -1) System.out.write(buf, 1, size); then both versions are about as fast (7.9 s). The weird behavior with misaligned buffers was discussed earlier. My point is that if tiny things like buffer alignment can make as big a difference as supposedly better code, perhaps the explanation for the speed difference isn't the code being better but some side-effect that I don't understand. On your systems the results might differ significantly and more information is welcome. With the current information I think the possible benefit of the fancier code isn't worth it (bigger xz.jar, more code to maintain). In any case, any further CRC64 improvements will need to wait past the 1.9 release. The test file I used contains a repeating 257-byte pattern where each 8-bit value occurs at least once. It is extremely compressible and thus makes the differences in CRC64 speed as big as they can be with LZMA2. With real files the differences are smaller. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
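The repeating 257-byte test pattern mentioned above is easy to reproduce. The email does not give the exact pattern, so the construction below is an assumption for illustration: cycling i % 257 yields a 257-byte period (257 being prime, it never lines up with 256-value byte cycles) in which every 8-bit value occurs at least once.

```java
// One plausible construction of a repeating 257-byte test pattern in
// which every 8-bit value occurs at least once. This is a guess at
// the pattern described in the email, not the actual test file.
final class Pattern257 {
    static byte[] generate(int totalLen) {
        byte[] out = new byte[totalLen];
        for (int i = 0; i < totalLen; ++i)
            out[i] = (byte) (i % 257); // the value 256 wraps to 0
        return out;
    }
}
```

Because the data is a short repeating pattern, it is extremely compressible, so (as noted above) the CRC64 share of the total runtime is as large as it can get with LZMA2.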
Re: [xz-devel] Compatibility between CMake config file and FindLibLZMA.cmake
> I think the CMake build files also were not yet included in any
> official release.

CMakeLists.txt and friends were included in XZ Utils 5.2.5 (with the bug that the shared library doesn't build on Windows). It's described as experimental so in that sense it could be OK to change things.

> You can add an alias for target "liblzma" to target "LibLZMA" in the
> CMakeLists.txt file (after the target definition in add_library, line
> 193) for users that embed the xz project as a subdirectory:
> add_library(LibLZMA::LibLZMA ALIAS LibLZMA)
> add_library(liblzma ALIAS LibLZMA::LibLZMA)
> add_library(liblzma::liblzma ALIAS LibLZMA::LibLZMA)

If I change the main add_library(liblzma ) to add_library(LibLZMA ) then the filename will be LibLZMA.something too. That isn't good because then one cannot replace a CMake-built shared liblzma with an Autotools-built one on operating systems where file and library names are case sensitive.

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-05 Brett Okken wrote:
> I worked this out last night. We need to double how much we copy each
> time by not advancing "back". This actually works even better than
> Arrays.fill for the single byte case also.

This clearly is a good idea in a Java implementation. :-)

I still worry about short copies. If the file is full of tiny matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such files aren't typical at all but I don't want to add a corner case where the performance drops too much.

I came up with the following. I haven't decided yet if I like it.

    public void repeat(int dist, int len) throws IOException {
        if (dist < 0 || dist >= full)
            throw new CorruptedInputException();

        int left = Math.min(limit - pos, len);
        pendingLen = len - left;
        pendingDist = dist;

        int back = pos - dist - 1;
        if (back < 0) {
            // We won't get here if the dictionary isn't full.
            assert full == bufSize;

            // The distance wraps around to the end of the cyclic dictionary
            // buffer. Here we will never copy more than dist + 1 bytes
            // and so the copying won't repeat from its own output. Thus,
            // we can always use arraycopy safely.
            back += bufSize;
            int copySize = Math.min(bufSize - back, left);
            assert copySize <= dist + 1;

            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            back = 0;
            left -= copySize;

            if (left == 0)
                return;
        }

        assert back < pos;
        assert left > 0;

        do {
            // Determine the number of bytes to copy on this loop iteration:
            // copySize is set so that the source and destination ranges
            // don't overlap. If "left" is large enough, the destination
            // range will start right after the last byte of the source
            // range. This way we don't need to advance "back" which
            // allows the next iteration of this loop to copy (up to)
            // twice the number of bytes.
            int copySize = Math.min(left, pos - back);

            // With tiny copy sizes arraycopy is slower than a byte-by-byte
            // loop. With typical files the difference is tiny but with
            // unusual files this can matter more.
            if (copySize < 4) {
                int i = 0;
                do {
                    buf[pos + i] = buf[back + i];
                } while (++i < copySize);
            } else {
                System.arraycopy(buf, back, buf, pos, copySize);
            }

            pos += copySize;
            left -= copySize;
        } while (left > 0);

        if (full < pos)
            full = pos;
    }

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] jdk9+ CRC64
On 2021-02-06 Brett Okken wrote: > Since it is quite easy to read an int from a byte[] in jdk 9, the > CRC64 implementation can be optimized to operate on an int rather than > byte by byte as part of a multi-release jar. This shows to be 5-7% > faster in a microbenchmark of just the crc64 calculation. In jdk 11 it > speeds up the decompression of the repeating single byte by ~1%. To avoid byte swapping in the main loop on big endian systems, the lookup table would need to be big endian and operations need to be bitwise-mirrored too just like in liblzma. I'm not convinced yet that it's worth the extra effort and complexity for such a small speed gain. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
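Reading an int at a time implies a defined byte order. A minimal hand-rolled sketch of the little-endian read such a loop would consume is below (the class and method names are invented for the example; on JDK 9+ a byte-array view VarHandle with an explicit ByteOrder can do the same):

```java
// Little-endian 32-bit read from a byte[], written out by hand so the
// byte order is explicit. On big-endian hardware, a byte-order-aware
// (bitwise-mirrored) table as in liblzma would avoid swapping the
// bytes in the main loop.
final class ByteReader {
    static int readIntLE(byte[] b, int off) {
        return (b[off] & 0xFF)
                | ((b[off + 1] & 0xFF) << 8)
                | ((b[off + 2] & 0xFF) << 16)
                | (b[off + 3] << 24);
    }
}
```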
Re: [xz-devel] java LZMA2OutputStream changes
On 2021-02-05 Brett Okken wrote:
> > > Now that there is a 6 byte chunkHeader, could the 1 byte tempBuf
> > > be removed?
> >
> > It's better to keep it. It would be confusing to use the same
> > buffer in write(int) and writeChunk(). At a glance it would look
> > like writeChunk() could be overwriting the input.
>
> I assumed that lz.fillWindow(buf, off, len); would always process the
> 1 byte.

Yes, but it's not immediately obvious to a new reader. Also, many other classes have tempBuf for identical use so it's good to keep that pattern consistent.

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java crc64 implementation
On 2021-02-05 Brett Okken wrote: > On Fri, Feb 5, 2021 at 11:07 AM Lasse Collin > wrote: > > Also, does it really help to unroll the loop? With 8191-byte > > buffers I see no significant difference (in a quick > > not-very-accurate test) if the switch-statement is replaced with a > > while-loop. > > The differences are pretty minimal. My observation was switch a bit > faster than for loop, which was a bit faster than a while loop. But > the differences in averages were less than the confidence interval for > the given tests. OK, smaller code wins then. > > With these two changes the code becomes functionally identical to > > the version I posted with the name "Modified slicing-by-4". Is that > > an OK version to commit? > > Yes. OK. > > Is the following fine to you as the file header? Your email address > > can be omitted if you prefer that. I will mention in the commit > > message that you adapted the code from XZ Utils and benchmarked it. > > > > /* > > * CRC64 > > * > > * Authors: Brett Okken > > * Lasse Collin > > * > > * This file has been put into the public domain. > > * You can do whatever you want with this file. > > */ > > That is fine. You can include my e-mail. OK. :-) I have committed it. Thank you! The LZDecoder changes I may still look at before the next release. Then I will go back to the XZ Utils code. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java LZMA2OutputStream changes
On 2021-02-05 Brett Okken wrote:
> After recent changes, the LZMA2OutputStream class no longer uses
> DataOutputStream, but the import statement is still present.

Fixed. Thanks!

> Now that there is a 6 byte chunkHeader, could the 1 byte tempBuf be
> removed?

It's better to keep it. It would be confusing to use the same buffer in write(int) and writeChunk(). At a glance it would look like writeChunk() could be overwriting the input.

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java crc64 implementation
On 2021-02-02 Brett Okken wrote:
> Thus far I have only tested on jdk 11 64bit windows, but the fairly
> clear winner is:
>
>     public void update(byte[] buf, int off, int len) {
>         final int end = off + len;
>         int i = off;
>         if (len > 3) {
>             switch (i & 3) {
>             case 3:
>                 crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
>             case 2:
>                 crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
>             case 1:
>                 crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
>             }

To ensure (i & 3) == 0 when entering the main loop, the case-labels should be 1-2-3, not 3-2-1. This may have messed up your tests. :-( With a very quick test I didn't see much difference if I changed the case-label order.

On 2021-02-02 Brett Okken wrote:
> I tested jdk 15 64bit and jdk 11 32bit, client and server and the
> above implementation is consistently quite good. The alternate in
> running does not do the leading alignment. This version is really
> close in 64 bit testing and slightly faster for 32 bit. The
> differences are pretty small, and both are noticeably better than my
> original proposal (and all 3 are significantly faster than current).
> I think I would lean towards the simplicity of not doing the leading
> alignment, but I do not have a strong opinion.

Let's go with the simpler option.

>         switch (len & 3) {
>         case 3:
>             crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);

I suppose this should use the same (faster) array indexing style as the main loop:

    crc = TABLE[0][(buf[off++] & 0xFF) ^ ((int) crc & 0xFF)] ^ (crc >>> 8);

Also, does it really help to unroll the loop? With 8191-byte buffers I see no significant difference (in a quick not-very-accurate test) if the switch-statement is replaced with a while-loop.

With these two changes the code becomes functionally identical to the version I posted with the name "Modified slicing-by-4". Is that an OK version to commit?

Is the following fine to you as the file header? Your email address can be omitted if you prefer that. I will mention in the commit message that you adapted the code from XZ Utils and benchmarked it.

    /*
     * CRC64
     *
     * Authors: Brett Okken
     *          Lasse Collin
     *
     * This file has been put into the public domain.
     * You can do whatever you want with this file.
     */

Thanks!

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
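For readers following along, the overall shape of a slicing-by-4 CRC64 can be sketched as a self-contained class. This is illustrative code reconstructed from the discussion, not the committed XZ for Java file: it assumes the standard CRC-64/XZ parameters (reflected polynomial 0xC96C5795D7870F42, initial value and final XOR of all ones) and the common slicing-by-4 table construction.

```java
// Illustrative slicing-by-4 CRC-64/XZ sketch, reconstructed from the
// discussion above; not the committed XZ for Java code.
final class Crc64Sketch {
    private static final long POLY = 0xC96C5795D7870F42L; // reflected
    private static final long[][] TABLE = new long[4][256];

    static {
        for (int i = 0; i < 256; ++i) {
            long r = i;
            for (int j = 0; j < 8; ++j)
                r = (r >>> 1) ^ (POLY & -(r & 1));
            TABLE[0][i] = r;
        }
        // TABLE[n][b] is the CRC state after byte b and n zero bytes.
        for (int n = 1; n < 4; ++n)
            for (int i = 0; i < 256; ++i)
                TABLE[n][i] = (TABLE[n - 1][i] >>> 8)
                        ^ TABLE[0][(int) TABLE[n - 1][i] & 0xFF];
    }

    private long crc = -1L; // initial value: all ones

    void update(byte[] buf, int off, int len) {
        int i = off;
        final int end = off + len;

        // Main loop: fold four input bytes at a time into the CRC.
        while (end - i >= 4) {
            int w = ((buf[i] & 0xFF)
                    | ((buf[i + 1] & 0xFF) << 8)
                    | ((buf[i + 2] & 0xFF) << 16)
                    | (buf[i + 3] << 24)) ^ (int) crc;
            crc = TABLE[3][w & 0xFF]
                    ^ TABLE[2][(w >>> 8) & 0xFF]
                    ^ TABLE[1][(w >>> 16) & 0xFF]
                    ^ TABLE[0][w >>> 24]
                    ^ (crc >>> 32);
            i += 4;
        }

        // Trailing 0-3 bytes, byte by byte (a plain loop rather than
        // the unrolled switch, as chosen in the thread).
        while (i < end)
            crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
    }

    long getValue() {
        return ~crc; // final XOR: all ones
    }
}
```

With these parameters the standard check input "123456789" should give 0x995DC9BBDF1939FA, which makes the sketch easy to verify against other CRC-64/XZ implementations.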
Re: [xz-devel] xz-java minor read improvements
On 2021-02-03 Brett Okken wrote: > I have not done any testing of xz specifically, but was motivated by > https://github.com/openjdk/jdk/pull/542, which showed pretty > noticeable slowdown when biased locking is removed. The specific > example there was writing 1 byte at a time being transitioned to > writing the 2-8 bytes to a byte[] first, then writing that buffer to > the OutputStream. I suspect that reading would have similar impact. I don't doubt that. However, in XZ the uses of ByteArrayInputStream and ByteArrayOutputStream are in places where the performance could be absolutely horrible and it would still make little difference in overall speed. The amounts of data being read or written are so small. LZMAInputStream reads the whole file one byte at a time (via RangeDecoderFromStream.normalize()) and performance suffers compared to XZInputStream even if one uses BufferedInputStream. BufferedInputStream has synchronized read(). I don't know how much locking matters in this case. I'm not curious enough to try with a non-synchronized buffered input stream now. There are related comments in the "java buffer writes" thread. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
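For anyone who does want to try the experiment, an unsynchronized buffered wrapper is only a few lines. This is a hypothetical sketch, not anything in XZ for Java; it supports only the single-byte read() that RangeDecoderFromStream-style decoding needs (no mark/reset, no bulk read).

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical minimal unsynchronized buffered InputStream, sketched
// for the "non-synchronized buffered input stream" experiment
// mentioned above. Only read() and close() are implemented.
final class UnsyncBufferedInputStream extends InputStream {
    private final InputStream in;
    private final byte[] buf;
    private int pos = 0;
    private int end = 0;

    UnsyncBufferedInputStream(InputStream in, int bufSize) {
        this.in = in;
        this.buf = new byte[bufSize];
    }

    @Override
    public int read() throws IOException {
        if (pos >= end) {
            int n = in.read(buf, 0, buf.length);
            if (n <= 0)
                return -1; // end of stream; state left so we retry
            pos = 0;
            end = n;
        }
        return buf[pos++] & 0xFF;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```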
Re: [xz-devel] java buffer writes
On 2021-01-29 Brett Okken wrote: > There are several places where single byte writes are being done > during compression. Often this is going to an OutputStream with > synchronized write methods. Historically that has not mattered much > because of biased locking. However, biased locking is being > removed[1]. These changes will batch those writes up to a small > buffer. LZMA2OutputStream: I have committed a functionally similar patch. Thanks! BlockOutputStream: The ByteBuffer code replacing ByteArrayOutputStream is more complex than the original code. For example, manually resizing a buffer may be useful when performance is important but in this class performance doesn't matter. IndexEncoder: If there were a huge number of Blocks and thus Records, it would allocate memory to hold them all. It could be nicer to use something similar to BufferedOutputStream which would always use the same small amount of memory. java.io.BufferedOutputStream cannot be used because its close() and flush() methods call flush() on the underlying output stream and here it's counter-productive. The reading side in IndexDecoder and IndexHash could be similarly optimized to use a buffered input class that takes an argument to limit how many bytes it may read from the underlying InputStream. If the Index* classes are optimized, then the CRC32 writing in XZOutputStream, IndexEncoder, and BlockOutputStream may be worth optimizing too. It's important to keep in mind that these make no real difference if the application buffers the input or output with BufferedInputStream or BufferedOutputStream. In some use cases it may be impractical though, and then the small reads and writes may hurt if each read/write results in a syscall or even sending packets over network; such overheads can be much larger than locking. I put these optimizations in the "nice to have" category. Something could be done to make the code better but it's not urgent and so these won't be in the next release. 
-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
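The buffered output class described in the email above, one whose flush() empties its own buffer without propagating flush() to the underlying stream, could look roughly like this. It is a hypothetical sketch, not part of XZ for Java, and only the single-byte write path is shown.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical buffered OutputStream whose flush() passes buffered
// bytes downstream but never calls flush() on the underlying stream,
// matching the behavior wished for in the email above.
final class NonFlushingBufferedOutputStream extends OutputStream {
    private final OutputStream out;
    private final byte[] buf;
    private int count = 0;

    NonFlushingBufferedOutputStream(OutputStream out, int bufSize) {
        this.out = out;
        this.buf = new byte[bufSize];
    }

    @Override
    public void write(int b) throws IOException {
        if (count == buf.length)
            writeBuffered();
        buf[count++] = (byte) b;
    }

    private void writeBuffered() throws IOException {
        if (count > 0) {
            out.write(buf, 0, count);
            count = 0;
        }
    }

    @Override
    public void flush() throws IOException {
        // Empty our own buffer but do NOT call out.flush(); that is
        // the difference from java.io.BufferedOutputStream.
        writeBuffered();
    }

    @Override
    public void close() throws IOException {
        writeBuffered();
        out.close();
    }
}
```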
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-03 Brett Okken wrote: > On Wed, Feb 3, 2021 at 2:56 PM Lasse Collin > wrote: > > It seems to regress horribly if dist is zero. A file with a very > > long sequence of the same byte is good for testing. > > Would this be a valid test of what you are describing? [...] > The source is effectively 160MB of the same byte value. Yes, it's fine. > I found a strange bit of behavior with this case in the compression. > In LZMAEncoderNormal.calcLongRepPrices, I am seeing a case where > > int len2Limit = Math.min(niceLen, avail - len - 1); > > results in -1, (avail and len are both 8). This results in calling > LZEncoder.getMatchLen with a lenLimit of -1. Is that expected? I didn't check in detail now, but I think it's expected. There are two such places. A speed optimization was forgotten in liblzma from these two places because of this detail. I finally remembered to add the optimization in 5.2.5. On 2021-02-03 Brett Okken wrote: > I still need to do more testing across jdk 8 and 15, but initial > returns on this are pretty positive. The repeating byte file is > meaningfully faster than baseline. One of my test files (image1.dcm) > does not improve much from baseline, but the other 2 files do. The repeating byte is indeed much faster than the baseline. With normal files the speed seems to be about the same as the version I posted, so a minor improvement over the baseline. With a file with two-byte repeat ("ababababababab"...) it's 50 % slower than the baseline. Calling arraycopy in a loop, copying two bytes at a time, is not efficient. I didn't try look how big the copy needs to be to make the overhead of arraycopy smaller than the benefit but clearly it needs to be bigger than two bytes. The use of Arrays.fill to optimize the case of one repeating byte looks useful especially if it won't hurt performance in other situations. Still, I'm not sure yet if the LZDecoder optimizations should go in 1.9. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-01 Brett Okken wrote:
> I have played with this quite a bit and have come up with a slightly
> modified change which does not regress for the smallest of the sample
> objects and shows a nice improvement for the 2 larger files.

It seems to regress horribly if dist is zero. A file with a very long
sequence of the same byte is good for testing. The problem is that tmp
is almost always 1 and then each arraycopy call will copy exactly one
byte. The overhead is very high compared to doing the copying in a loop
like in the original code.

Below is a different version which is a little faster with Java 15 but
worse than the current simple code on Java 8 (tested on the same
computer and OS). The improvement over the current code is like 3-5 %
with Java 15, so not a lot but not insignificant either (such
optimizations add up). However, if the change is neutral or clearly
negative on Java 8, maybe this patch isn't worth the complexity yet.
Java 8 is still supported by its upstream.

Maybe you get different results. Make sure the uncompressed size of the
test files is several times larger than the dictionary size.

With the current knowledge I think this patch will need to wait past
XZ for Java 1.9.

diff --git a/src/org/tukaani/xz/lz/LZDecoder.java b/src/org/tukaani/xz/lz/LZDecoder.java
index 85b2ca1..8b3564c 100644
--- a/src/org/tukaani/xz/lz/LZDecoder.java
+++ b/src/org/tukaani/xz/lz/LZDecoder.java
@@ -92,14 +92,39 @@ public final class LZDecoder {
         pendingDist = dist;
 
         int back = pos - dist - 1;
-        if (dist >= pos)
+        if (dist >= pos) {
+            // We won't get here if the dictionary isn't full.
+            assert full == bufSize;
+
+            // The distance wraps around to the end of the cyclic
+            // dictionary buffer. Here we will never copy more than
+            // dist + 1 bytes and so the copying won't repeat from its
+            // own output. Thus, we can always use arraycopy safely.
             back += bufSize;
+            int copySize = Math.min(bufSize - back, left);
+            assert copySize <= dist + 1;
+
+            System.arraycopy(buf, back, buf, pos, copySize);
+            pos += copySize;
+            back = 0;
+            left -= copySize;
 
-        do {
-            buf[pos++] = buf[back++];
-            if (back == bufSize)
-                back = 0;
-        } while (--left > 0);
+            if (left == 0)
+                return;
+        }
+
+        assert left > 0;
+
+        if (left > dist + 1) {
+            // We are copying more than dist + 1 bytes and thus will
+            // partly copy from our own output.
+            do {
+                buf[pos++] = buf[back++];
+            } while (--left > 0);
+        } else {
+            System.arraycopy(buf, back, buf, pos, left);
+            pos += left;
+        }
 
         if (full < pos)
             full = pos;

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
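[Editor's note: the patch above is careful to use System.arraycopy only when the copy cannot read its own output. The reason is that arraycopy on overlapping ranges behaves as if the source were first copied to a temporary buffer, which is not the repeating behavior an LZ77 decoder needs. A small self-contained demonstration (the helper names are illustrative):]

```java
public class OverlapDemo {
    // Byte-by-byte copy with LZ77 semantics: may read its own output,
    // so a short distance repeats the recent bytes.
    public static String lzCopy(String s, int from, int to, int len) {
        byte[] b = s.getBytes();
        for (int i = 0; i < len; ++i)
            b[to + i] = b[from + i];
        return new String(b);
    }

    // The same copy via System.arraycopy, which on overlapping ranges
    // behaves as if the source were copied to a temporary buffer first.
    public static String arrayCopy(String s, int from, int to, int len) {
        byte[] b = s.getBytes();
        System.arraycopy(b, from, b, to, len);
        return new String(b);
    }

    public static void main(String[] args) {
        // LZ77: copying 6 bytes from distance 1 repeats the last byte.
        System.out.println(lzCopy("ab______", 1, 2, 6));    // abbbbbbb
        // arraycopy gives a different (non-repeating) result.
        System.out.println(arrayCopy("ab______", 1, 2, 6)); // abb_____
    }
}
```

This is why the patch only calls arraycopy on the branches where at most dist + 1 bytes are copied, and keeps the byte-by-byte loop for the self-overlapping case.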
Re: [xz-devel] java crc64 implementation
I assume you accidentally didn't post to the list so I'm quoting your
email in full.

On 2021-02-02 Brett Okken wrote:
> > while ((i & 3) != 1 && i < end)
>
> Shouldn't that be (i & 3) != 0?
> An offset of 0 should not enter this loop, but 0 & 3 does not equal 1.

The idea really is that an offset of 1 doesn't enter the loop, thus the
main slicing-by-4 loop is misaligned. I don't know why it makes a
difference and I'm no longer even sure why I decided to try it. You can
try the different (i & 3) != { 0, 1, 2, 3 } combinations.

> > If I change the buffer size from 8192 to 8191 in XZDecDemo.java,
> > then "Modified slicing-by-4" somehow becomes as fast as the
> > "Misaligned slicing-by-4". On the surface it sounds weird because
> > the buffer still has the same alignment, it's just one byte smaller
> > at the end.
>
> My guess is that this has to do with how many while loops need to be
> executed/optimized.
> Making it one byte smaller guarantees one of the additional while
> loops actually has to execute. Depending on the initial offset,
> potentially both need to execute.

Maybe you are right, but the confusing thing is that those while-loops
are supposedly slower than the for-loop. :-)

> > It would be nice if you could compare these too and suggest what
> > should be committed. Maybe you can figure out an even better
> > version. Different CPU or 32-bit Java or other things may give
> > quite different results.
>
> Truncating the crc to an int 1 time in the loop seems like a clear
> winner. I will play with this in my benchmark.
> My benchmark is calculating the crc64 of 8k of random bytes. I will
> change it to include misaligned read as well.

Thanks.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] xz-java minor read improvements
On 2021-01-29 Brett Okken wrote:
> Here are some small improvements when creating new BlockInputStream
> instances. This reduces the size of the byte[] for the block header to
> the actual size

I committed this part. Thanks!

> and replaces use of ByteArrayInputStream, which has synchronized
> methods, with a ByteBuffer, which provides the same functionality
> without synchronization.

Hmm, it sounds good but I don't like that decodeVLI needs to be
duplicated. The performance of header decoding in BlockInputStream is
fairly unimportant; the performance bottlenecks are elsewhere. Keeping
the code tidy matters more.

Obviously one could wrap a ByteBuffer into an InputStream, or one could
change IndexHash.java and IndexDecoder.java to work with something
else. Those Index* classes might be reading from an InputStream that
has a high read()-call overhead for reasons other than locking
(although in such cases the application could then be using
BufferedInputStream). Unless you have a practical situation in mind
where these optimizations make a measurable difference, it's best to
not make them more complex than they are.

By the way, I committed module-info.java support as a multi-release
JAR, so multi-release can be used for other things too.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
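[Editor's note: a hedged sketch of the idea in the quoted patch. Wrapping a byte[] in a ByteBuffer gives positional reads without the per-call locking of ByteArrayInputStream's synchronized methods. The decodeVLI below follows the .xz variable-length integer encoding (7 bits per byte, high bit set means another byte follows); it is an illustration, not the library's actual code.]

```java
import java.nio.ByteBuffer;

public class VLISketch {
    // Decode one .xz-style VLI from the buffer's current position.
    public static long decodeVLI(ByteBuffer in) {
        long value = 0;
        for (int shift = 0; shift < 63; shift += 7) {
            int b = in.get() & 0xFF;          // unsynchronized read
            value |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)              // high bit clear: done
                return value;
        }
        throw new IllegalArgumentException("VLI too long");
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {
                0x05,                       // one-byte VLI: 5
                (byte)0x80, 0x01,           // two-byte VLI: 128
                (byte)0xFF, 0x7F });        // two-byte VLI: 16383
        System.out.println(decodeVLI(buf)); // 5
        System.out.println(decodeVLI(buf)); // 128
        System.out.println(decodeVLI(buf)); // 16383
    }
}
```

As the message notes, the gain is small for header decoding; the sketch only shows why the ByteBuffer variant avoids synchronization, not that it is worth the duplication.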
Re: [xz-devel] java crc64 implementation
Hello!

I need to make a new release in the near future so that a minor problem
in .7z support in Apache Commons Compress can be fixed. I thought I
could include the simpler and safer changes from your long list of
patches, and the CRC64 improvement might be such a change.

On 2021-01-21 Brett Okken wrote:
> Here is a slice by 4 implementation. It goes byte by byte to easily be
> compatible with older jdks. Performance wise, it is pretty comparable
> to the java port of Adler's stackoverflow implementation:
>
> Benchmark                Mode  Cnt      Score     Error  Units
> Hash64Benchmark.adler    avgt    5   6850.172 ± 251.528  ns/op
> Hash64Benchmark.crc64    avgt    5  16347.986 ±  53.702  ns/op
> Hash64Benchmark.slice4   avgt    5   6842.010 ± 393.149  ns/op

Thank you! I played around a bit. Seems that the code is *really*
sensitive to tiny changes. It's possible that it depends on the
computer and such things too; I only tried on one machine.

I timed decompression of a gigabyte of null bytes using XZDecDemo and
OpenJDK 15 on x86-64. This isn't very accurate but it's enough to sort
them:

    Original                 6.8 s
    Modified original        6.2 s
    Your slicing-by-4        5.8 s
    Modified slicing-by-4    5.6 s
    Misaligned slicing-by-4  5.2 s
    xz -t                    3.6 s

Modified original:

--- a/src/org/tukaani/xz/check/CRC64.java
+++ b/src/org/tukaani/xz/check/CRC64.java
@@ -38,7 +38,8 @@ public class CRC64 extends Check {
         int end = off + len;
 
         while (off < end)
-            crc = crcTable[(buf[off++] ^ (int)crc) & 0xFF] ^ (crc >>> 8);
+            crc = crcTable[(buf[off++] & 0xFF) ^ ((int)crc & 0xFF)]
+                  ^ (crc >>> 8);
     }
 
     public byte[] finish() {

Modified slicing-by-4:

    public void update(byte[] buf, int off, int len) {
        final int end = off + len;
        int i = off;

        for (int end4 = end - 3; i < end4; i += 4) {
            final int tmp = (int)crc;
            crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^
                  TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0xFF)] ^
                  (crc >>> 32) ^
                  TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0xFF)] ^
                  TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0xFF)];
        }

        while (i < end)
            crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)]
                  ^ (crc >>> 8);
    }

Misaligned slicing-by-4 adds an extra while-loop to the beginning:

    public void update(byte[] buf, int off, int len) {
        final int end = off + len;
        int i = off;

        while ((i & 3) != 1 && i < end)
            crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)]
                  ^ (crc >>> 8);

        for (int end4 = end - 3; i < end4; i += 4) {
            final int tmp = (int)crc;
            crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^
                  TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0xFF)] ^
                  (crc >>> 32) ^
                  TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0xFF)] ^
                  TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0xFF)];
        }

        while (i < end)
            crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)]
                  ^ (crc >>> 8);
    }

If I change the buffer size from 8192 to 8191 in XZDecDemo.java, then
"Modified slicing-by-4" somehow becomes as fast as the "Misaligned
slicing-by-4". On the surface it sounds weird because the buffer still
has the same alignment, it's just one byte smaller at the end. The same
thing happens too if the buffer size is kept at 8192 but the first byte
isn't used (making the beginning of the buffer misaligned). Moving the
"(crc >>> 32)" to a different position in the xor sequence can affect
things too... it's almost spooky. ;-)

It would be nice if you could compare these too and suggest what should
be committed. Maybe you can figure out an even better version.
Different CPU or 32-bit Java or other things may give quite different
results.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Compatibility between CMake config file and FindLibLZMA.cmake
On 2021-01-23 Markus Rickert wrote:
> This could be solved by adding an alias to the config file:
> add_library(LibLZMA::LibLZMA ALIAS liblzma::liblzma)
>
> An additional improvement would be to enable this on case-sensitive
> file systems as well. For this, the config file would need to be
> renamed from liblzmaConfig.cmake to liblzma-config.cmake (and the
> version file to liblzma-config-version.cmake), see [2].

I have committed both of your suggestions (hopefully correctly).
Thanks!

Some extra thoughts: There are some differences between FindLibLZMA and
the config file:

  - FindLibLZMA doesn't #define LZMA_API_STATIC when building against
    static liblzma. LZMA_API_STATIC omits __declspec(dllimport) from
    liblzma function declarations on Windows.

  - FindLibLZMA sets a few CMake cache variables that the config file
    doesn't, for example, LIBLZMA_HAS_EASY_ENCODER. I have no idea if
    there are packages that care about this.

  - The config file has find_dependency(Threads) while FindLibLZMA
    doesn't. This can affect the linker flags.

Perhaps there are other details affecting compatibility. I just wonder
how big a mistake it was to use liblzma::liblzma in the config file. I
guess it's too late to change it now.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] [RFC 2/2] Add xxHash, XX3 (128bit) for hashing.
On 2021-01-20 Sebastian Andrzej Siewior wrote:
> On 2021-01-20 00:37:06 [+0100], Sebastian Andrzej Siewior wrote:
> > So this is better than crc64 and close to none while doing
> > something ;)
>
> xz -tv -T0 with crc64 reports:
>   100 %      10,2 GiB / 40,0 GiB = 0,255   1,1 GiB/s       0:35
> and the same archive with xxh3:
>   100 %      10,2 GiB / 40,0 GiB = 0,255   1,1 GiB/s       0:34
>
> which looks like it is not worth the trouble.

If there were a fast algorithm in .xz, then it would be worth the
trouble. Having such an algorithm was in the early plans, but so were a
few other nice things, and many never materialized.

I will look at the SHA-256 patch later. There are unusually many things
in the queue of XZ-related things.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder
Hello!

I haven't made much progress with this still, I'm sorry. :-( Below are
comments about a few small details. It's not much but I will (slowly)
keep reading and testing. I applied the outq patch too. The performance
numbers you posted looked promising.

(1) Segfault due to thr->outbuf == NULL

I changed CHUNK_SIZE to 1 to test corner cases. I used
good-1-block_header-1.xz as the test file. It can segfault in
worker_decoder() on the line calling thr->block_decoder.code(...)
because thr->outbuf is NULL (so the problem was introduced in the outq
patch). This happens because of "thr->outbuf = NULL;" later in the
function. It looks like it marks the outbuf finished and returns the
thread to the pool too early, or forgets to set thr->state = THR_IDLE.
As a temporary workaround, I added "thr->state = THR_IDLE;" after
"thr->outbuf = NULL;".

(2) Block decoder must return LZMA_STREAM_END on success

Because of the end marker and integrity check, the output buffer will
be full before the last bytes of input have been processed by the Block
decoder. Thus it is not enough to look at the input and output
positions to determine when decoding has finished; only
LZMA_STREAM_END should be used to determine that decoding was
successful. In theory it is OK to mark the outbuf as finished once the
output is full, but for simplicity I suggest doing so (and returning
the thread to the pool) only after LZMA_STREAM_END.

I committed a new test file bad-1-check-crc32-2.xz. The last byte in
the Block (the last byte of Check) is wrong. Change CHUNK_SIZE to 1 and
try "xz -t -T2 bad-1-check-crc32-2.xz". The file must be detected to be
corrupt (LZMA_DATA_ERROR).

(3) Bad input where the whole input or output buffer cannot be used

In the old single-threaded decoding, lzma_code() will eventually return
LZMA_BUF_ERROR if the calls to lzma_code() cannot make any progress,
that is, no more input is consumed and no more output is produced.

This condition can happen with correct code if the input file is
corrupt in a certain way, for example, a truncated .xz file. Since the
no-progress detection is centralized in lzma_code(), the internal
decoders including the Block decoder don't try to detect this
situation. Currently this means that worker_decoder() should detect it
to catch bad input and prevent hanging on certain malformed Blocks.
However, since the Block decoder knows both Compressed Size and
Uncompressed Size, I think I will improve the Block decoder instead, so
don't do anything about this for now.

I committed two test files, bad-1-lzma2-9.xz and bad-1-lzma2-10.xz. The
-9 may make worker_decoder() not notice that the Block is invalid. The
-10 makes the decoder hang. Like I said, I might fix these by changing
the Block decoder.

(4) Usage of partial_update in worker_decoder()

Terminology: the main mutex means coder->mutex, alias
thr->coder->mutex.

In worker_decoder(), the main mutex is locked every time there is new
output available in the worker thread. partial_update is only used to
determine when to signal thr->coder->cond. To reduce contention on the
main mutex, worker_decoder() could lock it only when

  - decoding of the Block has been finished (successfully or
    unsuccessfully, that is, ret != LZMA_OK), or

  - there is new output available and partial_update is true; if
    partial_update is false, thr->outbuf->pos is not touched.

This way only one worker will be frequently locking the main mutex.
However, I haven't tried it and thus don't know how much this affects
performance in practice. One possible problem might be that it may
introduce a small delay in output availability when the main thread
switches to reading from the next outbuf in the list.

(5) Use of mythread_condtime_set()

In the encoder the absolute time is calculated once per lzma_code()
call. The comment in wait_for_work() in stream_encoder_mt.c was wrong.
The reason the absolute time is calculated once per lzma_code() call is
to ensure that blocking multiple times won't make the timeout
ineffective if each blocking takes less than timeout milliseconds. So
it should be done similarly in the decoder.

(6) Use of lzma_outq_enable_partial_output()

It should be safe to call it unconditionally:

    if (thr->outbuf == coder->outq.head)
        lzma_outq_enable_partial_output(&coder->outq,
                                        thr_do_partial_update);

If outq.head is something else, it is either already finished or
partial output has already been enabled. In both cases
lzma_outq_enable_partial_output() will do nothing.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
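[Editor's note: a small sketch of the timeout rule described in point (5). Computing one absolute deadline per call means that several successive blocking waits share a single budget, so they cannot together exceed the timeout even if each individual wait is short. The helper name is illustrative; the real code uses mythread_condtime_set() in C.]

```java
public class DeadlineDemo {
    // Given one absolute deadline, each wait may only use what is
    // left of the overall budget (never a fresh full timeout).
    public static long remainingMillis(long deadline, long now) {
        return Math.max(0, deadline - now);
    }

    public static void main(String[] args) {
        long deadline = 1000;   // absolute time, computed once per call
        // Successive waits at t=0, t=300, t=800 get shrinking budgets
        // instead of 1000 ms each:
        System.out.println(remainingMillis(deadline, 0));    // 1000
        System.out.println(remainingMillis(deadline, 300));  // 700
        System.out.println(remainingMillis(deadline, 800));  // 200
        System.out.println(remainingMillis(deadline, 1200)); // 0
    }
}
```

Recomputing a relative timeout before every wait would instead allow the total blocking time to grow without bound, which is the bug the message describes.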
Re: [xz-devel] java crc64 implementation
On 2021-01-13 Brett Okken wrote:
> Mark Adler has posted an optimized crc64 implementation on
> stackoverflow[1]. This can be reasonably easily ported to java (that
> post has a link to a java impl on github[2] which warrants a little
> clean up, but gives a decent idea).
>
> I did a quick benchmark calculating the crc64 over 8KB and the results
> were impressive:
>
> Benchmark              Mode  Cnt      Score   Error  Units
> Hash64Benchmark.adler  avgt    5   6908.677 ± 47.790  ns/op
> Hash64Benchmark.crc64  avgt    5  16343.091 ± 64.089  ns/op

The CRC64 implementation in XZ for Java is indeed a basic version. I
wanted to keep things simple in the beginning and didn't think about it
much later, since the Java version of XZ is slower than the C version
for other reasons anyway.

In XZ Utils, the slicing-by-4 method is used for CRC64 and slicing-by-8
for CRC32. A reason for not using by-8 for CRC64 is to reduce CPU L1
cache usage: by-4 with CRC64 needs an 8 KiB lookup table, by-8 needs
16 KiB. Micro-benchmarking with a big table can look good, but when the
CRC is just a small part of the application the results are more
complicated (more cache misses to load the bigger table, more other
data pushed out of the cache). It is essential to note that the
decisions about table sizes were made over a decade ago with 32-bit
CPUs, and it's very much possible that different decisions would be
better nowadays.

The version by Mark Adler [1] uses slicing-by-8 with CRC64. It also
includes a method to combine the CRC values of two blocks, which is
great if one uses threads to compute a CRC. A threaded CRC doesn't
sound useful with XZ since LZMA isn't that fast anyway.

A side note: GNU gzip uses the basic method for CRC32 [3] while zlib
uses slicing-by-8. Since Deflate is fast to decode, replacing the CRC32
in GNU gzip would make a clear difference in decompression speed.

[3] http://git.savannah.gnu.org/cgit/gzip.git/tree/util.c#n126

> [1] -
> https://stackoverflow.com/questions/20562546/how-to-get-crc64-distributed-calculation-use-its-linearity-property/20579405#20579405
>
> [2] -
> https://github.com/MrBuddyCasino/crc-64/blob/master/crc-64/src/main/java/net/boeckling/crc/CRC64.java

I didn't find license information in the [2] repository. XZ for Java is
public domain so the license likely wouldn't match anyway. Porting from
XZ Utils shouldn't be too hard, depending on how much one wishes to
optimize it:

  - src/liblzma/check/crc64_fast.c
  - src/liblzma/check/crc_macros.h
  - src/liblzma/check/crc64_tablegen.c (or should it just include
    pre-computed tables like liblzma and zlib do?)

Unlike the C version in [1], the Java version in [2] reads the input
byte[] array byte by byte. Using a fast method to read 8 *aligned*
bytes at a time in native byte order should give more speed; after all,
it's one of the benefits of this method that one can read multiple
input bytes at a time.

A public domain patch for a faster CRC64 for XZ for Java is welcome.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
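[Editor's note: for reference, a minimal self-contained sketch of the "basic version" of CRC64 the messages above start from: the table-driven byte-at-a-time method with the reflected ECMA-182 polynomial, all-ones initial value and final XOR, as used by .xz. The slicing-by-N variants build larger tables from the same recurrence. This is an illustration, not the library's code.]

```java
public class CRC64Basic {
    // Reflected form of the ECMA-182 polynomial used by .xz.
    static final long POLY = 0xC96C5795D7870F42L;
    static final long[] TABLE = new long[256];

    static {
        // One table entry per input byte value: eight shift/XOR rounds.
        for (int b = 0; b < 256; ++b) {
            long r = b;
            for (int i = 0; i < 8; ++i)
                r = (r >>> 1) ^ (POLY & -(r & 1));
            TABLE[b] = r;
        }
    }

    public static long crc64(byte[] buf, int off, int len) {
        long crc = ~0L;                       // initial value: all ones
        for (int i = off; i < off + len; ++i)
            crc = TABLE[(buf[i] ^ (int)crc) & 0xFF] ^ (crc >>> 8);
        return ~crc;                          // final XOR: all ones
    }

    public static void main(String[] args) {
        // Standard check value for CRC-64/XZ: "123456789".
        long crc = crc64("123456789".getBytes(), 0, 9);
        System.out.println(Long.toHexString(crc)); // 995dc9bbdf1939fa
    }
}
```

The inner update line is exactly the one the "Modified original" patch earlier in the thread rewrites for speed; slicing-by-4 then processes four such steps per loop iteration from four precomputed tables.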
Re: [xz-devel] xz-java and newer java
On 2021-01-11 Brett Okken wrote:
> I threw together a quick jmh test, and there is no value in the
> changes to Hash234.

OK, let's forget that then.

On 2021-01-16 Brett Okken wrote:
> I have found a way to use VarHandle byte array access at runtime in
> code which is compile time compatible with jdk 7. So here is an
> updated ArrayUtil class which will use a VarHandle to read long values
> in jdk 9+. If that is not available, it will attempt to use
> sun.misc.Unsafe. If that cannot be found, it falls back to standard
> byte by byte comparison.

Sounds promising. :-) You have already done quite a bit of work in both
writing code and benchmarking. Thank you!

The method you ended up with is similar to
src/liblzma/common/memcmplen.h in XZ Utils. There the 8-byte version is
used on 64-bit systems and the 4-byte version on 32-bit systems. In XZ
Utils, the SSE2 version (16-byte comparison) is faster than the 4-byte
compare on 32-bit x86, but on x86-64 the 8-byte version has similar
speed or is faster than the SSE2 version (it depends on the CPU).

Have you tested with 32-bit Java too? It's quite possible that it's
better to use ints than longs on a 32-bit system. If so, that should be
detected at runtime too, I guess.

In XZ Utils the arrays have extra room at the end so that memcmplen.h
can always read 4/8/16 bytes at a time. Since this is easy to do, I
think it should be done in XZ for Java too to avoid special handling of
the last bytes.

> I did add an index bounds check for the unsafe implementation and
> found it had minimal impact on over all performance.

Since Java in general is memory safe, having bounds checks with Unsafe
is nice as long as it doesn't hurt performance too much. This

    if (aFromIndex < 0 || aFromIndex + length > a.length
            || bFromIndex < 0 || bFromIndex + length > b.length) {

is a bit relaxed though since it doesn't catch integer overflows.
Something like this would be more strict:

    if (length < 0
            || aFromIndex < 0 || aFromIndex > a.length - length
            || bFromIndex < 0 || bFromIndex > b.length - length) {

> Using VarHandle (at least on jdk 11) offers very similar performance
> to Unsafe across all 3 files I used for benchmarking.

OK. I cannot comment on the details much because I'm not familiar with
either API for now.

Comparing byte arrays as ints or longs results in unaligned/misaligned
memory access. The MethodHandles.byteArrayViewVarHandle docs say that
this is OK. A quick web search gave me the impression that it might not
be safe with Unsafe though. Can you verify how it is with Unsafe? If it
isn't allowed, dropping support for Unsafe may be fine. It's just the
older Java versions that would use it anyway.

It is *essential* that the code works well also on archs that don't
have fast unaligned access. Even if the VarHandle method is safe, it's
not clear how the performance is on archs that don't support fast
unaligned access. It would be bad to add an optimization that is good
on x86-64 but counter-productive on some other archs. One may need
arch-specific code just like there is in XZ Utils, although on the
other hand it would be nice to keep the Java code less complicated. Do
you have a way to check how these methods behave on Android and ARM? (I
understand that this might be too much work to check. This may be
skipped.)

I wish to add module-info.java in the next release. Do these new
methods affect what should be in module-info.java? With the current
code this seems to be enough:

    module org.tukaani.xz {
        exports org.tukaani.xz;
    }

> final int leadingZeros = (int)LEADING_ZEROS.invokeExact(diff);
> return i + (leadingZeros / Byte.SIZE);

Seems that Java might not optimize that division to a right shift. It
could be better to use "leadingZeros >>> 3".

> I know you said you were not going to be able to work on xz-java for
> awhile, but given these benchmark results, which really exceeded my
> expectations, could this get some priority to release?

I understood that it's 9-18 % faster. That is significant, but it's
still a performance optimization only, not an important bug fix, and to
me the code doesn't feel completely ready yet (for example, the
unaligned access is important to get right). (Compare to the threaded
decompression support that is coming to XZ Utils. It will speed things
up a few hundred percent.)

Can you provide a complete patch to make testing easier (or if that's
not possible, complete copies of the modified files)? Also, please try
to wrap the lines so that they stay within 80 columns (with some long
unbreakable strings this may not be possible; then those lines can be
overlong instead of messing up the indentation).

I think your patch will find its way into XZ for Java in some form,
but once again I repeat that it will take some time. These XZ projects
are only a hobby for me and currently I don't even turn on my computer
every day.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
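[Editor's note: a hedged sketch of the 8-bytes-at-a-time match length idea discussed above, in the spirit of memcmplen.h, using the VarHandle byte-array view available since Java 9. Little-endian order is assumed here so the first differing byte maps to the low bits of the XOR (Long.numberOfTrailingZeros); a big-endian variant would use leading zeros instead, as in the quoted LEADING_ZEROS snippet. Names are illustrative, not the actual ArrayUtil code.]

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class MatchLen {
    static final VarHandle LONGS = MethodHandles.byteArrayViewVarHandle(
            long[].class, ByteOrder.LITTLE_ENDIAN);

    // Length of the common prefix of a[aOff..] and b[bOff..], at most
    // limit. Compares 8 bytes at a time, then falls back to bytes.
    public static int matchLen(byte[] a, int aOff,
                               byte[] b, int bOff, int limit) {
        int len = 0;
        while (len + 8 <= limit) {
            long x = (long)LONGS.get(a, aOff + len);
            long y = (long)LONGS.get(b, bOff + len);
            if (x != y)
                // Low byte is the first byte in little-endian order.
                return len + (Long.numberOfTrailingZeros(x ^ y) >>> 3);
            len += 8;
        }
        while (len < limit && a[aOff + len] == b[bOff + len])
            ++len;
        return len;
    }

    public static void main(String[] args) {
        byte[] a = "abcdefghijXlmnop".getBytes();
        byte[] b = "abcdefghijklmnop".getBytes();
        System.out.println(matchLen(a, 0, b, 0, 16)); // 10
    }
}
```

Note the `>>> 3` instead of `/ Byte.SIZE`, per the remark above, and that padding the arrays (as XZ Utils does) would let the 8-byte loop run to the end without the byte-by-byte tail.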
Re: [xz-devel] [PATCH] xz: Fix setting memory limit on 32-bit systems
On 2021-01-10 Sebastian Andrzej Siewior wrote:
> I hope for sane defaults :)

I hope so too. So far I have felt that the suggested solutions have
significant flaws or downsides, and I'm not able to see what is a good
enough compromise. As a result the discussion hasn't progressed much
and I feel it's partly my fault, sorry. I will try again:

I have understood that *in practice* the problem with the xz command
line tool is limited to "xz -T0" usage, so fixing this use case is
enough for most people. Please correct me if I missed something.

The change in XZ Utils 5.2.5 helps a little with 32-bit xz running
under a 64-bit kernel, but only if one specifies a memory usage limit
like -M90% together with -T0. To make plain -T0 work too, in an earlier
email I suggested that -T0 could also imply a memory usage limit if no
limit was otherwise specified (a preliminary patch was included too). I
have been hesitant to make changes to the defaults of the memory usage
limiter, but this solution would only affect a very specific situation
and thus I feel it would be fine. Comments would be appreciated.

The problem with applications using liblzma and running out of address
space sounds harder to fix. As I explained in another email, making
liblzma more robust against memory allocation failures is not a perfect
fix and can still result in severe problems depending on how the
application as a whole works (with some apps it could be enough).

An alternative "fix" for the liblzma case could be adding a simple API
function that would scale down the number of threads in a lzma_mt
structure based on a memory usage limit and on whether the application
is 32 bits. Currently the thread count and LZMA2 settings adjusting
code is in xz, not in liblzma.

> Anyway. Not to overcomplicate things: On Linux you can obtain the
> available system memory which I would cap to 2 or 2.5 GiB by default.
> Nobody should be hurt by that.

If the full 4 GiB of address space is available, capping to 2 or
2.5 GiB when the available memory isn't known would mean fewer threads
than with the 4020 MiB limit. Obviously this is less bad than failing
due to running out of address space, but it still makes me feel that if
available memory is used on Linux, the idea should be ported to other
OSes too.

The idea for the current 4020 MiB special limit is based on a patch
that was in use in FreeBSD to solve the problem of 32-bit xz on a
64-bit kernel. So at least FreeBSD should be supported to not make
32-bit xz worse under a 64-bit FreeBSD kernel.

In liblzma, if a new function is added to reduce the thread count based
on a memory usage limit, capping the limit to 2-3 GiB for 32-bit
applications could be fine even if there is more available memory.
Being conservative means fewer threads, but it would make it more
likely that things keep working if the application allocates memory
after liblzma has already done so.

Oh well. :-( I think I still made this sound like a mess. In any case,
let's at least try to find some solution to the "xz -T0" case. It would
be nice to hear if my suggestion makes any sense. Thanks.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
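[Editor's note: a hedged sketch of the "-T0 implies a memory limit" idea from the message above: start from the CPU count, then scale the worker count down so that threads times per-thread memory stays under the limit, never dropping below one thread. The numbers and names are illustrative, not xz's actual thread-count logic.]

```java
public class ThreadCap {
    // Pick a thread count from the CPU count, capped by the memory
    // usage limit; at least one thread is always allowed.
    public static int threads(int cpus, long memLimit, long memPerThread) {
        return Math.max(1, (int)Math.min(cpus, memLimit / memPerThread));
    }

    public static void main(String[] args) {
        long mib = 1024 * 1024;
        // 8 CPUs, a 4020 MiB cap, a hypothetical ~700 MiB per thread:
        // the memory limit, not the CPU count, decides (5 threads).
        System.out.println(threads(8, 4020 * mib, 700 * mib));
        // A tiny limit still leaves one (possibly over-limit) thread,
        // mirroring how xz degrades settings rather than refusing.
        System.out.println(threads(8, 100 * mib, 700 * mib));
    }
}
```

The real adjustment also shrinks LZMA2 settings when even one thread exceeds the limit, which this sketch deliberately leaves out.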