[xz-devel] XZ Utils 5.2.13, 5.4.7, and 5.6.2
--enable-small. (CMake build doesn't support ENABLE_SMALL in
      XZ Utils 5.2.x.)

    * xz:

        - Fix a C standard conformance issue in --block-list parsing
          (arithmetic on a null pointer).

        - Fix a warning from GNU groff when processing the man page:
          "warning: cannot select font 'CW'"

        - Windows: Handle special files such as "con" or "nul". Earlier
          the following wrote "foo" to the console and deleted the
          input file "con_xz":

              echo foo | xz > con_xz
              xz --suffix=_xz --decompress con_xz

        - Windows: Fix an issue that prevented reading from or writing
          to non-terminal character devices like NUL.

    * xzless:

        - With "less" version 451 and later, use "||-" instead of "|-"
          in the environment variable LESSOPEN. This way compressed
          files that contain no uncompressed data are shown correctly
          as empty.

        - With "less" version 632 and later, use --show-preproc-errors
          to make "less" show a warning on decompression errors.

    * Build systems:

        - Add a new line to liblzma.pc for MSYS2 (Windows):

              Cflags.private: -DLZMA_API_STATIC

          When compiling code that will link against static liblzma,
          the LZMA_API_STATIC macro needs to be defined on Windows.

        - Autotools (configure):

            * The symbol versioning variant can now be overridden with
              --enable-symbol-versions. Documentation in INSTALL was
              updated to match.

        - CMake:

            * Fix a bug that prevented other projects from including
              liblzma multiple times using find_package().

            * Fix a bug where configuring CMake multiple times resulted
              in HAVE_CLOCK_GETTIME and HAVE_CLOCK_MONOTONIC not being
              defined.

            * Fix the build with MinGW-w64-based Clang/LLVM 17.
              llvm-windres now has more accurate GNU windres emulation,
              so the GNU windres workaround from 5.4.1 is needed with
              llvm-windres version 17 too.

            * The import library on Windows is now properly named
              "liblzma.dll.a" instead of "libliblzma.dll.a".

            * Add large file support by default for platforms that need
              it to handle files larger than 2 GiB. This includes
              MinGW-w64, even 64-bit builds.

            * Linux on MicroBlaze is handled specially now.
              This matches the changes made to the Autotools-based
              build in XZ Utils 5.4.2 and 5.2.11.

            * Disable symbol versioning on non-glibc Linux to match
              what the Autotools build does. For example, symbol
              versioning isn't enabled with musl.

            * The symbol versioning variant can now be overridden by
              setting SYMBOL_VERSIONING to "OFF", "generic", or
              "linux".

    * Documentation:

        - Clarify the description of --disable-assembler in INSTALL.
          The option only affects 32-bit x86 assembly usage.

        - Don't install the TODO file as part of the documentation.
          The file is out of date.

        - Update home page URLs back to their old locations on
          tukaani.org.

        - Update maintainer info.

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-03-20 Brett Okken wrote:
> The jdk8 changes show nice improvements over head. My assumption is
> that with less math going on in the offsets of the while loop allowed
> the jvm to better optimize.

Sounds good, thanks! :-)

> I am surprised with the binary math behind your handling of long
> comparisons here:

I had to refresh my memory as I hadn't commented it in memcmplen.h.
Now it is (based on Agner Fog's microarchitecture.pdf):

  - On some x86-64 processors (Intel Sandy Bridge to Tiger Lake),
    sub+jz and sub+jnz can be fused but xor+jz or xor+jnz cannot. Thus
    using subtraction has potential to be a tiny amount faster since
    the code checks if the difference is non-zero.

  - Some processors (Intel Pentium 4) used to have more ALU resources
    for add/sub instructions than and/or/xor.

So in the C code it's not a huge thing and in Java it's probably about
nothing. But there is no real downside to using subtraction.

I understand how xor seems a more obvious choice. However, when
looking for the lowest differing bit, subtraction will make that bit 1
and the bits below it 0. Only the bits above the 1 will differ between
subtraction and xor but those bits are irrelevant here.

I created a new branch, bytearrayview, which combines the CRC64 edits
with the encoder speed changes as they share the ByteArrayView class
(formerly ArrayUtil).

> > I still need to check a few of your edits if some of them should be
> > included. :-)
>
> I think the changes to LZMAEncoderNormal as part of this PR to avoid
> the negative length comparison would be good to carry forward.

Done, I hope.

> 1. Use an interface with implementation chosen statically to separate
> out the implementation options.

I had an early version that used separate implementation classes but I
must have done something wrong as that version was *clearly* slower.
So I tried it again and it's as you say, no speed difference. :-)

> 2. Allow specifying the implementation to use with a system property.

Done.
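As a standalone illustration of the subtraction-vs-xor point above (a
sketch, not code from xz-java; the class and method names are made up):
for two unequal longs, the lowest set bit of a - b is the same as the
lowest set bit of a ^ b, so both yield the same first differing byte
index via Long.numberOfTrailingZeros.

```java
public class MatchLenDemo {
    // Index (0-7) of the lowest differing byte of two unequal longs,
    // i.e. the first differing byte when the longs were read in
    // little-endian order from a byte array.
    static int firstDiffByteXor(long a, long b) {
        return Long.numberOfTrailingZeros(a ^ b) >>> 3;
    }

    // Subtraction sets the lowest differing bit to 1 and clears the
    // bits below it; only the bits above it differ from xor, and
    // those don't affect numberOfTrailingZeros.
    static int firstDiffByteSub(long a, long b) {
        return Long.numberOfTrailingZeros(a - b) >>> 3;
    }

    public static void main(String[] args) {
        long a = 0x1122334455667788L;
        long b = 0x1122AA4455667788L; // differs only in byte index 5
        System.out.println(firstDiffByteXor(a, b)); // prints 5
        System.out.println(firstDiffByteSub(a, b)); // prints 5
    }
}
```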
I hope it's done in a sensible enough way. The Java < 9 code is
completely separate so it cannot be chosen. The property needs to be
documented somewhere too.

I suppose the ARM64 speed is still to be determined by you or someone
else.

-- Lasse Collin
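The statically chosen implementation plus system-property override
discussed in this thread could be structured roughly like this. All
names here (the interface, the property key, the classes) are
hypothetical, not the real xz-java API; the optimized pick is left as a
placeholder.

```java
// Hypothetical sketch of picking an array-comparison implementation
// once at class-load time, with an optional override such as
// -Dorg.example.matchLenImpl=basic on the command line.
interface MatchLength {
    int getMatchLen(byte[] buf, int i, int j, int lenLimit);
}

final class BasicMatchLength implements MatchLength {
    // Portable byte-by-byte version; works on any JDK.
    public int getMatchLen(byte[] buf, int i, int j, int lenLimit) {
        int len = 0;
        while (len < lenLimit && buf[i + len] == buf[j + len])
            ++len;
        return len;
    }
}

final class MatchLengthFactory {
    static final MatchLength INSTANCE = choose();

    private static MatchLength choose() {
        String pref = System.getProperty("org.example.matchLenImpl", "auto");
        if (pref.equals("basic"))
            return new BasicMatchLength();
        // "auto": here one would return an optimized variant
        // (VarHandle, Arrays.mismatch, ...) when running on a JDK
        // that supports it. Placeholder for the sketch:
        return new BasicMatchLength();
    }
}
```

Keeping the chosen instance in a static final field lets the JIT
devirtualize and inline the call, which is why the indirection need not
cost anything at runtime.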
Re: [xz-devel] xz-java and newer java
On 2024-03-12 Brett Okken wrote:
> I am still working on digesting your branch.

I still need to check a few of your edits if some of them should be
included. :-)

> The difference in method signature is subtle, but I think a key part
> of the improvements you are getting. Could you add javadoc to more
> clearly describe how the args are to be interpreted and what the
> return value means?

I pushed basic docs for getMatchLen. Once crc64_varhandle2 is merged
then array_compare should use ArrayUtil too. It doesn't make a
difference in speed.

> I am playing with manually unrolling the java 8 byte-by-byte impl
> along with tests comparing unsafe, var handle, and vector approaches.
> These tests take a long time to run, so it will be a couple days
> before I have complete results. Do you want data as I have it (and it
> is interesting), or wait for summary?

I can wait for the summary, thanks.

> I am not sure when I will get opportunity to test out arm64.

If someone has, for example, a Raspberry Pi, the compression of zeros
test is simple enough to do and at least on x86-64 has a clear enough
difference. It's an over-simplified test but it's a data point still.

> I do have some things still on jdk 8, but only decompression. Surveys
> seem to indicate quite a bit of jdk 8 still in use, but I have no
> personal need.

Thanks. I was already tilted towards not using Unsafe and now I'm even
more so. The speed benefit of Unsafe over VarHandle should be tiny
enough. It feels better that memory safety isn't ignored on any JDK
version. If a bug was found, it's nicer to not wonder if Unsafe had a
role in it. This is better for security too.

In my previous email I wondered if using Unsafe only with Java 8 would
make upgrading to a newer JDK look bad if the newer JDK used VarHandle
instead of Unsafe. Perhaps that worry was overblown. But the other
reasons and keeping the code simpler make me want to avoid Unsafe.
(C code via JNI wouldn't be memory safe but then the speed benefits
should be much more significant too.)

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-03-09 Brett Okken wrote:
> When I tested graviton2 (arm64) previously, Arrays.mismatch was
> better than comparing longs using a VarHandle.

Sounds promising. :-) However, your array_comparison_performance
handles the last 1-7 bytes byte-by-byte. My array_compare branch
reserves an extra 7 bytes at the end of the array so that one can
safely read up to 7 bytes more than one actually needs. This way no
bounds checks are needed (even with Unsafe). This might affect the
comparison between Arrays.mismatch and VarHandle if the results were
close before.

> I do like Unsafe as an option for jdk 8 users on x86 or arm64.

Unsafe seems very slightly faster than VarHandle. If Java 8 uses
Unsafe, should newer versions do too? It could be counter-productive
if Java 8 was faster, even if the difference was tiny.

Do you have use cases that are (for now) stuck on Java 8 or is your
wish a more generic one?

-- Lasse Collin
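A minimal sketch of the padding idea from the email above, with
hypothetical names (this is not the actual array_compare branch code):
the buffer is allocated with at least 7 spare bytes after the valid
data so that 8-byte VarHandle reads starting at any valid index stay
inside the array, and the 1-7 tail bytes need no separate byte-by-byte
loop.

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class PaddedCompare {
    private static final VarHandle LONGS =
        MethodHandles.byteArrayViewVarHandle(long[].class,
                                             ByteOrder.LITTLE_ENDIAN);

    // Length of the common prefix of buf[i..] and buf[j..], capped at
    // lenLimit. Assumes buf has at least 7 bytes of padding past the
    // valid data, so an 8-byte read at any valid index is in bounds.
    static int matchLen(byte[] buf, int i, int j, int lenLimit) {
        int len = 0;
        while (len < lenLimit) {
            long x = (long) LONGS.get(buf, i + len);
            long y = (long) LONGS.get(buf, j + len);
            long d = x - y;
            if (d != 0) {
                // Lowest set bit of d marks the first differing byte
                // (little-endian); clamp in case it is past lenLimit.
                int extra = Long.numberOfTrailingZeros(d) >>> 3;
                return Math.min(len + extra, lenLimit);
            }
            len += 8;
        }
        return lenLimit;
    }
}
```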
Re: [xz-devel] xz-java and newer java
I created a branch array_compare. It has a simple version for
Java <= 8 which seems very slightly faster than the current code in
master, at least when tested with OpenJDK 21. For Java >= 9 there is
Arrays.mismatch for portability and VarHandle for x86-64 and ARM64.
These are clearly faster than the basic version.

sun.misc.Unsafe would be a little faster than VarHandle but I feel
it's not enough to be worth the downsides (non-standard and not memory
safe). I didn't include 32-bit archs, for now at least, since if
people want speed I hope they don't run 32-bit Java.

Speed differences are very minor when testing with files that don't
compress extremely well. That was the problem I had with my earlier
test results. With files that have a compression ratio like 0.05 the
speed differences are clear. I cannot test on ARM64 so it would be
great if someone can, comparing the three versions. The most extreme
difference is when compressing just zeros:

    time head -c1 /dev/zero \
        | java -jar build/jar/XZEncDemo.jar > /dev/null

Internal docs should be added to the branch and perhaps there are
other related optimizations to do still. So it's not fully finished
yet but now it's ready for testing and feedback. For example, some
tweaks from your array_comp_incremental could be considered after
testing.

-- Lasse Collin
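For reference, the portable Java >= 9 approach mentioned above can be
built on java.util.Arrays.mismatch. This is a sketch with made-up
names, not the branch's actual code; the JDK can intrinsify mismatch
into vectorized code, which is why it competes with hand-written
VarHandle loops.

```java
import java.util.Arrays;

public class MismatchLen {
    // Match length of buf[i..] vs buf[j..], capped at lenLimit.
    // Arrays.mismatch returns the relative index of the first
    // differing byte, or -1 when the two ranges are fully equal.
    static int matchLen(byte[] buf, int i, int j, int lenLimit) {
        int r = Arrays.mismatch(buf, i, i + lenLimit,
                                buf, j, j + lenLimit);
        return r < 0 ? lenLimit : r;
    }
}
```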
Re: [xz-devel] [BUG] Issue with xz-java: Unknown Filter ID
On 2024-03-05 Dennis Ens wrote:
> > I hope 1.10 could be done in a month or two but I don't want to
> > make any promises or serious predictions. Historically those
> > haven't been accurate at all.
>
> I'll hope it's on the sooner side then. Is there a reason that
> xz-java is so far behind its counterpart?

These are unpaid hobby projects and the maintainers work on things
they happen to find interesting. The focus was on XZ Utils for quite a
long time; now more attention is returning to XZ for Java.

> It seems those filters have been in that version for a while, and it
> seems strange they aren't compatible with each other. Maybe this
> should be made more clear in the README?

The README file in XZ for Java 1.9 specifies that the code implements
the .xz file format specification version 1.0.4. That doesn't include
the ARM64 or RISC-V filters. The ARM64 filter was in the master branch
already. The RISC-V filter is there now too among a few other changes.
README refers to spec version 1.2.0 now.

I understand it can be cryptic to refer to a spec version but
obviously one cannot list what future things are missing. One could
list supported filters but in theory something else could be extended
too.

> I don't see anything about contributing on the xz-java github page.
> What are the best practices for contributing to this project?

I'm not sure if there is anything specific. Chatting on #tukaani can
be good to get ideas discussed quickly but it requires that people
happen to be online at the same time.

> > The encoder implementations have some minor differences which
> > affect both output and speed. Different releases can in theory
> > have different output. XZ Utils output might change in future
> > versions too.
>
> I see, that makes sense. I'm glad the difference is explainable and
> not a bug. Can you explain exactly what the differences are?

I don't remember much now. It's minor details but minor differences
affect the output already.
> Does xz-java always do a better job compressing since it resulted in
> a smaller file?

They should be very close in practice. You need to compare to XZ Utils
in single-threaded mode: xz -T1

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-02-29 Brett Okken wrote:
> > Thanks! Ideally there would be one commit to add the minimal
> > portable version, then separate commits for each optimized
> > variant.
>
> Would you like me to remove the Unsafe based impl from
> https://github.com/tukaani-project/xz-java/pull/13?

There are new commits in master now and those might slightly conflict
with your PR (@Override additions). I'm playing around a bit and
learning about the faster methods still. So right now I don't have
wishes for changes; I don't want to request anything when there's a
possibility that some other way might end up looking more preferable.

In general, I would prefer splitting to more commits. Using your PR as
an example:

  1. Adding the changes to lz/*.java and the portable *Array*.java
     code required by those changes.

  2. Adding one advanced implementation that affects only the
     *Array*.java files.

  3. Repeat step 2 until all implementations are added.

When reasonably possible, the line length should be under 80 chars.

> > So far I have given it only a quick try. array_comp_incremental
> > seems faster than xz-java.git master. Compression time was reduced
> > by about 10 %. :-) This is with OpenJDK 21.0.2, only a quick test,
> > and my computer is old so I don't doubt your higher numbers.
>
> How are you testing? I am using jmh, so it has a warm up period
> before actually measuring, giving the jvm plenty of opportunity to
> perform optimizations. If you are doing single shot executions to
> compress a file, that could provide pretty different results.

I was simply timing XZEncDemo at the default preset (6). I had hoped
that big files (binary and source packages) that take tens of seconds
to compress, repeating each test a few times, would work well enough.
But perhaps the difference is big enough only with certain types of
files.
On 2024-03-05 Brett Okken wrote:
> I have added a comment to the PR with updated benchmark results:
> https://github.com/tukaani-project/xz-java/pull/13#issuecomment-1977705691

Thanks! I'm not sure if I read the results well enough. The "Error"
column seems to have oddly high values on several lines. If the same
test set is run again, are the results in the "Score" column similar
enough between the two runs, retaining the speed order of the
implementations being tested?

If the first file is only ~66KB, I wonder if other factors like
initializing large arrays in the classes take so much time that
differences in array comparison speeds become hard to measure. When
each test is repeated by the benchmarking framework, each run has to
allocate the classes again. Perhaps it might trigger garbage
collection. Did you have ArrayCache enabled?

    ArrayCache.setDefaultCache(BasicArrayCache.getInstance());

I suppose optimizing only for new JDK version(s) would be fine if it
makes things easier. That is, it could be enough that performance
doesn't get worse on Java 8.

If the indirection adds overhead, would it make sense to have a
preprocessing step that creates .java file variants that directly use
the optimized methods? So LZMAEncoder.getInstance could choose at
runtime if it should use LZMAEncoderNormalPortable or
LZMAEncoderNormalUnsafe or some other implementation. That is, if this
cannot be done with a multi-release JAR. It's not a pretty solution
but if it is faster then it could be one option, maybe.

Negative lenLimit currently occurs in two places (at least). Perhaps
it should be handled in those places instead of requiring the array
comparison to support it (the C code in liblzma does it like that).

-- Lasse Collin
Re: [xz-devel] [BUG] Issue with xz-java: Unknown Filter ID
On 2024-03-05 Dennis Ens wrote:
> > The XZ for Java development is becoming active again but it may
> > still take a while until the next stable release is out. A few
> > other things are waiting in the queue from the past three years.
>
> Ah, I see. Thank you for the answer. Do you have a timeline of when
> the changes are expected?

I hope 1.10 could be done in a month or two but I don't want to make
any promises or serious predictions. Historically those haven't been
accurate at all.

> First, xz-java seems much slower. I tested compressing and
> decompressing a ~1.2 gigabyte file, and xz-java took 17m32.345s
> compared to xz's 7m7.615s to compress. Decompressing was 0m21.760s
> to 0m6.223s. Is there anything that can be done to improve the speed
> of the Java version, or is c just a much more efficient programming
> language?

Brett Okken's patches (originally from early 2021) should improve
compression speed. They are currently under review. Those are one of
the things to get into the next stable release.

However, Java in general is slower. Some compressors have a Java API
but the performance-critical code is native code. For example,
java.util.zip calls into native code from zlib. XZ for Java doesn't
use any native code (for now at least).

XZ for Java still lacks threading. Implementing it is among the most
important tasks in XZ for Java. It helps with big files like your test
file but makes the compressed file a little bigger.

From your numbers I'm not certain if you used xz in threaded mode or
not. The time difference looks unusually high for single-threaded mode
for both compression and decompression. The difference for a big input
file in threaded mode looks small though (unless it had lots of
trivially-compressible sections). In single-threaded mode, I would
expect compressing with xz to take around 30-40 % less time than XZ
for Java but your numbers show a 60 % time reduction.
XZ Utils 5.6.0 added x86-64 assembly (GCC & Clang only) which reduces
per-thread decompression time by 20-40 % depending on the file and the
computer. So that increases the difference between XZ Utils and XZ for
Java too: decompression time can be roughly 50 % less with XZ Utils
5.6.0 in single-threaded mode on x86-64 compared to XZ for Java. XZ
Utils 5.6.0 also enables threaded mode by default.

> Also, I noticed that the results of compressing the files were
> different sizes. They both worked, so I don't know if it's an issue,
> but it does seem strange. The xz-java one was slightly smaller than
> the xz one.

The encoder implementations have some minor differences which affect
both output and speed. Different releases can in theory have different
output. XZ Utils output might change in future versions too.

-- Lasse Collin
Re: [xz-devel] [BUG] Issue with xz-java: Unknown Filter ID
On 2024-03-05 Dennis Ens wrote:
> The files specifically were good-1-arm64-lzma2-1.xz and
> good-1-arm64-lzma2-2.xz and good-1-riscv-lzma2-1.xz and
> good-1-riscv-lzma2-2.xz. These did seem to work fine when I tried
> with xz, but not with xz-java. Do you think there might be a fix
> available for this soon?

XZ for Java 1.9 doesn't have the ARM64 or RISC-V filter. The master
branch has the ARM64 filter. The RISC-V filter will likely be there
this week.

The XZ for Java development is becoming active again but it may still
take a while until the next stable release is out. A few other things
are waiting in the queue from the past three years.

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-02-25 Brett Okken wrote:
> I created https://github.com/tukaani-project/xz-java/pull/13 with the
> bare bones changes to utilize a utility for array comparisons and an
> Unsafe implementation.
> When/if that is reviewed and approved, we can move on through the
> other implementation options.

Thanks! Ideally there would be one commit to add the minimal portable
version, then separate commits for each optimized variant.

So far I have given it only a quick try. array_comp_incremental seems
faster than xz-java.git master. Compression time was reduced by about
10 %. :-) This is with OpenJDK 21.0.2, only a quick test, and my
computer is old so I don't doubt your higher numbers.

With array_comparison_performance the improvement seems to be less,
maybe 5 %. I didn't test much yet but it still seems clear that
array_comp_incremental is faster on my computer.

However, your code produces different output compared to xz-java.git
master so the speed comparison isn't entirely fair. I assume there was
no intent to affect the encoder output with these changes so I wonder
what is going on. Both of your branches produce the same output so
it's something common between them that makes the difference. I plan
to get back to this next week.

> > One thing I wonder is if JNI could help.
>
> It would most likely make things faster, but also more complicated. I
> like the java version for the simplicity. I am not necessarily
> looking to compete with native performance, but would like to get
> improvements where they are reasonably available. Here there is some
> complexity in supporting multiple implementations for different
> versions and/or architectures, but that complexity does not intrude
> into the core of the xz code.

I think your thoughts are similar to mine here. The Java version is
clearly slower but it's nicer code to read too. A separate class for
buffer comparisons indeed doesn't hurt the readability of the core
code.
On the other hand, if the Java version happened to be used a lot then
JNI could save both time (up to 50 %) and even electricity.
java.util.zip uses native zlib for the performance-critical code. In
the long run both faster Java code and JNI might be worth doing.
There's more than enough pure Java stuff to do for now so any JNI
thoughts have to wait.

-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.
On 2024-02-28 Sebastian Andrzej Siewior wrote:
> On 2024-02-28 18:45:03 [+0200], Lasse Collin wrote:
> > V_DEBUG was committed to the master and v5.6 branches a few moments
> > ago, so yes, your plan sounds good. :-) Feel free to do it as you
> > prefer, either just making the change or picking the other simple
> > fixes from v5.6 as well.
>
> Perfect. I just took the patch.

Thanks! :-)

> > Hopefully the already-added workarounds in other packages don't
> > cause any unwanted side effects in the future.
>
> The plan was to revert it.

All good. :-)

There is a branch "memavail" on GitHub with experimental support for
MemAvailable from Linux /proc/meminfo. It needs discussion and
feedback (likely in a new thread). There is no rush as it's not for
5.6.x anyway.

-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.
On 2024-02-28 Sebastian Andrzej Siewior wrote:
> I see. In that case let me throw this to V_DEBUG Debian wise and sync
> with xz upstream once a new release is up or so. I have two packages
> that fail because of this and dpkg added a workaround. So instead of
> adding another workaround to another package I would fix this on the
> xz side. Sounds good?

V_DEBUG was committed to the master and v5.6 branches a few moments
ago, so yes, your plan sounds good. :-) Feel free to do it as you
prefer, either just making the change or picking the other simple
fixes from v5.6 as well.

Hopefully the already-added workarounds in other packages don't cause
any unwanted side effects in the future.

Thanks!

-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.
On 2024-02-27 Sebastian Andrzej Siewior wrote:
> On 2024-02-27 19:17:48 [+0200], Lasse Collin wrote:
> > - The silencing could be done with -q as well though.
>
> Wouldn't -q also shut some legitimate warnings?

Yes. When compressing from stdin to stdout, there aren't many possible
warnings but there are still a few rare ones. So -q isn't ideal to get
rid of thread count reduction messages.

> Isn't the automatic memory usage accurate?

It's simply 25 % of total RAM. The Linux-specific MemAvailable from
/proc/meminfo didn't get into 5.6.0. Perhaps it could be done in the
next development cycle, and maybe also look for similar features on a
few other OSes.

> Not sure if documenting it in the man-page would help here.

One issue is that currently the message tells about thread count
reduction and what the memlimit is but not how much memory is actually
required. One needs to use -vv to get the usage info.

Documenting it on the man page could be good if it can be explained in
an understandable way and people can find it there. The man page is
long already. The less average users *need* to understand the details
the better.

> > There are also messages that are shown when the memory limit does
> > affect compressed output (switching to single-threaded mode and
> > LZMA2 dictionary size adjustment). The verbosity requirement of
> > these messages isn't being changed now.
>
> This sounds like you accept this change in principle but are thinking
> if V_VERBOSE or V_DEBUG is the right thing.

Me and three other people on IRC think it should be changed but there
is no consensus yet on what exactly is best (your patch, -v, or -vv).
This is about the thread count messages only as (since 5.4.0) the
automatic thread count doesn't affect the compressed output.

There is some discussion also here:
https://github.com/tukaani-project/xz/issues/89

-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid warnings due to memlimit if threads are in auto mode.
On 2024-02-26 Sebastian Andrzej Siewior wrote:
> Print the warning about reduced threads only if number is selected
> - automatically and asked to be verbose (-v)
> - explicit by the user

Thanks for the patch! We discussed a bit on IRC and everyone thinks
it's on the right track but we are pondering the implementation
details still.

The thread count messages are shown in situations which don't affect
the compressed output, and thus the importance of these messages isn't
so high. Originally they were there to reduce the chance of people
asking why xz isn't using as many threads as requested.

We are considering simply changing those two message() calls to always
use V_VERBOSE or V_DEBUG instead of the current V_WARNING. So
automatic vs. manual number of threads wouldn't affect it like it does
in your patch. Comparing your approach and this simpler one:

  + There are scripts that take a user-specified number for
    parallelization and that number is passed to multiple tools, not
    just xz. Keeping xz -T16 silent about thread count reduction can
    make sense in this case.

  - The silencing could be done with -q as well though.

There are pros and cons between V_VERBOSE and V_DEBUG. For
(de)compression, a single -v sets V_VERBOSE and activates the progress
indicator. If the thread count messages are shown at -v, on some
systems progress indicator usage would get the message about reduced
thread count as well.

  + It works as a hint that increasing the memory usage limits
    manually might allow more threads to be used.

  - If one uses the progress indicator frequently, the thread count
    reduction message might become slightly annoying as the
    information is already known by the user.

  - The progress indicator can be used in non-interactive cases (when
    stderr isn't a terminal). Then xz only prints a final summary per
    file. This likely is not a common use case but the thread count
    messages would be here as well.

V_DEBUG is set when -v is used twice (-vv).

  + Regular progress indicator uses wouldn't get extra messages.

  - A larger number of users might not become aware that they aren't
    getting as many threads as they could because the automatic memory
    usage limit is too low to allow more threads.

There are also messages that are shown when the memory limit does
affect compressed output (switching to single-threaded mode and LZMA2
dictionary size adjustment). The verbosity requirement of these
messages isn't being changed now.

-- Lasse Collin
Re: [xz-devel] Testing LZMA_RANGE_DECODER_CONFIG
On 2024-02-19 Sebastian Andrzej Siewior wrote:
> Okay, so the input matters, too. I tried 1GiB urandom (so it does not
> compress so well) but that went quicker than expected…

urandom should be incompressible. When LZMA2 cannot compress a chunk
it stores it in uncompressed form. Decompression is like "cat with
CRC".

> I found 3 idle x86 boxes and re-ran a test with linux' perf on them
> and the arm64 box. In all flavours for the two archives. On RiscV I
> did the 'xz -t' thing because perf seems not to be supported well or
> I lack access.

Great work! Thanks! On IRC one person ran a bunch of tests too. On
ARM64 the results were mixed. A variant that was better with GCC could
be worse with Clang. So those weren't as clear as your results but
they too made me think that using 0 for non-x86-64 is the way to go
for 5.6.0.

Your x86-64 asm variant results were interesting too. Seems that the
bit 0x100 isn't good with GCC although the difference is small. I
confirmed this in the tests I did on a Celeron G1620 (Ivy Bridge). So
I wonder if 0x0F0 should be the x86-64 variant to use in xz 5.6.0 with
GCC.

On another machine with Clang 16, 0x100 is 8 % faster with the Linux
kernel source. So the difference is somewhat big. It's still slightly
slower than the GCC version. This is on a Phenom II X4 920.

Since 0x100 is only a little worse with GCC, using it for both GCC and
Clang could be OK. An #ifdef __clang__ could be used too but perhaps
it's not great in the long term. Something has to be chosen for 5.6.0;
further tweaks can be made later.

By the way, the "time" command gives more precise results than
"xz -v". I use

    TIMEFORMAT=$'\nreal\t%3R\nuser\t%3U\nsys\t%3S\ncpu%%\t%P'

in bash to keep the output as seconds instead of minutes and seconds.

-- Lasse Collin
Re: [xz-devel] xz-java and newer java
On 2024-02-19 Brett Okken wrote:
> I have created a pr to the GitHub project.
>
> https://github.com/tukaani-project/xz-java/pull/12

Thanks! It could be good to split it into smaller commits to make
reviewing easier.

> It is not clear to me if that is actually seeing active dev on the
> Java project yet.

I see now that there are quite a few things on GH. I had forgotten to
turn email notifications on for the xz-java project; clearly those
aren't on by default. :-( But likely not much would have been done
even if I had noticed those issues and PRs earlier so the main problem
is that the silence has been impolite. I'm sorry.

XZ Utils 5.6.0 has to be released this month since there was a wish to
get it into the next Ubuntu LTS. I'm hoping that next month something
will finally get done around XZ for Java. We'll see.

One thing I wonder is if JNI could help. Optimizing the Java code can
help a bit but I suspect that it still won't be very fast. So far it
has been nice that the Java code is quite readable and I would like to
keep it that way in the future too.

-- Lasse Collin
Re: [xz-devel] Testing LZMA_RANGE_DECODER_CONFIG
The balance between the hottest locations in the decompressor code
varies depending on the input file. Linux kernel source compresses
very well (ratio is about 0.10). This reduces the benefit of
branchless code. On my main computer I still get about 2 % time
reduction with =3. On another x86-64 computer I don't see any
difference between =0 and =3 with the Linux kernel source.

On the same machine, decompression time of warzone2100-data[1] from
Debian is reduced by 10.5 % with =3 compared to =0. It's a package
that doesn't compress so well (ratio is about 0.75). On my main
computer the time reduction from =0 to =3 is 8.5 %. All numbers are
with GCC.

Of course, on x86-64 the =0 vs. =3 test isn't that interesting since
the asm is so much better. But this highlights how much the test file
choice can make a difference.

[1] https://packages.debian.org/bookworm/all/warzone2100-data/download

-- Lasse Collin
Re: [xz-devel] Testing LZMA_RANGE_DECODER_CONFIG
On 2024-02-17 Sebastian Andrzej Siewior wrote:
> I did some testing on !x86. I changed LZMA_RANGE_DECODER_CONFIG to
> different values, ran a test, and looked at the MiB/s value. xz_0
> means LZMA_RANGE_DECODER_CONFIG was 0, xz_1 means the define was set
> to 1. I touched src/liblzma/lzma/lzma_decoder.c and rebuilt xz. I
> pinned the shell to a single CPU and ran the test for the archive
> (-tv) for one file three times.

Great to see testing! The testing method is fine. If pinning to a
single core, I assume --threads=1 was set as well because
multithreading is the default now.

Branchless code can help when branch prediction penalties are high. So
it will depend on the processor (not just the instruction set). On
x86-64, there was a clear improvement with the branchless C code. It
was a little more with Clang than GCC. So if easily possible, also
testing with Clang could be useful.

Testing your script on x86-64 could be worth it too, to check that at
least on x86-64 you get an improvement with =1 and =3 compared to =0.
(The bit 1 makes the main difference; 2 should have a small effect,
and 4 and 8 are questionable and perhaps not worth benchmarking until
the usefulness of =1 or =3 is clear.)

If the branchless C code is not consistent outside x86-64, then 5.6.0
likely should stick to =0. From your results it seems that the other
tweaks to the code provided a minor improvement on non-x86-64 still.
(The tweaks that LZMA_RANGE_DECODER_CONFIG doesn't affect.)

Thanks!

-- Lasse Collin
[xz-devel] XZ projects license change proposal
Hello! I have made a post on GitHub about possibly moving from public domain to BSD Zero Clause License: https://github.com/tukaani-project/xz/issues/79 Feedback is welcome. Feel free to comment on GitHub, privately via email to x...@tukaani.org, or on the xz-devel mailing list. Thank you! PS. XZ for Java has been idle longer than expected but it should finally get at least some attention in the coming months. -- Lasse Collin
Re: [xz-devel] [PATCH] [xz-embedded] Fix condition that automatically define XZ_DEC_BCJ
On 2023-09-07 Jules Maselbas wrote:
> The XZ_DEC_BCJ macro was not defined when only selecting the ARM64 BCJ
> decoder, leading to no BCJ decoder being compiled.
>
> The macro that select XZ_DEC_BCJ if any of the BCJ decoder is
> selected was missing a case for the recently added ARM64 BCJ decoder.
>
> Also the macro `defined(XZ_DEC_ARM)` was used twice in the condition
> for selecting XZ_DEC_BCJ, so this patch replaces one with
> XZ_DEC_ARM64.

Thanks! I kept the ordering of the filter names the same as elsewhere in the file and in xz_dec_bcj.c. The ARM64 filter still hasn't been submitted to Linux but it's on the to-do list.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.11 and 5.4.2
XZ Utils 5.2.11 and 5.4.2 are available at <https://tukaani.org/xz/>. The Doxygen-generated liblzma API documentation is now available online at <https://tukaani.org/xz/liblzma-api/files.html>.

Please let us know if there is interest in more releases for the 5.2 branch. Jia Tan and I will plan further bug-fix releases for this branch only if people use it.

Future release tarballs might be signed by Jia Tan. Recently he has done most of the work in XZ Utils. :-)

Here is an extract from the NEWS file:

5.2.11 (2023-03-18)

    * Removed all possible cases of null pointer + 0. It is undefined
      behavior in C99 and C17. This was detected by a sanitizer and
      had not caused any known issues.

    * Build systems:

      - Added a workaround for building with GCC on MicroBlaze Linux.
        GCC 12 on MicroBlaze doesn't support the __symver__ attribute
        even though __has_attribute(__symver__) returns true. The
        build is now done without the extra RHEL/CentOS 7 symbols
        that were added in XZ Utils 5.2.7. The workaround only
        applies to the Autotools build (not CMake).

      - CMake: Ensure that the C compiler language is set to C99 or
        a newer standard.

      - CMake changes from XZ Utils 5.4.1:

        * Added a workaround for a build failure with windres from
          GNU binutils.

        * Included the Windows resource files in the xz and xzdec
          build rules.

5.4.2 (2023-03-18)

    * All fixes from 5.2.11 that were not included in 5.4.1.

    * If xz is built with support for the Capsicum sandbox but running
      in an environment that doesn't support Capsicum, xz now runs
      normally without sandboxing instead of exiting with an error.

    * liblzma:

      - Documentation was updated to improve the style, consistency,
        and completeness of the liblzma API headers.

      - The Doxygen-generated HTML documentation for the liblzma API
        header files is now included in the source release and is
        installed as part of "make install". All JavaScript is
        removed to simplify license compliance and to reduce the
        install size.
      - Fixed a minor bug in lzma_str_from_filters() that produced
        too many filters in the output string instead of reporting an
        error if the input array had more than four filters. This bug
        did not affect xz.

    * Build systems:

      - autogen.sh now invokes the doxygen tool via the new wrapper
        script doxygen/update-doxygen, unless the command line option
        --no-doxygen is used.

      - Added microlzma_encoder.c and microlzma_decoder.c to the VS
        project files for Windows and to the CMake build. These
        should have been included in 5.3.2alpha.

    * Tests:

      - Added a test to the CMake build that was forgotten in the
        previous release.

      - Added and refactored a few tests.

    * Translations:

      - Updated the Brazilian Portuguese translation.

      - Added Brazilian Portuguese man page translation.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.10 and 5.4.0
d in CMake-based builds too ("make test"). -- Lasse Collin
[xz-devel] XZ Utils 5.3.5beta
There were technical issues on the tukaani.org website in the past 24 hours. These should be fixed now. Sorry for the inconvenience.

XZ Utils 5.3.5beta is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.3.5beta (2022-12-01)

    * All fixes from 5.2.9.

    * liblzma:

      - Added new LZMA_FILTER_LZMA1EXT for raw encoder and decoder to
        handle raw LZMA1 streams that don't have an end of payload
        marker (EOPM), also known as an end of stream (EOS) marker.
        It can be used in filter chains, for example, with the x86
        BCJ filter.

      - Added lzma_str_to_filters(), lzma_str_from_filters(), and
        lzma_str_list_filters() to make it easier for applications to
        get custom compression options from a user and convert them
        to an array of lzma_filter structures.

      - Added lzma_filters_free().

      - lzma_filters_update() can now be used with the multi-threaded
        encoder (lzma_stream_encoder_mt()) to change the filter chain
        after LZMA_FULL_BARRIER or LZMA_FULL_FLUSH.

      - In lzma_options_lzma, allow nice_len = 2 and 3 with the match
        finders that require at least 3 or 4. It is now rounded up
        internally if needed.

      - The ARM64 filter was modified. It is still experimental.

      - Fixed LTO build with Clang if -fgnuc-version=10 or similar
        was used to make Clang look like GCC >= 10. Now it uses
        __has_attribute(__symver__) which should be reliable.

    * xz:

      - --threads=+1 or -T+1 is now a way to put xz into
        multi-threaded mode while using only one worker thread.

      - In --lzma2=nice=NUMBER, allow 2 and 3 with all match finders
        now that liblzma handles it.

    * Updated translations: Chinese (simplified), Korean, and Turkish.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.9
XZ Utils 5.2.9 is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.2.9 (2022-11-30)

    * liblzma:

      - Fixed an infinite loop in LZMA encoder initialization if
        dict_size >= 2 GiB. (The encoder only supports up to
        1536 MiB.)

      - Fixed two cases of invalid free() that can happen if a tiny
        allocation fails in encoder re-initialization or in
        lzma_filters_update(). These bugs had some similarities with
        the bug fixed in 5.2.7.

      - Fixed lzma_block_encoder() not allowing the use of
        LZMA_SYNC_FLUSH with lzma_code() even though it was
        documented to be supported. The sync-flush code in the Block
        encoder was already used internally via
        lzma_stream_encoder(), so this was just a missing flag in the
        lzma_block_encoder() API function.

      - GNU/Linux only: Don't put symbol versions into static liblzma
        as it breaks things in some cases (and even if it didn't
        break anything, symbol versions in static libraries are
        useless anyway). The downside of the fix is that if the
        configure options --with-pic or --without-pic are used then
        it's not possible to build both shared and static liblzma at
        the same time on GNU/Linux anymore; with those options
        --disable-static or --disable-shared must be used too.

    * New email address for bug reports is which forwards messages to
      Lasse Collin and Jia Tan.

-- Lasse Collin
Re: [xz-devel] [PATCH 1/2] Add support openssl's SHA256 implementation
On 2022-11-30 Lasse Collin wrote:
> Are there other good library options?

If the goal is to use SHA instructions on x86, then intrinsics in the C code with runtime CPU detection are an option too. It's done in crc64_fast.c in 5.3.4alpha already.

-- Lasse Collin
Re: [xz-devel] [PATCH 1/2] Add support openssl's SHA256 implementation
Hello!

This could be good as an optional feature, disabled by default so that the extra dependency doesn't get added accidentally. It's too late for 5.4.0 but perhaps in 5.4.1 or .2.

The biggest problem with the patch is that it lacks error checking:

  - EVP_MD_CTX_new() can return NULL if memory allocation fails. The man page doesn't document this but the source code makes it clear.

  - EVP_get_digestbyname() can return NULL on failure. Perhaps this could be replaced with EVP_sha256()? It seems to return a pointer to a statically-allocated structure and the man page implies that it cannot fail.

  - EVP_DigestInit_ex(), EVP_DigestUpdate(), and EVP_DigestFinal_ex() can in theory fail, perhaps not in practice, I don't know. Currently liblzma assumes that initialization cannot fail, so that would need to be changed. It could be good to check the return values from EVP_DigestUpdate() and EVP_DigestFinal_ex() too. Since it is unlikely that EVP_DigestUpdate() fails, it could perhaps be OK to store the failure code and only return it from lzma_check_finish(), but I'm not sure if that is acceptable.

The configure option perhaps should be --with instead of --enable since it adds a dependency on another package, if one wants to stick to Autoconf's guidelines. (It's less clear if --enable-external-sha256 should be --with since it only affects what to use from the OS base libraries. In any case it won't be changed as it would affect compatibility with build scripts.)

Are there other good library options? For example, Nettle's SHA-256 functions don't need any error checking but I haven't checked the performance.

Is it a mess for distributions if a dependency of liblzma gets its soname bumped and then liblzma needs to be rebuilt without changing its soname? I suppose such things happen all the time, but when a library is needed by a package manager it might perhaps have extra worries.

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 Sebastian Andrzej Siewior wrote:
> 3x to be exact:
> - 1x shared with threads
> - 1x static with threads
> - 1x non-shared, no threads, no encoders, just xzdec.
>
> There are three build folder in the end. The full gets a make install,
> the other get xzdec/liblzma.a extracted.

Thanks! I remember the details now; it's excellent.

I figured out a way to make everything just work in the common case. If --with-pic or --without-pic is used, then building both shared and static liblzma at the same time isn't possible (configure will fail). That is, --with-pic or --without-pic requires that --disable-shared or --disable-static is also used on GNU/Linux.

It's in xz.git now and will be in the next releases (5.2.9 is needed to fix other bugs), so I hope any workarounds can be removed from distros after that. Thanks to Adrian for reporting the bug!

-- Lasse Collin
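A separate-step build like Debian's, under the constraint described above, could look something like this two-pass sketch (options are illustrative, not a literal packaging recipe):

```shell
# Pass 1: shared liblzma only; symbol versions stay enabled as usual.
./configure --disable-static
make

# Pass 2: static liblzma only. Symbol versions are useless in a static
# liblzma.a, so disable them; and with --with-pic/--without-pic in
# play, one of --disable-shared/--disable-static is mandatory anyway.
./configure --disable-shared --disable-symbol-versions --without-pic
make
```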
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 John Paul Adrian Glaubitz wrote:
> Well, Debian builds both the static and dynamic libraries in separate
> steps, so I'm not sure whether the autotools build system would be
> able to detect that.

I would assume the separate steps mean running configure twice, once to disable the static build and once to disable the shared build.

> I would make --enable-static and --enable-symbol-versions mutually
> exclusive so that the configure fails if both are enabled.

I was thinking of a slightly friendlier approach, so that the combination --disable-shared --enable-static would imply --disable-symbol-versions on GNU/Linux (it doesn't matter elsewhere for now). It's good if people never need to use the *-symbol-versions options; the defaults need to be as good as reasonably possible. Using --disable-symbol-versions as a temporary workaround is fine, but if it is needed in the long term then something is broken.

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 John Paul Adrian Glaubitz wrote:
> So, for now, we should build the static library with
> "--disable-symbol-versions".

An ugly workaround in upstream could be to make configure fail on GNU/Linux if both shared and static libs are about to be built, that is, show an error message describing that only one of the two can be built at a time. It would then be mandatory to use either --disable-static or --disable-shared to make configure pass. It's not pretty, but with Autotools I don't see any other way except dropping the RHEL/CentOS 7 compat symbols completely. Static libs shouldn't have symbol versions no matter which arch; somehow it just doesn't always create problems.

Or would it be less bad to default to a shared-only build and require the use of both --disable-shared --enable-static to get a static build? I don't like any of these but I don't have better ideas. Thoughts?

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 John Paul Adrian Glaubitz wrote:
> On 11/23/22 12:31, Lasse Collin wrote:
> > (1) Does this make the problem go away?
>
> Yes, that fixes the linker problem for me. At least in the case of
> mariadb-10.6.

Why does it want static liblzma.a in the first place? It sounds weird to require rebuilding of mariadb-10.6 every time liblzma is updated. Can it build against liblzma.so if liblzma.a isn't available?

It is fine to build *static* liblzma with --disable-symbol-versions on all archs. A Debian-specific workaround is fine in the short term, but this should be fixed upstream. One method would be to disable the extra symbols on ia64, but that is not a real fix. Perhaps it's not really possible as long as the main build system is Autotools; I don't currently know.

I'm still curious why exactly one symbol (lzma_get_progress) looks special in the readelf output. For some reason no other symbols with the symver declarations are there. Does it happen because of something in XZ Utils, or is it weird behavior in the toolchain that creates the static lib?

One can wonder if it was a mistake to try to clean up the issues that started from the RHEL/CentOS 7 patch, since now it has created a new problem. On the other hand, the same could have happened if this kind of symbol versioning had been done to avoid bumping the soname (which hopefully will never happen though).

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-23 John Paul Adrian Glaubitz wrote:
> I guess the additional unwind section breaks your workaround, so the
> best might be to just disable this workaround on ia64 using the
> configure flag, no?

There currently is no configure option to disable only the CentOS 7 workaround symbols. They are enabled if $host_os matches linux* and --disable-symbol-versions wasn't used. Disabling symbol versions from liblzma.so.5 will cause problems as they have been used since 5.2.0 and many programs and libraries will expect to find XZ_5.0 and XZ_5.2. Having the symbol versions in a static library doesn't make much sense though. Perhaps this is a bug in XZ Utils.

As a test, the static liblzma.a could be built without symbol versions with --disable-shared --disable-symbol-versions:

(1) Does this make the problem go away?

(2) Do the failing builds even require that liblzma.a is present on the system?

I don't know how to avoid symvers in a static library as, to my understanding, GNU Libtool doesn't add any -DBUILDING_SHARED_LIBRARY kind of flag which would allow using an #ifdef to know when to use the symbol versions. Libtool does add -DDLL_EXPORT when building a shared library on Windows but that's not useful here. (Switching to another build system would avoid some other Libtool problems too, like wrong shared library versioning on some OSes. However, the Autotools-based build system is able to produce a usable xz on quite a few less-common systems that some other build systems don't support.)

A workaround to this workaround could be to disable the CentOS 7 symbols on ia64 by default. Adding an explicit configure option is possible too, if needed. But the first step should be to understand what is going on, since the same problem could appear in the future if symbol versions are used for providing compatibility with an actual ABI change (hopefully not needed but still).
> Older versions are available through Debian Snapshots:
>
> http://snapshot.debian.org/package/xz-utils/

liblzma.a in liblzma-dev_5.2.5-2.1_ia64.deb doesn't have any "@XZ" in it, which is expected. This looks normal:

    : [0x18c0-0x1990], info at +0x100

> > Many other functions are listed in those .IA_64.unwind
> > sections too but lzma_get_progress is the only one that has "@XZ"
> > as part of the function name.
>
> Hmm, that definitely seems the problem. Could it be that the symbols
> that are exported on ia64 need some additional naming?

It seems weird why only one symbol is affected. Perhaps it's a bug in the toolchain creating liblzma.a. However, perhaps the main bug is that the XZ Utils build puts symbol versions into a static liblzma. :-(

> I think we can waive for CentOS 7 compatibility on Debian unstable
> ia64 .

There is no official CentOS 7 for ia64, but that isn't the whole story as the broken patch has been used elsewhere too. Not having those extra symbols would still be fine in practice. :-)

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-22 Sebastian Andrzej Siewior wrote:
> This looks like it is staticaly linked against liblzma.

The shared libs in Debian seem to be correct, as you managed to answer right before my email. Thanks! :-)

But the above comment made me look at Debian's liblzma.a. The output of

    readelf -aW usr/lib/ia64-linux-gnu/liblzma.a

includes the following two lines in both 5.2.7 and 5.3.4alpha:

    Unwind section '.IA_64.unwind' at offset 0x2000 contains 15 entries:
    [...]
    : [0x1980-0x1a50], info at +0x108

There are no older versions on the mirror so I didn't check what pre-5.2.7 would have. But .IA_64.unwind is an ia64-specific thing. Many other functions are listed in those .IA_64.unwind sections too, but lzma_get_progress is the only one that has "@XZ" as part of the function name.

I don't understand these details but I wanted to let you know anyway in case it isn't a coincidence that lzma_get_progress appears in a special form in both liblzma.a and in the linker error messages. The error has @@XZ_5.2 (which even 5.2.0 has in shared liblzma.so.5) but here the static lib has @XZ_5.2.2 which exists solely for CentOS 7 compatibility.

lzma_cputhreads doesn't show the same special behavior in the ia64 liblzma.a even though lzma_cputhreads is handled exactly like lzma_get_progress in the liblzma C code and linker script.

-- Lasse Collin
Re: [xz-devel] RHEL7 ABI patch (913ddc5) breaks linking on ia64
On 2022-11-22 John Paul Adrian Glaubitz wrote:
> Does anyone have a clue why this particular change may have broken
> the linking on ia64?

Thanks for your report. This is important to fix.

What do these commands print? Fix the path to liblzma.so.5 if needed.

    readelf --dyn-syms -W /lib/liblzma.so.5 \
        | grep lzma_get_progress

    readelf --dyn-syms -W /lib/liblzma.so.5 \
        | grep lzma_stream_encoder_mt_memusage

The first should print 2 lines and the second 3 lines. The rightmost columns should be like these:

    FUNC GLOBAL DEFAULT 11 lzma_get_progress@@XZ_5.2
    FUNC GLOBAL DEFAULT 11 lzma_get_progress@XZ_5.2.2
    FUNC GLOBAL DEFAULT 11 lzma_stream_encoder_mt_memusage@@XZ_5.2
    FUNC GLOBAL DEFAULT 11 lzma_stream_encoder_mt_memusage@XZ_5.1.2alpha
    FUNC GLOBAL DEFAULT 11 lzma_stream_encoder_mt_memusage@XZ_5.2.2

Pay close attention to @ vs. @@. The XZ_5.2 entries must be the ones with @@. If you see the same as above then I don't have a clue.

By any chance, was XZ Utils built with GCC older than 10 using link-time optimization (LTO, -flto)? As my commit message describes and NEWS warns, GCC < 10 with LTO will not produce correct results due to the symbol versions. It should work fine with GCC >= 10 or Clang.

For what it is worth, when I wrote the patch I tested it on Slackware 10.1 (32-bit x86) which has GCC 3.3.4, and it worked perfectly there. This symbol version stuff isn't a new thing so it really should work.

-- Lasse Collin
Re: [xz-devel] [PATCH] add xz arm64 bcj filter support
Hello!

On 2021-09-02 Liao Hua wrote:
> +#define LZMA_FILTER_ARM64 LZMA_VLI_C(0x0a)

Is this ID 0x0A in actual use somewhere? Can it be used in the official .xz format for something else than the filter you submitted?

On 2021-09-08 Lasse Collin wrote:
> On 2021-09-02 Liao Hua wrote:
> > We have some questions about xz bcj filters.
> > 1. Why ARM and ARM-Thumb bcj filters are little endian only?
>
> Perhaps it's an error. Long ago when I wrote the docs, I knew that the
> ARM filters worked on little endian code but didn't know how big
> endian ARM was done.

I read about this, and if I have understood correctly, in the past big endian ARM could use big endian instruction encoding too, but nowadays instructions are always in little endian order even if data access is big endian. The endianness in the docs is about instruction encoding; the filters don't care about data access. The mention of endianness has been removed in 5.3.4alpha (and thus 5.4.0) since it is more confusing than useful.

The PowerPC filter is indeed big endian only. Little endian PowerPC would need a new filter. Filtering little endian PowerPC code would give a compression improvement comparable to what the current big endian filter achieves.

> > 2. Why there is no arm64 bcj filter? Are there any technical risks?
> > Or other considerations?
>
> It just hasn't been done, no other reason.

There will probably be a new ARM64 filter in 5.4.0. The exact design is still not frozen. Different parameters work a little better or worse in different situations. It doesn't seem practical to make a tunable filter since few people would try different settings, and it would make the code slower and a little bigger (which matters in XZ Embedded).

With ARM64 it is good to use --lzma2=lc=2,lp=2 instead of the default lc=3,lp=0. This alone can give a little over 1 % smaller file.

-- Lasse Collin
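The lc/lp comparison above can be sketched like this. The input here is just whatever binary is at hand as a stand-in, so don't expect the >1 % saving, which applies to real ARM64 code; the sketch only demonstrates the option syntax.

```shell
# Compare default literal context settings (lc=3,lp=0) against the
# lc=2,lp=2 suggestion for ARM64 code. Stand-in input: the xz binary
# itself (an assumption for illustration, not from the original post).
set -e

f=$(command -v xz)

xz -c --lzma2=preset=6 "$f" > default.xz
xz -c --lzma2=preset=6,lc=2,lp=2 "$f" > tuned.xz

# Compare the compressed sizes (on ARM64 code, tuned.xz tends to win).
ls -l default.xz tuned.xz
```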
[xz-devel] XZ Utils 5.3.4alpha
XZ Utils 5.3.4alpha is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.3.4alpha (2022-11-15)

    * All fixes from 5.2.7 and 5.2.8.

    * liblzma:

      - Minor improvements to the threaded decoder.

      - Added CRC64 implementation that uses SSSE3, SSE4.1, and CLMUL
        instructions on 32/64-bit x86 and E2K. On 32-bit x86 it's not
        enabled unless --disable-assembler is used, but then the
        non-CLMUL code might be slower. Processor support is detected
        at runtime so this is built by default on x86-64 and E2K. On
        these platforms, if compiler flags indicate unconditional
        CLMUL support (-msse4.1 -mpclmul) then the generic version is
        not built, making liblzma 8-9 KiB smaller compared to having
        both versions included.

        With extremely compressible files this can make decompression
        up to twice as fast but with typical files a 5 % improvement
        is a more realistic expectation.

        The CLMUL version is slower than the generic version with
        tiny inputs (especially at 1-8 bytes per call, but up to 16
        bytes). In normal use in xz this doesn't matter at all.

      - Added an experimental ARM64 filter. This is *not* the final
        version! Files created with this experimental version won't
        be supported in the future versions! The filter design is a
        compromise where improving one use case makes some other
        cases worse.

      - Added decompression support for the .lz (lzip) file format
        version 0 and the original unextended version 1. See the API
        docs of lzma_lzip_decoder() for details. Also
        lzma_auto_decoder() supports .lz files.

      - Building with --disable-threads --enable-small is now
        thread-safe if the compiler supports
        __attribute__((__constructor__)).

    * xz:

      - Added support for OpenBSD's pledge(2) as a sandboxing method.

      - Don't mention endianness for ARM and ARM-Thumb filters in
        --long-help. The filters only work for little endian
        instruction encoding, but modern ARM processors using big
        endian data access still use little endian instruction
        encoding, so the help text was misleading.
        In contrast, the PowerPC filter is only for big endian
        32/64-bit PowerPC code. Little endian PowerPC would need a
        separate filter.

      - Added --experimental-arm64. This will be renamed once the
        filter is finished. Files created with this experimental
        filter will not be supported in the future!

      - Added new fields to the output of xz --robot --info-memory.

      - Added decompression support for the .lz (lzip) file format
        version 0 and the original unextended version 1. It is
        autodetected by default. See also the option --format on the
        xz man page.

    * Scripts now support the .lz format using xz.

    * Build systems:

      - New #defines in config.h: HAVE_ENCODER_ARM64,
        HAVE_DECODER_ARM64, HAVE_LZIP_DECODER, HAVE_CPUID_H,
        HAVE_FUNC_ATTRIBUTE_CONSTRUCTOR, HAVE_USABLE_CLMUL

      - New configure options: --disable-clmul-crc,
        --disable-microlzma, --disable-lzip-decoder, and 'pledge' is
        now an option in --enable-sandbox (but it's autodetected by
        default anyway).

      - INSTALL was updated to document the new configure options.

      - PACKAGERS now lists also --disable-microlzma and
        --disable-lzip-decoder as configure options that must not be
        used in builds for non-embedded use.

    * Tests:

      - Fix some of the tests so that they skip instead of fail if
        certain features have been disabled with configure options.
        It's still not perfect.

      - Other improvements to tests.

    * Updated translations: Croatian, Finnish, Hungarian, Polish,
      Romanian, Spanish, Swedish, and Ukrainian.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.8
XZ Utils 5.2.8 is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.2.8 (2022-11-13)

    * xz:

      - If xz cannot remove an input file when it should, this is now
        treated as a warning (exit status 2) instead of an error
        (exit status 1). This matches GNU gzip and is more logical,
        as at that point the output file has already been
        successfully closed.

      - Fix handling of .xz files with an unsupported check type.
        Previously such files printed a warning message but then xz
        behaved as if an error had occurred (didn't decompress, exit
        status 1). Now a warning is printed, decompression is done
        anyway, and exit status is 2. This used to work slightly
        before 5.0.0. In practice this bug matters only if xz has
        been built with some check types disabled. As instructed in
        PACKAGERS, such builds should be done in special situations
        only.

      - Fix "xz -dc --single-stream tests/files/good-0-empty.xz"
        which failed with "Internal error (bug)". That is,
        --single-stream was broken if the first .xz stream in the
        input file didn't contain any uncompressed data.

      - Fix displaying file sizes in the progress indicator when
        working in passthru mode and there are multiple input files.
        Just like "gzip -cdf", "xz -cdf" works like "cat" when the
        input file isn't a supported compressed file format. In this
        case the file size counters weren't reset between files, so
        with multiple input files the progress indicator displayed an
        incorrect (too large) value.

    * liblzma:

      - API docs in lzma/container.h:

        * Update the list of decoder flags in the decoder function
          docs.

        * Explain LZMA_CONCATENATED behavior with .lzma files in
          lzma_auto_decoder() docs.

      - OpenBSD: Use HW_NCPUONLINE to detect the number of available
        hardware threads in lzma_physmem().

      - Fix use of the wrong macro to detect x86 SSE2 support.
        __SSE2_MATH__ was used with GCC/Clang but the correct one is
        __SSE2__. The first one means that SSE2 is used for floating
        point math, which is irrelevant here.
        The affected SSE2 code isn't used on x86-64, so this affects
        only 32-bit x86 builds that use -msse2 without -mfpmath=sse
        (there is no runtime detection for SSE2). It improves LZMA
        compression speed (not decompression).

      - Fix the build with Intel C compiler 2021 (ICC, not ICX) on
        Linux. It defines __GNUC__ to 10 but doesn't support the
        __symver__ attribute introduced in GCC 10.

    * Scripts: Ignore warnings from xz by using --quiet --no-warn.
      This is needed if the input .xz files use an unsupported check
      type.

    * Translations:

      - Updated Croatian and Turkish translations.

      - One new translation wasn't included because it needed
        technical fixes. It will be in the upcoming 5.4.0. No new
        translations will be added to the 5.2.x branch anymore.

      - Renamed the French man page translation file from fr_FR.po
        to fr.po and thus also its install directory (like
        /usr/share/man/fr_FR -> .../fr).

      - Man page translations for the upcoming 5.4.0 are now handled
        in the Translation Project.

    * Updated doc/faq.txt a little so it's less out-of-date.

-- Lasse Collin
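The passthru behavior mentioned in the NEWS extract is easy to verify (assuming a reasonably recent xz in PATH; the file name is just an example):

```shell
# With --decompress --stdout --force (-cdf), xz copies input that
# isn't a supported compressed format as-is, like "gzip -cdf"/"cat".
printf 'not compressed\n' > plain.txt
xz -cdf plain.txt               # prints: not compressed
```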
Re: [xz-devel] XZ Utils 5.3.3alpha
On 2022-09-29 Guillem Jover wrote:
> On Wed, 2022-09-28 at 21:41:59 +0800, Jia Tan wrote:
> > […] The
> > interface for liblzma and xz for the multi threaded decoder does not
> > have any planned changes, so things could probably be developed and
> > tested using 5.3.3.
>
> Ah, thanks, that's reassuring then. It's one of the things I was
> worried about when having to decide whether to merge the patch I've
> got implementing this support into dpkg. So, once the alpha version
> has been packaged for Debian experimental, I'll test the patch and
> commit it.

There are no planned changes but that isn't a *promise* that there won't be any changes before 5.4.0. I don't track API or ABI compatibility within development releases, and thus binaries linked against shared liblzma from one alpha/beta release won't run with liblzma from the next alpha/beta *if* they depend on unstable symbols (symbol versioning stops it). This includes the xz binary itself and would include dpkg too if it uses the threaded decoder. Sometimes it can be worked around with distro-specific patches but that's extra hassle and can go wrong too.

Please don't end up with a result similar to what happened with RHEL/CentOS 7, which ended up affecting users of other distributions too (the fix is included in 5.2.7):

https://git.tukaani.org/?p=xz.git;a=commitdiff;h=913ddc5572b9455fa0cf299be2e35c708840e922

So while I encourage testing, one needs to be careful when it can affect critical tools in the operating system. :-)

-- Lasse Collin
Re: [xz-devel] XZ Utils 5.3.3alpha
On 2022-09-28 Jia Tan wrote:
> On 2022-09-27 Sebastian Andrzej Siewior wrote:
> > Okay, so that is what you are tracking. I remember that there was a
> > stall in the decoding but I don't remember how it played out.
> >
> > I do remember that I had something for memory allocation/ limit but
> > I don't remember if we settled on something or if discussion is
> > needed. Also how many decoding threads make sense, etc.
>
> We ended up changing xz to use (total_ram / 4) as the default "soft
> limit". If the soft limit is reached, xz will decode single threaded.
> The "hard limit" shares the same environment variable and xz option
> (--memlimit-decompress).

There is also the 1400 MiB cap for 32-bit executables.

The memory limiting in threaded decompression (two separate limits in parallel) is one thing where feedback would be important: after the liblzma API, ABI, and xz tool syntax are in a stable release, backward compatibility has to be maintained. Another thing needing feedback is the new behavior of -T0 when no memlimit has been specified. Now it has a default soft limit. I hope it is an improvement but quite possibly it could be improved further. Your suggestion to use MemAvailable on Linux is one thing that could be included if people think it is a good way to go as a Linux-specific behavior (having more benefits than downsides).

These are documented on the xz man page. I hope it is clear enough. It feels a bit complicated, which is a bad sign, but on the other hand I feel the underlying problem isn't as trivial as it seems on the surface. So far Jia Tan and I have received no feedback about these things at all. I would prefer to hear the complaints before 5.4.0 is out. :-)

> > This reminds me that I once posted a patch to use openssl for the
> > sha256.
> > https://www.mail-archive.com/xz-devel@tukaani.org/msg00429.html
> >
> > Some distro is using sha256 instead crc64 by default, I don't
> > remember which one… Not that I care personally ;)
>
> I am unsure if we will have time to include your sha256 patch, but if
> we finish all the tasks with extra time it may be considered.

There's more to this than available time. 5.1.2alpha added support for using SHA-256 from the OS base libraries (not OpenSSL) but starting with 5.2.3 it is disabled by default. Some OS libs use (or used to use) the same symbol names for SHA-256 functions as OpenSSL while having an incompatible ABI. This led to weird problems when an application needed both liblzma and OpenSSL, as liblzma ended up calling OpenSSL functions. Plus, some of the OS-specific implementations were slower than the C code in liblzma (OpenSSL would be faster).

OpenSSL's license has compatibility questions with the GNU GPL. If I remember correctly, some distributions consider OpenSSL to be part of the core operating system and thus avoid the compatibility problem with the GPL. I'm not up to date on how distros handle it in 2022, but perhaps it should be taken into account so that apps depending on liblzma won't get legally unacceptable OpenSSL linkage. So if OpenSSL support is added, it likely should be disabled by default in configure.ac.

> > > This is everything currently planned.

Translations need to be updated too once the strings and man pages are close to final. A development release needs to be sent to the Translation Project at some point. If people want to translate the man pages too, they will need quite a bit of time.

-- Lasse Collin
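The memory-limit behavior discussed above can be exercised from the command line. The limit values below are arbitrary illustrations, not recommendations; the separate soft-limit option (--memlimit-mt-decompress in the 5.4 series) is mentioned only in a comment since it doesn't exist in older stable releases.

```shell
# Exercise the decompression memory limits discussed above
# (arbitrary example values).
set -e

seq 1 100000 > data.tmp
xz -f -k data.tmp               # produces data.tmp.xz

# Hard limit: if decompression would need more memory than this,
# xz refuses (here the limit is generous, so the test passes).
xz -t --memlimit-decompress=200MiB data.tmp.xz

# With -T0 and the default soft limit, threaded decompression falls
# back to single-threaded mode instead of failing when memory is
# short. (The soft limit has its own option, --memlimit-mt-decompress,
# in the 5.4 series.)
xz -t -T0 data.tmp.xz
```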
[xz-devel] XZ Utils 5.2.7
XZ Utils 5.2.7 is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.2.7 (2022-09-30)

* liblzma:

    - Made lzma_filters_copy() never modify the destination array if
      an error occurs. lzma_stream_encoder() and
      lzma_stream_encoder_mt() already assumed this. Before this
      change, if a tiny memory allocation in lzma_filters_copy()
      failed, it would lead to a crash (invalid free() or invalid
      memory reads) in the cleanup paths of these two encoder
      initialization functions.

    - Added a missing integer overflow check to lzma_index_append().
      This affects xz --list and other applications that decode the
      Index field from .xz files using lzma_index_decoder(). Normal
      decompression of .xz files doesn't call this code and thus most
      applications using liblzma aren't affected by this bug.

    - Single-threaded .xz decoder (lzma_stream_decoder()): If
      lzma_code() returns LZMA_MEMLIMIT_ERROR, it is now possible to
      use lzma_memlimit_set() to increase the limit and continue
      decoding. This was supposed to work from the beginning but
      there was a bug. With other decoders (.lzma or the threaded .xz
      decoder) this already worked correctly.

    - Fixed accumulation of integrity check type statistics in
      lzma_index_cat(). This bug made lzma_index_checks() return only
      the type of the integrity check of the last Stream when
      multiple lzma_indexes were concatenated. Most applications
      don't use these APIs but in xz it made xz --list not list all
      check types from concatenated .xz files. In xz --list --verbose
      only the per-file "Check:" lines were affected and in
      xz --robot --list only the "file" line was affected.

    - Added ABI compatibility with executables that were linked
      against liblzma in RHEL/CentOS 7 or other liblzma builds that
      had copied the problematic patch from RHEL/CentOS 7
      (xz-5.2.2-compat-libs.patch). For the details, see the comment
      at the top of src/liblzma/validate_map.sh.

      WARNING: This uses the __symver__ attribute with GCC >= 10.
      In other cases the traditional __asm__(".symver ...") is used.
      Using link-time optimization (LTO, -flto) with GCC versions
      older than 10 can silently result in a broken liblzma.so.5
      (incorrect symbol versions)! If you want to use -flto with GCC,
      you must use GCC >= 10. LTO with Clang seems to work even with
      the traditional __asm__(".symver ...") method.

* xzgrep: Fixed compatibility with old shells that break if comments
  inside command substitutions have apostrophes ('). This problem was
  introduced in 5.2.6.

* Build systems:

    - New #define in config.h: HAVE_SYMBOL_VERSIONS_LINUX

    - Windows: Fixed the liblzma.dll build with Visual Studio project
      files. It broke in 5.2.6 due to a change that was made to
      improve CMake support.

    - Windows: Building liblzma with UNICODE defined should now work.

    - CMake files are now actually included in the release tarball.
      They should have been in 5.2.5 already.

    - Minor CMake fixes and improvements.

* Added a new translation: Turkish

-- Lasse Collin
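The lzma_memlimit_set() fix above concerns liblzma's C API. Python's stdlib lzma module (which wraps liblzma) has no way to raise the limit on a live decoder, but it can at least demonstrate the decoder memory limit tripping and a retry succeeding with a larger limit:

```python
import lzma

# xz preset 9 uses a 64 MiB LZMA2 dictionary, so decoding this stream
# needs far more than 1 MiB of memory.
data = lzma.compress(b"hello" * 10000, preset=9)

try:
    lzma.LZMADecompressor(memlimit=1 << 20).decompress(data)
    hit_limit = False
except lzma.LZMAError:
    hit_limit = True  # the 1 MiB memory limit was exceeded

# Retrying with a generous limit decodes normally.
restored = lzma.LZMADecompressor(memlimit=1 << 30).decompress(data)
```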
[xz-devel] XZ Utils 5.3.3alpha
age.sh to create a code coverage report of the tests.

* Build systems:

    - Automake's parallel test harness is now used to make tests
      finish faster.

    - Added the CMake files to the distribution tarball. These were
      supposed to be in 5.2.5 already.

    - Added liblzma tests to the CMake build.

    - Windows: Fix building of liblzma.dll with the included Visual
      Studio project files.

-- Lasse Collin
Re: [xz-devel] VS projects fail to build the resource file
On 2022-08-18 Olivier B. wrote:
> The cmake windows build in a 5.2.6 git clone seem to build and install
> fine for me!

Good to know, thanks!

> As a small improvement to them, I wouldn't mind if the pdbs were
> installed too in the configurations where they are generated (and
> actually also in release builds)

I see .pdb files are for debug symbols and I see CMake has some properties related to them but I don't know much more. Are the .pdb files generated by default by the CMake-generated debug targets but not by the release targets?

Does the following do something good?

diff --git a/CMakeLists.txt b/CMakeLists.txt
index 2a88af3..ccfb217 100644
--- a/CMakeLists.txt
+++ b/CMakeLists.txt
@@ -499,6 +499,14 @@ install(DIRECTORY src/liblzma/api/
         DESTINATION "${CMAKE_INSTALL_INCLUDEDIR}"
         FILES_MATCHING PATTERN "*.h")
 
+if(MSVC)
+    # Install MSVC debug symbol file if it was generated.
+    install(FILES $<TARGET_PDB_FILE:liblzma>
+            DESTINATION "${CMAKE_INSTALL_BINDIR}"
+            COMPONENT liblzma_Development
+            OPTIONAL)
+endif()
+
 # Install the CMake files that other packages can use to find liblzma.
 set(liblzma_INSTALL_CMAKEDIR
     "${CMAKE_INSTALL_LIBDIR}/cmake/liblzma"

I understood that the above can only work for DLLs. A static library would need compiler-generated debug info, which CMake supports via the COMPILE_PDB_NAME property. If .pdb files aren't created for release builds by default, there likely is a way to enable it. I cannot test MSVC builds now so I won't make many blind guesses.

-- Lasse Collin
Re: [xz-devel] VS projects fail to build the resource file
On 2022-08-18 Olivier B. wrote:
> Yes, indeed. I sent the mail after having only fixed one
> configuration, but the full solution build needs the six modifications

OK, thanks! I committed it to the vs2013, vs2017, and vs2019 files, also to the v5.2 branch.

> Is it normal that CMakeLists and other files are not in the 5.2.6 (or
> 5.3.2) tarball, only in the git?

That's not intentional. It seems that I had forgotten to add those to Automake's dist target. 5.2.5 was supposed to have the experimental CMake files already, as they were mentioned in the NEWS file. It has been fixed, also in the v5.2 branch. Thanks!

-- Lasse Collin
Re: [xz-devel] VS projects fail to build the resource file
On 2022-08-18 Olivier B. wrote:
> I am trying to build 5.2.6 on windows, but, presumably after
> 352ba2d69af2136bc814aa1df1a132559d445616, the build using the MSVC 2013
> project file fails.

Thanks! So the fix for one thing broke another situation. :-( I cannot test but it seems the same addition is needed in six places, not just in the "Debug|Win32" case, right?

diff --git a/windows/vs2013/liblzma_dll.vcxproj b/windows/vs2013/liblzma_dll.vcxproj
index 2bf3e41..f24cd6f 100644
--- a/windows/vs2013/liblzma_dll.vcxproj
+++ b/windows/vs2013/liblzma_dll.vcxproj
@@ -137,6 +137,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -154,6 +155,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -173,6 +175,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -191,6 +194,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -210,6 +214,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H
@@ -228,6 +233,7 @@
       ./;../../src/liblzma/common;../../src/common;../../src/liblzma/api;
+      HAVE_CONFIG_H

I will commit the above to all VS project files if you think it's good.

Does it work with CMake for you? I'm hoping that the VS project files can be removed in the near future and CMake used for building with VS. That way there are fewer build files to maintain.

-- Lasse Collin
[xz-devel] XZ Utils 5.2.6
XZ Utils 5.2.6 is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file:

5.2.6 (2022-08-12)

* xz:

    - The --keep option now accepts symlinks, hardlinks, and setuid,
      setgid, and sticky files. Previously this required using
      --force.

    - When copying metadata from the source file to the destination
      file, don't try to set the group (GID) if it is already set
      correctly. This avoids a failure on OpenBSD (and possibly on a
      few other OSes) where files may get created so that their group
      doesn't belong to the user, and fchown(2) can fail even if it
      needs to do nothing.

    - Cap --memlimit-compress to 2000 MiB instead of 4020 MiB on
      MIPS32 because on MIPS32 userspace processes are limited to
      2 GiB of address space.

* liblzma:

    - Fixed a missing error check in the threaded encoder. If a small
      memory allocation failed, a .xz file with an invalid Index
      field would be created. Decompressing such a file would produce
      the correct output but result in an error at the end. Thus this
      is a "mild" data corruption bug. Note that while a failed
      memory allocation can trigger the bug, it cannot cause invalid
      memory access.

    - The decoder for .lzma files now supports files that have the
      uncompressed size stored in the header and still use the end of
      payload marker (end of stream marker) at the end of the LZMA
      stream. Such files are rare but, according to the documentation
      in LZMA SDK, they are valid. doc/lzma-file-format.txt was
      updated too.

    - Improved 32-bit x86 assembly files:
        * Support Intel Control-flow Enforcement Technology (CET)
        * Use a non-executable stack on FreeBSD.

    - Visual Studio: Use the non-standard _MSVC_LANG to detect the
      C++ standard version in the lzma.h API header. It's used to
      detect when "noexcept" can be used.

* xzgrep:

    - Fixed arbitrary command injection via a malicious filename
      (CVE-2022-1271, ZDI-CAN-16587). A standalone patch for this was
      released to the public on 2022-04-07. A slight robustness
      improvement has been made since then and, if using GNU or *BSD
      grep, a new faster method is now used that doesn't use the old
      sed-based construct at all. This also fixes bad output with GNU
      grep >= 3.5 (2020-09-27) when xzgrepping binary files.

      This vulnerability was discovered by:
      cleemy desu wayo working with Trend Micro Zero Day Initiative

    - Fixed detection of corrupt .bz2 files.

    - Improved error handling to fix the exit status in some
      situations and to fix the handling of signals: in some
      situations a signal didn't make xzgrep exit when it clearly
      should have. It's possible that the signal handling still isn't
      quite perfect but hopefully it's good enough.

    - Documented exit statuses on the man page.

    - xzegrep and xzfgrep now use "grep -E" and "grep -F" instead of
      the deprecated egrep and fgrep commands.

    - Fixed parsing of the options -E, -F, -G, -P, and -X. The
      problem occurred when multiple options were specified in a
      single argument, for example,

          echo foo | xzgrep -Fe foo

      treated foo as a filename because -Fe wasn't correctly split
      into -F -e.

    - Added zstd support.

* xzdiff/xzcmp:

    - Fixed a wrong exit status. The exit status could be 2 when the
      correct value is 1.

    - Documented on the man page that an exit status of 2 is used for
      decompression errors.

    - Added zstd support.

* xzless:

    - Fix less(1) version detection. It failed if the version number
      from "less -V" contained a dot.

* Translations:

    - Added new translations: Catalan, Croatian, Esperanto, Korean,
      Portuguese, Romanian, Serbian, Spanish, Swedish, and Ukrainian

    - Updated the Brazilian Portuguese translation.

    - Added a French man page translation. This and the existing
      German translation aren't complete anymore because the English
      man pages got a few updates and the translators weren't reached
      so that they could update their work.

* Build systems:

    - Windows: Fix building of resource files when config.h isn't
      used. CMake + Visual Studio can now build liblzma.dll.

    - Various fixes to the CMake support. Building static or shared
      liblzma should work fine in most cases. In contrast, building
      the command line tools with CMake is still clearly incomplete
      and experimental and should be used for testing only.

-- Lasse Collin
Re: [xz-devel] [PATCH] LZMA_FINISH will now trigger LZMA_BUF_ERROR on truncated xz files right away
On 2022-04-21 Jia Tan wrote:
> The current behavior of LZMA_FINISH in the decoder is a little
> confusing because it requires calling lzma_code a few times without
> providing more input to trigger a LZMA_BUF_ERROR.

The current behavior basically ignores the use of LZMA_FINISH when determining if LZMA_BUF_ERROR should be returned. I understand that it can be confusing since after LZMA_FINISH there is nothing a new call to lzma_code() can do to avoid the problem. However, I don't think it's a problem in practice:

- An application that calls lzma_code() in a loop will just call lzma_code() again and eventually get LZMA_BUF_ERROR.

- An application that does single-shot decoding without a loop tends to check for LZMA_STREAM_END as a success condition and treats other codes, including LZMA_OK, as a problem. In the worst case, a less robust application could break if this LZMA_OK becomes LZMA_BUF_ERROR, as the existing API doc says that LZMA_BUF_ERROR won't be returned immediately. The docs don't give any indication that LZMA_FINISH could affect this behavior.

- An extra call or two to lzma_code() in an error condition doesn't matter in terms of performance.

> This patch replaces return LZMA_OK lines with:
>
> return action == LZMA_FINISH && *out_pos != out_size ? LZMA_BUF_ERROR
> : LZMA_OK;

I don't like replacing a short statement with a copy-pasted long statement since it is needed in so many places. A benefit of the current approach is that the handling of LZMA_BUF_ERROR is in lzma_code() and (most of the time) the rest of the code can ignore the problem completely.

Also, the condition *out_pos != out_size is confusing in a few places. For example, in SEQ_STREAM_HEADER:

--- a/src/liblzma/common/stream_decoder.c
+++ b/src/liblzma/common/stream_decoder.c
@@ -118,7 +118,8 @@ stream_decode(void *coder_ptr, const lzma_allocator *allocator,
 		// Return if we didn't get the whole Stream Header yet.
 		if (coder->pos < LZMA_STREAM_HEADER_SIZE)
-			return LZMA_OK;
+			return action == LZMA_FINISH && *out_pos != out_size
+					? LZMA_BUF_ERROR : LZMA_OK;
 
 		coder->pos = 0;

In SEQ_STREAM_HEADER no output can be produced, only input will be read. Still the condition checks for a full output buffer, which is not only confusing but wrong: if there was an empty Stream ahead, having no output space would be fine! In such a situation this can return LZMA_OK even when the intention was to return LZMA_BUF_ERROR due to truncated input. To make this work, only places that can produce output should check if the output buffer is full.

However, I don't think the current behavior is worth changing. As you pointed out, it is a bit weird (and I had never noticed it myself before you mentioned it). It's not actually broken though, and some applications doing single-shot decoding might even rely on the current behavior. Trying to change this could cause problems in rare cases and, if not done carefully enough, introduce new bugs. So I thank you for the patch but it won't be included.

-- Lasse Collin
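The loop behavior being debated can be observed through any liblzma wrapper. As a hedged illustration using Python's stdlib lzma module (not the C API the patch targets): a truncated .xz stream never reports end-of-stream, which is the application-visible symptom behind the eventual LZMA_BUF_ERROR, and a single-shot caller that checks for end-of-stream as its success condition still detects the truncation.

```python
import lzma

data = lzma.compress(b"abc" * 1000)

d = lzma.LZMADecompressor()
d.decompress(data[:-8])   # drop part of the stream footer

# End of stream was never reached, so the decoder is still waiting
# for more input; this mirrors checking for LZMA_STREAM_END in C.
truncated_detected = not d.eof and d.needs_input
```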
[xz-devel] Man page translations for XZ Utils 5.2.6
Hello! A bugfix release will be made around mid-August 2022. The German and French translations of the man pages need updating. There are a few small changes to the factual content but there are also style changes which increase the number of strings that have been modified. A pre-release snapshot from the v5.2 branch is available here:

    https://tukaani.org/xz/xz-5.2.5-85-g275de.tar.xz

A tiny thing: I changed the po4a --copyright-holder argument to "[See the headers in the input files.]" since the three small man pages inherited from GNU gzip are GNU GPLv2+. It affects the comment that gets put on top of xz-man.pot.

The strings in the command line tools haven't changed since 5.2.5 or even 5.2.4, apart from one string being removed completely. Jia Tan fixed all white-space bugs in the pending translations so 5.2.6 will have many new translations. :-)

With 5.2.6 I will also finally release 5.3.3alpha. The development branch has some of the difficult strings split into separate strings for easier translation. I suppose 5.3.3alpha or a later snapshot could be sent to the Translation Project somewhat soon, and perhaps creation of the xz-man domain could be reconsidered at the same time, since I got new feedback wishing for xz-man in the TP. Clearly there are people who wish to translate the man pages. :-)

I won't be at my computer for about two weeks so I won't be able to reply to emails before that. Thanks!

-- Lasse Collin
Re: [xz-devel] Question about using Java API for geospatial data
On 2022-07-09 Gary Lucas wrote:
> I am using the library to compress a public-domain data product called
> ETOPO1. ETOPO1 provides a global-scale grid of 233 million elevation
> and ocean depth samples as integer meters. My implementation
> compresses the data in separate blocks of about 20 thousand values
> each.

So that is about 12 thousand blocks?

> Previously, I used Huffman coding and Deflate to reduce the size
> of the data to about 4.39 bits per value. With your library, LZMA
> reduces that to 4.14 bits per value and XZ to 4.16.

Is the compressed size of each block about ten kilobytes?

> The original implementation requires an average of 4.8 seconds to
> decompress the full set of 233 million points. The LZMA version
> requires 15.2 seconds, and the XZ version requires 18.9 seconds.

The Deflate implementation in java.util.zip uses zlib (native code). XZ for Java is pure Java. LZMA is significantly slower than Deflate, and being pure Java makes the difference even bigger.

> My understanding is that XZ should perform better than LZMA. Since
> that is not the case, could there be something suboptimal with the way
> my code uses the API?

The core compression code is the same in both: XZ uses LZMA2, which is LZMA with framing. XZ adds a few features like filters, integrity checking, and block-based random access reading.

> And here are the Code Snippets:

The XZ examples don't use XZ for Java directly. This is clear due to the "Xz" vs. "XZ" difference in the class names and because XZOutputStream has no constructor that takes the input size as an argument.

Non-performance notes:

- The section "When uncompressed size is known beforehand" in XZInputStream is worth reading. Basically, add a check that "xzIn.read() == -1" is true at the end to verify the integrity check. This at least used to be true (I haven't tested recently) for GZIPInputStream too.

- When compressing, .finish() is redundant. .close() will do it anyway.

- If XZ data is embedded inside another file format, you may want to use SingleXZInputStream instead of XZInputStream. XZInputStream supports concatenated streams, which are possible in standalone .xz files but probably shouldn't occur when embedded inside another format. In your case this likely makes no difference in practice.

Might affect performance:

- The default LZMA2 dictionary size is 8 MiB. If the uncompressed size is known to be much smaller than this, it's a waste of memory to use so big a dictionary. In that case pick a value that is at least as big as the largest uncompressed size, possibly rounded up to a power of two.

- Compressing or decompressing multiple streams that use identical settings means creating many compressor or decompressor instances. To reduce garbage collector pressure there is ArrayCache, which reuses large array allocations. You can enable this globally with:

      ArrayCache.setDefaultCache(BasicArrayCache.getInstance());

  However, setting the default like this might not be desired if multiple unrelated things in the application might use XZ for Java. Note that ArrayCache can help both the LZMA and XZ classes.

Likely will affect performance:

- Since the compression ratio is high, integrity checking starts to become more significant for performance. To test how much integrity checking slows XZ down, use the SingleXZInputStream or XZInputStream constructor that takes "boolean verifyCheck" and set it to false. You can also compress to XZ without integrity checking at all (using XZ.CHECK_NONE as the third argument in the XZOutputStream constructor). Using XZ.CHECK_CRC32 is likely much faster than the default XZ.CHECK_CRC64 because CRC32 comes from java.util.zip, which uses native code from zlib.

It's quite possible that XZ provides no value over raw LZMA in this application, especially if you don't need integrity checking. Raw LZMA instead of .lzma will even avoid the 13-byte .lzma header, saving 150 kilobytes with 12 thousand blocks. If the uncompressed size is stored in the container headers, then a further 4-5 bytes per block can be saved by telling the size to the raw LZMA encoder and decoder. Note that LZMAOutputStream and LZMAInputStream support both .lzma and raw LZMA: the choice between them is made by picking the right constructors.

Finally, it might be worth playing with the lc/lp/pb parameters in LZMA/LZMA2. Usually those make only a tiny difference but with some data types they have a bigger effect. These won't affect performance other than that the smaller the compressed file, the faster it tends to decompress in the case of LZMA/LZMA2.

Other compressors might be worth trying too. Zstandard typically compresses only slightly worse than XZ/LZMA but it is *a lot* faster to decompress.

-- Lasse Collin
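Most of the knobs above have equivalents in any liblzma wrapper. As an illustration (using Python's stdlib lzma module rather than XZ for Java, so the class names differ): a smaller dictionary, a cheaper CRC32 check, and a raw LZMA1 stream that drops the 13-byte .lzma container header:

```python
import lzma

data = b"elevation sample " * 2000

# Smaller dictionary than the 8 MiB default, CRC32 instead of CRC64:
xz_small = lzma.compress(
    data, format=lzma.FORMAT_XZ, check=lzma.CHECK_CRC32,
    filters=[{"id": lzma.FILTER_LZMA2, "preset": 6, "dict_size": 1 << 16}])

# .lzma container vs. raw LZMA1 with identical settings; the raw
# stream has no header, so the decoder must get the same filter
# settings out of band.
lzma1 = [{"id": lzma.FILTER_LZMA1, "preset": 6, "dict_size": 1 << 16}]
alone = lzma.compress(data, format=lzma.FORMAT_ALONE, filters=lzma1)
raw = lzma.compress(data, format=lzma.FORMAT_RAW, filters=lzma1)

restored = lzma.decompress(raw, format=lzma.FORMAT_RAW, filters=lzma1)
```

The lc/lp/pb parameters mentioned at the end can be set the same way, as "lc", "lp", and "pb" keys in the filter specification.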
Re: [xz-devel] XZ for Java
On 2022-06-21 Dennis Ens wrote:
> Why not pass on maintainership for XZ for C so you can give XZ for
> Java more attention? Or pass on XZ for Java to someone else to focus
> on XZ for C? Trying to maintain both means that neither are
> maintained well.

Finding a co-maintainer or passing the projects completely to someone else has been on my mind for a long time but it's not a trivial thing to do. For example, someone would need to have the skills, time, and enough long-term interest specifically for this. There are many other projects needing more maintainers too.

As I have hinted in earlier emails, Jia Tan may have a bigger role in the project in the future. He has been helping a lot off-list and is practically a co-maintainer already. :-) I know that not much has happened in the git repository yet but things happen in small steps. In any case some change in maintainership is already in progress at least for XZ Utils.

-- Lasse Collin
Re: [xz-devel] XZ for Java
On 2022-06-07 Jigar Kumar wrote:
> Progress will not happen until there is new maintainer. XZ for C has
> sparse commit log too. Dennis you are better off waiting until new
> maintainer happens or fork yourself. Submitting patches here has no
> purpose these days. The current maintainer lost interest or doesn't
> care to maintain anymore. It is sad to see for a repo like this.

I haven't lost interest but my ability to care has been fairly limited, mostly due to long-term mental health issues but also due to some other things. Recently I've worked off-list a bit with Jia Tan on XZ Utils and perhaps he will have a bigger role in the future, we'll see. It's also good to keep in mind that this is an unpaid hobby project.

Anyway, I assure you that I know far too well about the problem that not much progress has been made. The thought of finding new maintainers has existed for a long time too, as the current situation is obviously bad and sad for the project.

A new XZ Utils stable branch should get released this year with the threaded decoder etc. and a few alpha/beta releases before that. Perhaps the moment after the 5.4.0 release would be a convenient moment to make changes in the list of project maintainer(s).

Forks are obviously another possibility and I cannot control that. If those happen, I hope that file format changes are done so that no silly problems occur (like using the same ID for different things in two projects). 7-Zip supports .xz, and keeping its developer Igor Pavlov informed about format changes (including new filters) is important too.

-- Lasse Collin
Re: [xz-devel] XZ for Java
On 2022-05-19 Dennis Ens wrote:
> Is XZ for Java still maintained?

Yes, by some definition at least, like if someone reports a bug it will get fixed. Development of new features definitely isn't very active. :-(

> I asked a question here a week ago and have not heard back.

I saw. I have lots of unanswered emails at the moment and obviously that isn't a good thing. After the latest XZ for Java release I've tried to focus on XZ Utils (and ignored XZ for Java), although obviously that hasn't worked so well either, even if some progress has happened with XZ Utils.

> When I view the git log I can see it has not updated in over a year.
> I am looking for things like multithreaded encoding / decoding and a
> few updates that Brett Okken had submitted (but are still waiting for
> merge). Should I add these things to only my local version, or is
> there a plan for these things in the future?

I haven't reviewed Brett Okken's patches so I cannot give definite answers about whether you should include them in your local version, sorry.

The match finder optimizations are more advanced as they are somewhat arch-specific, so it could be good to have broader testing of how much they help on different systems (not just x86-64 but 32-bit x86, ARM64, ...) and whether they behave well on Android too. The benefits have to be clear enough (and cause no problems) to make the extra code worth it.

The Delta coder patch is small and the relative improvement is big, so that likely should get included. The Delta filter is used rarely though, and even a slow version isn't *that* slow in the big picture (there will also be LZMA2 and CRC32/CRC64).

Threading would be nice in the Java version. Threaded decompression only recently got committed to the XZ Utils repository.

Jia Tan has helped me off-list with XZ Utils and he might have a bigger role in the future, at least with XZ Utils. It's clear that my resources are too limited (thus the many emails waiting for replies) so something has to change in the long term.
-- Lasse Collin
Re: [xz-devel] [PATCH] xz: Fix setting memory limit on 32-bit systems
On 2021-01-20 Sebastian Andrzej Siewior wrote:
> On 2021-01-18 23:52:50 [+0200], Lasse Collin wrote:
> > I have understood that *in practice* the problem with the xz command
> > line tool is limited to "xz -T0" usage so fixing this use case is
> > enough for most people. Please correct me if I missed something.
>
> Correct.

There is some code for special behavior with -T0 now for both compression and decompression. I haven't updated the man page yet but the commit messages should be helpful. I hope it can be documented so that it sounds simple enough. :-)

> In the parallel decompress I added code on Linux to query the
> available memory. I would prefer that as an upper limit on 64bit if no
> limit is given. The reason is that *this* amount of memory is safe to
> use without over-committing / involving swap.

This may be the way to go on Linux but I didn't add it yet. The committed code uses total_ram / 4. Since MemAvailable is Linux-specific, something more broadly available needs to exist for better portability, and total_ram / 4 could perhaps be it. It can be tweaked if needed; it's just a starting point.

> For 32bit applications I would cap that limit to 2.5 GiB or so. The
> reason is that the *normal* case is to run 32bit application on a
> 32bit kernel and so likely only 3GiB can be addressed at most (minus
> a few details like linked in libs, NULL page, guard pages and so on).
> The 32bit application on 64bit kernel is probably a shortcut where
> something is done a 32bit chroot - like building a package.
>
> I'm not sure what a sane upper limit is on other OSes. Limitting it on
> 32bit does probably more good than bad if there is no -M parameter.

I think a generic cap needs to be below 2 GiB because, for example, 32-bit MIPS can do only 2 GiB. There could be OS+arch-specific exceptions though. The code currently in xz.git uses 1400 MiB. There needs to be some extra room in case repeated mallocs and frees fragment the address space a little. Perhaps it's too conservative, but it allows eight compression threads at the default xz -6, and one thread at -9 in threaded mode (so it can create a file that can be decompressed in threaded mode).

> > An alternative "fix" for the liblzma case could be adding a simple
> > API function that would scale down the number of threads in a
> > lzma_mt structure based on a memory usage limit and if the
> > application is 32 bits. Currently the thread count and LZMA2
> > settings adjusting code is in xz, not in liblzma.
>
> It might help. dpkg checks the memlimit with
> lzma_stream_encoder_mt_memusage() and decreases the memory limit until
> it fits. It looks simpler compared to rpm's attempt and various
> exceptions.

Now that the lzma_mt structure contains memlimit_threading already, a flag could be added to use it to reduce the number of threads at encoder initialization. I suppose reducing the thread count would go a long way. It doesn't affect the compressed output so it can be done when people wish for reproducible output.

> > The idea for the current 4020 MiB special limit is based on a patch
> > that was in use in FreeBSD to solve the problem of 32-bit xz on
> > 64-bit kernel. So at least FreeBSD should be supported to not make
> > 32-bit xz worse under 64-bit FreeBSD kernel.
>
> Is this a common case?

I don't *know* but I guess some build 32-bit packages on a 64-bit kernel, so it may be a common enough use case.

> While poking around, Linux has this personality() syscall/function.
> There is a flag called PER_LINUX32_3GB and PER_LINUX_32BIT which are
> set if the command is invoked with `linux32' say
>
>     linux32 xz
>
> then it would set that flag set and could act. It is not set by
> starting a 32bit application on a 64bit kernel on its own or on a
> 32bit kernel. I don't know if this is common practise but I use this
> in my chroots. So commands like `uname -m' return `i686' instead of
> `x86_64'. If other chroot environments do it as well then it could be
> used as a hack to assume that it is run on 64bit kernel. That is if
> we want that ofcourse :)

I haven't looked at this but it sounds like it could be useful. If xz knows that it has 4 GiB of address space, the default limit could be much higher.

-- Lasse Collin
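The dpkg-style adjustment mentioned above can be sketched like this; per_thread_usage stands in for what lzma_stream_encoder_mt_memusage() would report for one thread, and the MiB figures in the assertions are illustrative rather than exact xz numbers.

```python
MIB = 1 << 20

def threads_for_limit(requested_threads, per_thread_usage, memlimit):
    # Drop the thread count until the estimated encoder memory usage
    # fits under the limit; never go below one thread.
    threads = requested_threads
    while threads > 1 and threads * per_thread_usage > memlimit:
        threads -= 1
    return threads
```

Reducing the thread count this way doesn't change the compressed output, which is why it's compatible with reproducible output, as noted above.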
[xz-devel] xzgrep security fix for XZ Utils <= 5.2.5, 5.3.2alpha (ZDI-CAN-16587)
Malicious filenames can make xzgrep write to arbitrary files or (with a GNU sed extension) lead to arbitrary code execution. xzgrep from XZ Utils versions up to and including 5.2.5 is affected. 5.3.1alpha and 5.3.2alpha are affected as well. This patch works for all of them. This bug was inherited from gzip's zgrep. gzip 1.12 includes a fix for zgrep.

This vulnerability was discovered by:
cleemy desu wayo working with Trend Micro Zero Day Initiative

The patch and signature are available here:

    https://tukaani.org/xz/xzgrep-ZDI-CAN-16587.patch
    https://tukaani.org/xz/xzgrep-ZDI-CAN-16587.patch.sig

It is also linked from the XZ Utils home page <https://tukaani.org/xz/>.

-- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
On 2022-03-17 Jia Tan wrote:
> I attached two patches to this message. The first should fix a bug
> with the timeouts.

Thanks! This and the deadlock are now fixed (I committed them a few days ago).

> The second patch is for the memlimit_threading update. I added a new
> API function that will fail for anything that is not the multithreaded
> decoder.

I need to consider this a little later.

Some of the things I will do next (some already have a patch on this list):

- Add a fail-fast flag to lzma_stream_decoder_mt().

- Possibly fix a corner case in the threaded coder if lzma_code() is called in a similar way as in zpipe.c in <https://zlib.net/zlib_how.html>. That is, currently it doesn't work but it can be made to work, I think. Supporting it makes the threaded decoder a little easier to adapt to existing apps if they use that kind of decoding loop.

- --memlimit-threading: I wrote this weeks ago except for a few details that need to be decided. For example, I guess -M should set --memlimit-threading just like it sets --memlimit-compress and --memlimit-decompress.

- An initial version of the automatic memlimit with --threads=0. The first version can be based on lzma_physmem() but other methods can be added. Sebastian's patch uses MemAvailable on Linux; your patch uses freemem from sysinfo(), which equals MemFree in /proc/meminfo. I suppose MemAvailable is a better starting point.

- Support for forcing single/multi-threaded mode with --threads for cases when xz decides to use only one thread.

- Fix changing the memlimit after LZMA_MEMLIMIT_ERROR in the old single-threaded decoder. (I knew it's a rare use case but clearly it's not a use case at all since I haven't seen bug reports.)

- Your test framework patches

I suppose then the next alpha release is close to ready.

-- Lasse Collin
Re: [xz-devel] Re: improve java delta performance
> On Thu, May 6, 2021 at 4:18 PM Brett Okken > wrote: > > > These changes reduce the time of DeltaEncoder by ~65% and > > DeltaDecoder by ~40%, assuming using arrays that are several KB in > > size. On 2022-02-12 Brett Okken wrote: > Can this be reviewed? It looks reasonable but I try to focus on XZ Utils at the moment. The Delta code in XZ Utils is also very simple and could be optimized the same way. But since Delta isn't used alone (it's used together with LZMA2) I suspect the overall improvement isn't big. It could still be done as it is simple but I won't look at it now. For the ArrayUtil patch, it's a complex one and I'm not able to look at it for now. -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
On 2022-03-15 Jia Tan wrote: > As promised, I have attached a patch to solve the problem. Instead of > doing as I had originally proposed, I simply added a wake up signal > to a sleeping thread if partial updates are enabled. When the worker > wakes up, it checks if no more input > is available and signals to the main thread if it has output ready > before going back > to sleep. This prevents the deadlock on my liblzma tests and testing > xz with/without timeout. Thanks to both of you for debugging this. I see now that I had completely missed this corner case. The patch looks correct except that the mutex locking order is wrong which can cause a new deadlock. If both thr->mutex and coder->mutex are locked at the same time, coder->mutex must be locked first. About memlimit updates, that may indeed need some work but I don't know yet how much is worth the trouble. stream_decoder_mt_memconfig() has a few FIXMEs too, maybe they don't need to be changed but it needs to be decided. I'm in a hurry now but I should have time for xz next week. :-) -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
Hello! Once again, sorry for the delay. I will be busy the rest of the week. I will get back to xz early next week. On 2022-03-07 Sebastian Andrzej Siewior wrote: > 32 cores: > > | $ time ./src/xz/xz -tv tars.tar.xz -T0 > | tars.tar.xz (1/1) > | 100 % 2.276,2 MiB / 18,2 GiB = 0,122 1,6 GiB/s 0:11 > | > | real 0m11,162s > | user 5m44,108s > | sys 0m1,988s > > 256 cores: > | $ time ./src/xz/xz -tv tars.tar.xz -T0 > | tars.tar.xz (1/1) > | 100 % 2.276,2 MiB / 18,2 GiB = 0,122 3,4 GiB/s 0:05 > | > | real 0m5,403s > | user 4m0,298s > | sys 0m24,315s > > it appears to work :) If I see this right, then the file is too small > or xz too fast but it does not appear that xz manages to create more > than 100 threads. Thanks! The scaling is definitely good enough. :-) Even if there was room for improvement I won't think about it much for now. A curious thing above is the ratio of user-to-sys time. With more threads a lot more is spent in syscalls. > and decompression to disk > | $ time ~bigeasy/xz/src/xz/xz -dvk tars.tar.xz -T0 > | tars.tar.xz (1/1) > | 100 % 2.276,2 MiB / 18,2 GiB = 0,122 746 MiB/s 0:24 > | > | real 0m25,064s > | user 3m49,175s > | sys 0m29,748s > > appears to block at around 10 to 14 threads or so and then it hangs > at the end until disk I/O finishes. Decent. > Assuming disk I/O is slow, say 10MiB/s, and we would have 388 CPUs > (blocks/2), then it would decompress the whole file into memory and > get stuck on disk I/O? Yes. I wonder if the way xz does I/O might affect performance. Every time the 8192-byte input buffer is empty (that is, liblzma has consumed it), xz will block reading more input until another 8192 bytes have been read. As long as threads can consume more input, each call to lzma_code() will use all 8192 bytes. Each call might pass up to 8192 bytes of output from liblzma to xz too. If the compression ratio is high and reading input isn't very fast, then perhaps performance might go down because blocking on input prevents xz from producing more output.
Only when liblzma cannot consume more input will xz produce output at full speed. That is, I wonder if with slow input the output speed will be limited until the input buffers inside liblzma have been filled. My explanation isn't very good, sorry. Ideally input and output would be in different threads but the liblzma API doesn't really allow that. Based on your benchmarks the current method is likely good enough in practice. > In terms of scaling, xz -tv of that same file with -T1…64: [...] > time of 1 CPU / 64 = (3 * 60 + 38) / 64 = 3.40625 > > Looks okay. Yes, thanks! > > If the input is broken, it should produce as much output as the > > single-threaded stable version does. That is, if one thread detects > > an error, the data before that point is first flushed out before > > the error is reported. This has pros and cons. It would be easy to > > add a flag to allow switching to fast error reporting for > > applications that don't care about partial output from broken > > files. > > I guess most of them don't care because an error is usually an abort, > the sooner, the better. It is probably the exception that you want > to decompress it despite the error and maybe go on with the next block > and see what is left. I agree. Over 99 % of the time any error means that the whole output will be discarded. However, I would like the threaded decoder to (optionally) have very similar external behavior to the single-threaded version for cases where it might matter. It's not perfect at the moment but I think it's decent enough (bugs excluded). Truncated files are a special case of corrupt input because, unless LZMA_FINISH is used, liblzma cannot know if the input is truncated or if there is merely a pause in the input for some application-specific reason. That can result in LZMA_BUF_ERROR but if the application knows that such pauses are possible then it can handle LZMA_BUF_ERROR specially and continue decoding when more input is available. -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
Hello! I committed something. The liblzma part shouldn't need any big changes, I hope. There are a few FIXMEs but some of them might actually be fine as is. The xz side is just an initial commit, there isn't even --memlimit-threading option yet (I will add it). Testing is welcome. It would be nice if someone who has 12-24 hardware threads could test if it scales well. One needs a file with like a hundred blocks, so with the default xz -6 that means a 2.5 gigabyte uncompressed file, smaller if one uses, for example, --block-size=8MiB when compressing. If the input is broken, it should produce as much output as the single-threaded stable version does. That is, if one thread detects an error, the data before that point is first flushed out before the error is reported. This has pros and cons. It would be easy to add a flag to allow switching to fast error reporting for applications that don't care about partial output from broken files. -- Lasse Collin
Re: [xz-devel] [PATCH] liblzma: Use non-executable stack on FreeBSD as on Linux
On 2022-02-11 Ed Maste wrote: > src/liblzma/check/crc32_x86.S | 4 ++-- > src/liblzma/check/crc64_x86.S | 4 ++-- > 2 files changed, 4 insertions(+), 4 deletions(-) I have committed (but not tested) this. Thanks! -- Lasse Collin
Re: [xz-devel] xz-utils-man.po, French translation
On 2022-02-10 Mario Blättermann wrote: > The file is broken; due to some markup errors it produces only one of > the manpages. See the attached patch. Sorry to all, this time I had skipped testing and checking the translation before committing it and it broke the build (po4a failure). I have committed your patch. Now it works. :-) Thanks! > Lasse, besides the markup issues, both French and German translations > are meanwhile incomplete and partially outdated. Yes, although in the context of the v5.2 branch they should be slightly less outdated. The master branch is still in alpha stage and not meant for any distribution like Debian. This is a problem with translations as it's not clear if v5.2 or master should be translated. They don't differ much but still. If 5.2.6 will be needed, then translating v5.2 might make more sense, maybe. > Please update po4a/xz-man.pot, and then consider creating a kind of > "intermediate" tarball and sending it to the TP robot, requesting a new > TP domain for "xz-man". I tried requesting an xz-man domain a year ago and that didn't go well for a few reasons. Maybe I will dare to retry when the master branch is getting close to becoming a stable release. Or it might be easier to handle the man pages outside the Translation Project, we'll see. There are many open issues in the project that have accumulated over the years; translations are unfortunately just one thing. I have many xz-related emails that I haven't answered yet. So the situation is a bit chaotic. My life situation is now a little different and I'm hoping I can focus on xz more now. So I'm trying to sort this out, we'll see how it goes in the next 2-4 months. I'm hoping to commit a version of the threaded decoder in a few days. All big FIXMEs are solved, only a few small ones to do. :-) Gitweb is working again. -- Lasse Collin
Re: [xz-devel] xz-utils-man.po, French translation
On 2022-01-08 Jean-Pierre Giraud wrote: > Package xz-utils > version 5.2.5-2 > > Hi, > Please find attached the french translation of the xz-utils manpage > done by "bubu" and proofread by the debian-l10n-french mailing list > contributors. Thanks! Committed. -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
On 2021-12-31 Sebastian Andrzej Siewior wrote: > On 2021-12-15 23:33:58 [+0200], Lasse Collin wrote: > > Yes. It's fairly simple from implementation point of view but is it > > clear enough for the users, I'm not sure. > > > > I suppose the alternative is having just one limit value and a flag > > to tell if it is a soft limit (so no limit for single-threaded > > case) or a hard limit (return LZMA_MEM_ERROR if too low for even > > single thread). Having separate soft and hard limits instead can > > achieve the same and a little more, so I think I'll choose the > > two-value approach and hope it's clear enough for users. > > The two-value approach might work. I'm not sure if the terms `soft' and > `hard' are good here. Using `memlimit' and `memlimit_threaded' (or so) > might make it more obvious and easier to understand. > But then this is just some documentation that needs to be read and > understood so maybe `softlimit' and `hardlimit' will work just fine. I now plan to use memlimit_threading and memlimit_stop in the lzma_mt structure. Documentation is still needed but hopefully those are a bit more obvious. > > I was hoping to get this finished by Christmas but due to a recent > > sad event, late January is my target for the next alpha release > > now. And I'm late again. :-( This is more work than I had expected because there unfortunately are a few problems in the code and fixing them all requires quite significant changes (and I'm slow). As a bonus, working on this made me notice a few small bugs in the old liblzma code too (not yet committed). The following tries to explain some of the problems and what I have done locally. I don't have code to show yet because it still contains too many small FIXMEs but, as unbelievable as it might sound, this will get done. I need a few more days; I have other things I must do too. The biggest issue is handling of memory usage and threaded vs. direct mode.
The memory usage limiting code makes assumptions that are true with the most common files but there are situations where these assumptions fail: (1) If a non-first Block requires a lot more memory than the first Block and so the memory limit would be exceeded in threaded mode, the decoder will not switch to direct mode even with LZMA_MEMLIMIT_COMPLETE. Instead the decoder proceeds with one thread and uses as much memory as that needs. (2) If a non-first Block lacks size info in its Block Header, the decoder won't switch to direct mode. It returns LZMA_PROG_ERROR instead. (3) The per-thread input buffers can grow as bigger Blocks are seen but the buffers cannot shrink. This has pros and cons. It's a problem if a single Block is very big and others are not. I thought it's better to first decode the Block Header to coder->block_options and then, based on the facts from that Block Header, determine memory usage and how to proceed (including switching to/from direct mode). This way there is no need to assume or expect anything. (coder->block_options need to be copied to a thread-specific structure before initializing the decoder.) For direct mode, I added separate SEQ states for it. This also helps making the code more similar to the single-threaded decoder in both looks and behavior. I hope that with memlimit_threading = 0 the threaded version can have identical externally-visible behavior as the original single-threaded version. This way xz doesn't need both functions (the single-threaded function is still needed if built with --disable-threads). Corner cases of the buffer-to-buffer API: (4) In some use cases there might be long pauses where no new input is available (for example, sending a live log file over network with compression). It is essential that the decoder will still provide all output that is easily possible from the input so far. That is, if the decoder was called without providing any new input, it might need to be handled specially. 
SEQ_BLOCK_HEADER and SEQ_INDEX return immediately if the application isn't providing any new input data, and so eventually lzma_code() will return LZMA_BUF_ERROR even when there would be output available from the worker threads. try_copy_decoded() could be called earlier but there is more to fix (see (5) and (6)). (Also remember my comment above that I changed the code so that Block Header is decoded first before getting a thread. That adds one more SEQ point where waiting for output is needed.) (5) The decoder must work when the application provides an output buffer whose size is exactly the uncompressed size of the file. This means that one cannot simply use *out_pos == out_size to determine when to return LZMA_OK. Perhaps the decoder hasn't marked its lzma_outbuf as finished but no more output will be coming, or there is an empty Block (empty Blocks perh
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
On 2021-12-04 Sebastian Andrzej Siewior wrote: > On 2021-11-30 00:25:11 [+0200], Lasse Collin wrote: > > Separate soft and hard limits might be convenient from > > implementation point of view though. xz would need --memlimit-soft > > (or some better name) which would always have some default value > > (like MemAvailable). The threaded decoder in liblzma would need to > > take two memlimit values. Then there would be no need for an enum > > (or a flag) to specify the memlimit mode (assuming that > > LZMA_MEMLIMIT_THREAD is removed). > > Ah I see. So one would say soft-limit 80MiB, hard-limit 2^60 bytes and > would get no threading at all / LZMA_MEMLIMIT_NO_THREAD. And with soft > 1GiB, hard 2^60 bytes one would get the threading mode. (2^60 is a made-up > "no limit".) Yes. It's fairly simple from implementation point of view but is it clear enough for the users, I'm not sure. I suppose the alternative is having just one limit value and a flag to tell if it is a soft limit (so no limit for single-threaded case) or a hard limit (return LZMA_MEM_ERROR if too low for even single thread). Having separate soft and hard limits instead can achieve the same and a little more, so I think I'll choose the two-value approach and hope it's clear enough for users. > > I wonder if relying on the lzma_mt struct is useful for the decoder. > > Perhaps the options could be passed directly as arguments as there > > are still 2-3 fewer than needed for the encoder. > > There is > - num threads > - flags > - memlimit > - timeout > > One struct to rule them all and you could extend it without the need > to change the ABI. > I took one of the reserved ones for the memlimit. If you put the two > memory limits and number of threads in one init/configure function > then only flags and timeout are left. Maybe that would be enough then. You have a valid point. Either approach works, new functions can be added if needed for extending the ABI, but having just one can be nice in the long term.
I was hoping to get this finished by Christmas but due to a recent sad event, late January is my target for the next alpha release now. I hope to include a few other things too, including some of Jia Tan's patches (we've chatted outside the xz-devel list). Thank you for understanding. -- Lasse Collin
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
Hello! On 2021-02-05 Sebastian Andrzej Siewior wrote: > - Added enum `lzma_memlimit_opt' to lzma_stream_decoder_mt() as an > init parameter. The idea is to specify how to obey the memory limit > so the user can keep using one API and not worry about failing due to the > memory limit. Let's assume the archive has a 9MiB dictionary, 24MiB > block of uncompressed data. The archive contains two compressed > blocks of 10 MiB each. Using two threads, the memory requirement is > roughly (9 + 24 + 10) * 2 = 86 MiB > > On a system with 64 MiB of memory with additional 128MiB of swap it > likely leads to the use of (say 30 MiB) swap memory during > decompression which will slow down the whole operation. > The synchronous API would do just fine with only 9 MiB of memory. > > So, to keep things simple, when invoking lzma_stream_decoder_mt() with > a memory limit of 32 MiB, three scenarios are possible: > - LZMA_MEMLIMIT_THREAD > One thread requires 43MiB of memory and would exceed the memory > limit. However, continue with one thread instead of possibly two. > > - LZMA_MEMLIMIT_NO_THREAD > One thread requires 43MiB of memory and would exceed the memory > limit. Fall back to the synchronous API without buffered input / > output memory. > > - LZMA_MEMLIMIT_COMPLETE > In this scenario it would behave like LZMA_MEMLIMIT_NO_THREAD. > However, with a dictionary size > 32MiB it would abort. In the old single-threaded code, if no memory usage limit is specified the worst case memory usage with LZMA2 is about 4 GiB (the format allows a 4 GiB dict although the current encoder only supports 1536 MiB). With the threaded decoder it's the same with LZMA_MEMLIMIT_NO_THREAD. However, LZMA_MEMLIMIT_THREAD sounds a bit scary. There are no practical limits to the block size so there can be a .xz file that makes the decoder allocate a huge amount of memory. It doesn't even need to be an intentionally malicious file, it just needs to have the size fields present.
Thus, I think LZMA_MEMLIMIT_THREAD should be removed. One-thread multi-threaded mode will still be used with LZMA_MEMLIMIT_NO_THREAD if the limit is high enough. LZMA_MEMLIMIT_NO_THREAD should be the default in xz when no memory usage limit has been explicitly specified. There needs to be a default "soft limit" (the MemAvailable method is such) that will drop xz to single-threaded mode if the soft limit is too low for threaded mode (even with just one thread). LZMA_MEMLIMIT_COMPLETE could be the mode to use when a memlimit is explicitly specified (a "hard limit") on the xz command line. This would match the existing behavior of the old single-threaded decoder. It would be good to have a way to specify a soft limit on the xz command line too. It could make sense to have both soft and hard limit at the same time but perhaps it gets too confusing: a soft limit that would be used to restrict the number of threads (and even drop to single-threaded mode) and a hard limit which can return LZMA_MEMLIMIT_ERROR. If one is fine with using 300 MiB in threaded mode but still wants to allow up to 600 MiB in case the file *really* requires that much even in single-threaded mode, then this would be useful. Separate soft and hard limits might be convenient from implementation point of view though. xz would need --memlimit-soft (or some better name) which would always have some default value (like MemAvailable). The threaded decoder in liblzma would need to take two memlimit values. Then there would be no need for an enum (or a flag) to specify the memlimit mode (assuming that LZMA_MEMLIMIT_THREAD is removed). Extra idea, maybe useless: The --no-adjust option could be used to specify that if the specified number of threads isn't possible due to a memlimit then xz will abort. This is slightly weird as it doesn't provide real performance guarantees anyway (block sizes could vary a lot) but it's easy to implement if it is wanted.
I wonder if relying on the lzma_mt struct is useful for the decoder. Perhaps the options could be passed directly as arguments as there are still 2-3 fewer than needed for the encoder. I've made some other minor edits locally already so I would prefer to *not* get new patch revisions until I have committed something. Comments are very welcome. :-) Thanks! -- Lasse Collin
Re: [xz-devel] [PATCH] xz: Multithreaded mode now always uses stream_encoder_mt to ensure reproducible builds
On 2021-11-29 Jia Tan wrote: > This patch addresses the issues with reproducible builds when using > multithreaded xz. Previously, specifying --threads=1 instead of > --threads=[n>1] created different output. Now, setting any number of > threads forces multithreading mode, even if there is only 1 worker > thread. This is an old problem that should have been fixed long ago. Unfortunately I think the fix needs to be a little more complex due to backward compatibility. With this patch, if threading has been enabled, no further option on the command line (except --flush-timeout) will disable threading. Sometimes there are default options (for example, XZ_DEFAULTS) that enable threading and one wants to disable it in a specific situation (like running multiple xz commands in parallel via xargs). If --threads=1 always enables threading, memory usage will be quite a bit higher than in non-threaded mode (94 MiB vs. 166 MiB for the default compression level -6; 674 MiB vs. 1250 MiB for -9). To be backward compatible, maybe it needs extra syntax within the --threads option or a new command line option. Both are a bit annoying and ugly but I don't have a better idea. Currently one-thread multi-threading is done if one specifies two or more threads but the memory limit is so low that only one thread can be used. In that case xz will never switch to non-threaded mode. This ensures that the output file is always the same even if the number of threads gets reduced. When -T0 is used, that is broken in the sense that the threading mode (and thus the encoded output) depends on how many hardware threads are supported. So perhaps -T0 should mean that multi-threaded mode must be used even for a single thread (your patch would do this too). A way to explicitly specify one-thread multi-threaded mode is still needed but I guess it wouldn't need to be used so often if -T0 handles it already.
-T0 needs improvements in default memory usage limiting too, and both changes could make the default behavior better. The opposite functionality could be made available too: if the number of threads becomes one for whatever reason, an option could tell xz to always use single-threaded mode to get better compression and to save RAM. > +#include "common.h" [...] > // The max is from src/liblzma/common/common.h. > hardware_threads_set(str_to_uint64("threads", > - optarg, 0, 16384)); > + optarg, 0, LZMA_THREADS_MAX)); common.h is internal to liblzma and must not be used from xz. Maybe LZMA_THREADS_MAX could be moved to the public API, I don't know right now. -- Lasse Collin
Re: [xz-devel] [PATCH] xz: Added .editorconfig file for simple style guide encouragement
Hello! On 2021-10-30 Jia Tan wrote: > This patch adds a .editorconfig to the root directory. Thanks! I hadn't heard about this before but it sounds nice. > +[*] > +insert_final_newline = true > +trim_trailing_whitespace = true I think it should be fine to add these: charset = utf-8 end_of_line = lf The exceptions are some files under windows/vs*. Those files will hopefully be gone in the future though. They use LF, not CR+LF, but have a BOM: [*.vcxproj,xz_win.sln] charset = utf-8-bom > +[src/,tests/] If the syntax is similar to gitignore, then src/ will match also foo/bar/src/. It doesn't really matter here but I suppose /src/ is a tiny bit more correct. > +indent_style = tab I guess it makes sense to set also indent_size = 8 because viewing the files with any other setting will look weird when long lines are wrapped, and editing can result in wrong word wrapping. There are multiple indentation styles even under src. Instead of specifying directories, how about specifying file suffixes like *.c so it won't matter where the files are. There are .sh files with different styles but maybe it's not that important. I ended up with this: --- # To use this config on your editor, follow the instructions at: # https://editorconfig.org/ root = true [*] charset = utf-8 end_of_line = lf insert_final_newline = true trim_trailing_whitespace = true [*.c,*.h,*.S,*.map,*.sh,*.bash,Makefile*,/configure.ac,/po4a/update-po,/src/scripts/{xzless,xzmore}.in] indent_style = tab indent_size = 8 [/src/scripts/{xzdiff,xzgrep}.in] indent_style = space indent_size = 2 [CMakeLists.txt,*.cmake] indent_style = space indent_size = 4 [*.vcxproj,xz_win.sln] charset = utf-8-bom --- Is it good enough or did I add bad bugs? :-) -- Lasse Collin
Re: [xz-devel] Multithreaded decompression for XZ Utils.
On 2021-11-06 Sebastian Andrzej Siewior wrote: > just spotted that Christmas is around the corner. I *think* that I've > been a good boy over the year. I plan to keep it that way just to be > sure. Not trying to push my luck here but what are my chances to find > parallel decompression in xz-utils under the christmas tree? You have been a very good boy indeed and I have been the opposite, still not having gotten this done. I don't want to give any odds, although there are reasons why the odds should be better than a month or two ago, but I will really try so that Santa can deliver a new alpha package. -- Lasse Collin
[xz-devel] XZ Utils 5.3.2alpha
XZ Utils 5.3.2alpha is available at <https://tukaani.org/xz/>. Here is an extract from the NEWS file: This release was made on short notice so that recent erofs-utils can be built with LZMA support without needing a snapshot from xz.git. Thus many pending things were not included, not even updated translations (which would need to be updated for the new --list strings anyway). * All fixes from 5.2.5. * xz: - When copying metadata from the source file to the destination file, don't try to set the group (GID) if it is already set correctly. This avoids a failure on OpenBSD (and possibly on a few other OSes) where files may get created so that their group doesn't belong to the user, and fchown(2) can fail even if it needs to do nothing. - The --keep option now accepts symlinks, hardlinks, and setuid, setgid, and sticky files. Previously this required using --force. - Split the long strings used in --list and --info-memory modes to make them much easier for translators. - If built with sandbox support and enabling the sandbox fails, xz will now immediately exit with exit status of 1. Previously it would only display a warning if -vv was used. - Cap --memlimit-compress to 2000 MiB on MIPS32 because on MIPS32 userspace processes are limited to 2 GiB of address space. * liblzma: - Added lzma_microlzma_encoder() and lzma_microlzma_decoder(). The API is in lzma/container.h. The MicroLZMA format is a raw LZMA stream (without end marker) whose first byte (always 0x00) has been replaced with bitwise-negation of the LZMA properties (lc/lp/pb). It was created for use in EROFS but may be used in other contexts as well where it is important to avoid wasting bytes for stream headers or footers. The format is also supported by XZ Embedded. The MicroLZMA encoder API in liblzma can compress into a fixed-sized output buffer so that as much data is compressed as can be fit into the buffer while still creating a valid MicroLZMA stream. This is needed for EROFS. - Added fuzzing support. 
- Support Intel Control-flow Enforcement Technology (CET) in 32-bit x86 assembly files. - Visual Studio: Use non-standard _MSVC_LANG to detect C++ standard version in the lzma.h API header. It's used to detect when "noexcept" can be used. * Scripts: - Fix exit status of xzdiff/xzcmp. Exit status could be 2 when the correct value is 1. - Fix exit status of xzgrep. - Detect corrupt .bz2 files in xzgrep. - Add zstd support to xzgrep and xzdiff/xzcmp. - Fix less(1) version detection in xzless. It failed if the version number from "less -V" contained a dot. * Fix typos and technical issues in man pages. * Build systems: - Windows: Fix building of resource files when config.h isn't used. CMake + Visual Studio can now build liblzma.dll. - Various fixes to the CMake support. It might still need a few more fixes even for liblzma-only builds. -- Lasse Collin
Re: [xz-devel] [PATCH] xz: Avoid fchown(2) failure.
On 2021-10-05 Alexander Bluhm wrote: > OpenBSD does not allow to change the group of a file if the user > does not belong to this group. In contrast to Linux, OpenBSD also > fails if the new group is the same as the old one. Do not call > fchown(2) in this case, it would change nothing anyway. Thanks! Committed. -- Lasse Collin
Re: [xz-devel] [PATCH] add xz arm64 bcj filter support
On 2021-09-02 Liao Hua wrote: > We have some questions about xz BCJ filters. > 1. Why are the ARM and ARM-Thumb BCJ filters little endian only? Perhaps it's an error. Long ago when I wrote the docs, I knew that the ARM filters worked on little endian code but didn't know how big endian ARM was done. If it always uses the same encoding for instructions, then the docs should be fixed. The same is likely true about PowerPC. > 2. Why is there no arm64 BCJ filter? Are there any technical risks? > Or other considerations? It just hasn't been done, no other reason. In general I haven't gotten much done in years and there even are a few patches (unrelated to BCJ) that have been waiting for my feedback for a very long time. :-( > We added arm64 BCJ filter support in our local xz code and it works OK. > We modified the Linux kernel code accordingly and used the new xz to > compress the kernel, and the kernel is decompressed successfully during > startup. > > The following is the patch for arm64 BCJ filter support, which is > based on xz 5.2.5. Thanks! > + // arm64 bl instruction: 0x94 and 0x97; > + if (buffer[i + 3] == 0x94 || buffer[i + 3] == 0x97) { The "bl" instruction takes a signed 26-bit immediate value that encodes the offset as a multiple of four bytes. The above matches only when the two highest bits are either 00 or 11. Is it intentional that it ignores immediate values with the highest bits 01 and 10? Ignoring 01 (offset > 64 MiB) and 10 (offset < -64 MiB) results in fewer false matches when the filter is applied to non-code data. Also, perhaps such offsets aren't so common in actual code (they can appear in big binaries only). If false matches are an issue, it might even make sense to reduce the range further (+/-32 MiB would be the same as on 32-bit ARM): for (i = 0; i + 4 <= size; i += 4) { const uint32_t instr = read32le(buffer + i); const uint32_t x = instr & 0xFF800000; if (x == 0x94000000 || x == 0x97800000) { ...
It's not obvious what is better so it would be good to test with a few types of files (kernel image, and a few GNU/Linux distro packages containing both executable and data files). Also, the way the two highest bits are ignored means that the sign bit isn't taken into account when doing the conversion. The calculation of "dest" will never flip the sign bit(s) (0x94 to 0x97 or vice versa) when the addition/subtraction wraps around. Maybe it doesn't matter much in practice. Have you tested if instructions other than "bl" could be worth converting too? The unconditional branch instruction "b" is the most obvious candidate to try (0x14 instead of 0x94). I don't expect much but at this point it is easy to test. It's possible that it depends too much on what kind of code the input file has (it might help with some files and be harmful with many others). Since this is a new filter, I would like to avoid a problem that other BCJ filters have: Linux kernel modules, static libraries and such files have the address part in the instructions filled with zeroes (correct values will be set when the file is linked). For example, if you run "objdump -d" on a x86-64 Linux module, there are lots of "call" instructions encoded as "e8 00 00 00 00". I haven't checked if this is similar on ARM64 but it sounds likely. The existing BCJ filters make compression worse with these files. The correct action would be to do nothing with zeroed addresses: if (src == 0) continue; However, the encoder has to avoid conversions that would result in a zero that the decoder would ignore. On the other hand, the decoder will never need to decode a non-zero input value to a zero. These special cases can be used together. Untested code: if (src == 0) continue; src <<= 2; const uint32_t pc = now_pos + (uint32_t)(i); uint32_t dest = is_encoder ? src + pc : src - pc; // The mask assumes that only 24 bits of the 26-bit immediate // are used. if ((dest & 0x3FFFFFC) == 0) { assert((pc & 0x3FFFFFC) != 0); dest = is_encoder ?
pc : 0U - pc; } dest >>= 2; The "start=offset" option probably could be omitted. It's quite useless inside .xz. XZ Embedded doesn't support it anyway. Once a filter is ready, I will need to discuss it with Igor Pavlov (the 7-Zip's developer) too, and add the new filter ID to the official .xz specification. -- Lasse Collin
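The interaction between the zero-skip rule and the wrap-around special case above can be illustrated with a small round-trip model. This is hypothetical code, not the actual filter: it masks the full 26-bit immediate instead of the partial mask discussed in the email, and the names `Arm64Sketch`, `convert` and `IMM_MASK` are invented for the example. The property being demonstrated is that encoding never produces 0 (which the decoder would skip) and decoding inverts encoding for every non-zero immediate, provided (pc & IMM_MASK) != 0.

```java
// Hypothetical round-trip model of the zero-skip + wrap-around rules.
// Unlike the email's snippet, the full 26-bit immediate is checked
// instead of a partial mask, to keep the invariant easy to verify.
final class Arm64Sketch {
    static final int IMM_MASK = 0x03FFFFFF; // 26-bit branch immediate

    static int convert(int src, int pc, boolean isEncoder) {
        if (src == 0)
            return 0; // zeroed (unlinked) address: leave untouched

        int dest = (isEncoder ? src + pc : src - pc) & IMM_MASK;
        if (dest == 0) {
            // The encoder must not emit 0 because the decoder would
            // skip it. A zero here can only come from src == -pc, so
            // that one value is given a substitute encoding and the
            // decoder mirrors the substitution.
            dest = (isEncoder ? pc : -pc) & IMM_MASK;
        }
        return dest;
    }

    public static void main(String[] args) {
        int pc = 0x1234; // (pc & IMM_MASK) != 0 is required
        int[] samples = { 1, 0x100, (-pc) & IMM_MASK, IMM_MASK };
        for (int src : samples) {
            int enc = convert(src, pc, true);
            if (enc == 0 || convert(enc, pc, false) != src)
                throw new AssertionError("round trip failed for " + src);
        }
        System.out.println("round trip ok");
    }
}
```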
Re: [xz-devel] [PATCH v3] liblzma: Add multi-threaded decoder
Hello!

On 2021-07-20 Guillem Jover wrote:
> I've only skimmed very quickly over the patch, but I've been running
> it on my system in addition to a locally modified dpkg that uses this
> new support, and it seems to be working great. :)

Great to hear, thanks! :-) Unfortunately I don't have any news. :-(

-- Lasse Collin
Re: [xz-devel] Go/Golang bindings for xz
Hello! On 2021-04-12 James Fennell wrote: > Over the last couple of weeks I've been working on a project to add > Go bindings for the xz format: https://github.com/jamespfennell/xz :-) > The project uses the Go technology cgo to compile the relevant > liblzma C files automatically and link them in with the Go binary. That made me wonder about config.h and the #defines. With a really quick look I found https://github.com/jamespfennell/xz/blob/main/lzma/lzma.go which sets a few #defines but it's quite limited, for example, a comment tells that only 64-bit systems are supported. I also don't see TUKLIB_FAST_UNALIGNED_ACCESS which is good on 32/64-bit x86 and some ARMs to get a little better encoder performance. Also #define TUKLIB_SYMBOL_PREFIX lzma_ could be good to have to ensure that all symbols begin with "lzma_". Of course these don't matter if the system liblzma is used instead. I understood that it's an option too. > Lasse, would you be interested in adding a link under the bindings > section of the xz website? I can. Since there are other bindings to use liblzma, I wonder if some of those should be listed too. What do you think? I have no Go experience so I have no idea which are good or already popular. Thanks! -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Status of man page translations?
On 2021-04-15 Mario Blättermann wrote: > Am So., 11. Apr. 2021 um 20:48 Uhr schrieb Lasse Collin > : > > I suppose I can just submit a snapshot from the master branch. I have done this. > I am curious to see when the first new translations will arrive :) Me too. It's a lot of work to translate them all. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Status of man page translations?
On 2021-04-04 Mario Blättermann wrote:
> But what is the blocker which still prevents you from creating an
> intermediate tarball, sending it to the TP coordinator and telling
> him to create a new domain named "xz-man"?

I suppose I had forgotten it. If there were other reasons, I have forgotten them too. Sorry. I suppose I can just submit a snapshot from the master branch.

xz-man.pot is compatible with v5.2 for now. xz.pot isn't compatible between the branches though, but if 5.2.6 is needed (impossible to know now) maybe it's not that bad: the command line tool translations in v5.2 have strings that are difficult to get right. The master branch has such strings too but not as many. For the 5.2.5 release, many translations didn't pass basic quality control due to these strings. Some translators (individuals or teams) replied to my emails about suggested white-space corrections, some didn't. Thus multiple translations were omitted from 5.2.5. With this background I feel that if 5.2.6 is needed I won't consider any *new* xz.po files for it anyway; new xz-man.po languages would be fine.

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] [PATCH] Reduce maximum possible memory limit on MIPS32
On 2021-04-09 Vitaly Chikunov wrote: > From: "Ivan A. Melnikov" > > Due to architectural limitations, address space available to a single > userspace process on MIPS32 is limited to 2 GiB, not 4, even on > systems that have more physical RAM -- e.g. 64-bit systems with 32-bit > userspace, or systems that use XPA (an extension similar to x86's > PAE). > > So, for MIPS32, we have to impose stronger memory limits. I've chosen > 2000MiB to give the process some headroom. Thanks! Committed. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
[xz-devel] XZ for Java 1.9
XZ for Java 1.9 is available at <https://tukaani.org/xz/java.html> and in the Maven Central (groupId = org.tukaani, artifactId = xz). Here is an extract from the NEWS file: * Add LZMAInputStream.enableRelaxedEndCondition(). It allows decompression of LZMA streams whose uncompressed size is known but it is unknown if the end of stream marker is present. This method is meant to be useful in Apache Commons Compress to support .7z files created by certain very old 7-Zip versions. Such files have the end of stream marker in the LZMA data even though the uncompressed size is known. 7-Zip supports such files and thus other implementations of the .7z format should support them too. * Make LZMA/LZMA2 decompression faster. With files that compress extremely well the performance can be a lot better but with more typical files the improvement is minor. * Make the CRC64 code faster. * Add module-info.java as multi-release JAR. The attribute Automatic-Module-Name was removed. * The binaries for XZ for Java 1.9 in the Maven Central now require Java 7. Building the package requires at least Java 9 for module-info support but otherwise the code should still be Java 5 compatible (see README and comments in build.properties). -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-03-01 Brett Okken wrote:
> > One thing that confuses me in your version is the special handling
> > of the first byte:
> >
> >     buf[pos++] = buf[back++];
> >     --left;
> >
> > If there are two bytes to copy, then one will be copied above and
> > the other with arraycopy later. If there are more bytes to copy and
> > distance is very small, incrementing "back" above can mean that an
> > extra arraycopy call might be needed in the loop because the first
> > copy will be one byte smaller.
> >
> > I understand that it might help when there is just one byte to
> > repeat because then the while-loop will be skipped. In all other
> > situations it sounds like the special handling of the first byte
> > would in theory be harmful. Note that I don't doubt your test
> > results; I already saw with the CRC64 code that some changes in the
> > code can affect performance in weird ways.
>
> The image1.dcm is the most impacted by this optimization. Again, this
> file is basically a large greyscale bmp. This results in a significant
> number of single byte repeats. Optimizing for the single byte improves
> performance in that file by 3-5%, while having smaller effects on the
> other 2 files (ihe_ovly_pr.dcm slightly slower, large.xml slightly
> faster)

OK, that is an interesting test case.

> I agree your approach is more readable. From your version of it, I was
> expecting that simplicity in reading to translate into better
> performance. This latest version actually does appear to do that. The
> image1.dcm performance matches my version and the other 2 are a bit
> faster. Adding the single byte optimization still speeds up image1.dcm
> (~8ms, ~2%) and large.xml (~3ms, 2%), while slowing ihe_ovly_pr.dcm
> (~.008ms, ~1%).

[...]

> Version 3 is better for all 3 files.

With these results I now plan to include version 3 in the next release. It sounds like the single-byte optimization has a fairly small effect. Omitting it keeps the code a tiny bit simpler.
I have committed the change. I think xz-java.git should now be almost ready for a release. I just need to add NEWS and bump the version number. Thanks for your help! -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-13 Brett Okken wrote:
> On Thu, Feb 11, 2021 at 12:51 PM Lasse Collin wrote:
> > I still worry about short copies. If the file is full of tiny
> > matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such
> > files aren't typical at all but I don't want to add a corner case
> > where the performance drops too much.
>
> Do you have examples of such files, or code on how to generate one?

Use the patch below and compress with this:

    java -jar build/jar/XZEncDemo.jar 2 < infile > outfile.xz

Adjust LIMIT to get longer matches.

diff --git a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
index f8230ee..cd92ca6 100644
--- a/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
+++ b/src/org/tukaani/xz/lzma/LZMAEncoderFast.java
@@ -44,6 +44,8 @@ final class LZMAEncoderFast extends LZMAEncoder {
         return smallDist < (bigDist >>> 7);
     }
 
+    private static final int LIMIT = 2;
+
     int getNextSymbol() {
         // Get the matches for the next byte unless readAhead indicates
         // that we already got the new matches during the previous call
@@ -66,11 +68,13 @@
         int bestRepIndex = 0;
         for (int rep = 0; rep < REPS; ++rep) {
             int len = lz.getMatchLen(reps[rep], avail);
+            if (len > LIMIT)
+                len = LIMIT;
             if (len < MATCH_LEN_MIN)
                 continue;
 
             // If it is long enough, return it.
-            if (len >= niceLen) {
+            if (len >= LIMIT) {
                 back = rep;
                 skip(len - 1);
                 return len;
@@ -88,9 +92,11 @@
         if (matches.count > 0) {
             mainLen = matches.len[matches.count - 1];
+            if (mainLen > LIMIT)
+                mainLen = LIMIT;
             mainDist = matches.dist[matches.count - 1];
 
-            if (mainLen >= niceLen) {
+            if (mainLen >= LIMIT) {
                 back = mainDist + REPS;
                 skip(mainLen - 1);
                 return mainLen;

With a quick try I got a feeling that my worry about short repeats was wrong. It doesn't matter because decoding each LZMA symbol is much more expensive.
What matters is avoiding multiple tiny arraycopy calls within a single run of the repeat method, and that problem was already solved.

> > I came up with the following. I haven't decided yet if I like it.
>
> On the 3 files I have been testing with, this change is a mixed bag.
> Compared to trunk 1 regresses by ~8%. While the other 2 do improve,
> neither are better than my last patch.

OK, thanks. So it isn't great. I wonder which details make the difference.

One thing that confuses me in your version is the special handling of the first byte:

    buf[pos++] = buf[back++];
    --left;

If there are two bytes to copy, then one will be copied above and the other with arraycopy later. If there are more bytes to copy and distance is very small, incrementing "back" above can mean that an extra arraycopy call might be needed in the loop because the first copy will be one byte smaller.

I understand that it might help when there is just one byte to repeat because then the while-loop will be skipped. In all other situations it sounds like the special handling of the first byte would in theory be harmful. Note that I don't doubt your test results; I already saw with the CRC64 code that some changes in the code can affect performance in weird ways.

Your code needs

    if (back == bufSize)
        back = 0;

in the beginning of the while-loop and later checking for tmp > 0. My version avoids these branches by handling those cases under "if (back < 0)" (which is equivalent to "if (dist >= pos)"). On the other hand, under "if (back < 0)" all copies, including tiny copies, are done with arraycopy. Another tiny difference is that your code uses a left shift to double the copy size in the loop while I used Math.min(pos - back, left).

> I was able to improve this a bit by pulling the handling of small
> copies outside of the while loop. This eliminates the regressions
> compared to trunk, but still does not feel like an improvement over my
> last patch.

Yeah, the switch isn't worth it.
If I understand it correctly now, trying to avoid arraycopy for the tiny copies wasn't a useful idea in the first place. So the code can be simplified ("version 3"):

    int back = pos - dist - 1;
    if (back < 0) {
        // The distance wraps around to the end of the cyclic dictionary
        // buffer. We cannot get here if the dictionary isn't full.
        assert full == bufSize;
        back += bufSize;

        // Here we
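The snippet above is cut off in the archive. For reference, the technique it describes can be sketched as a self-contained class. This is illustrative code, not the committed XZ for Java implementation: the class name and the putByte helper are invented for the example, and the real LZDecoder's per-call output limit and pending-copy handling are omitted. The point is the copy-size doubling: because "back" is never advanced, each arraycopy can copy up to twice as many bytes as the previous one.

```java
// Sketch of "version 3": cyclic-dictionary repeat using arraycopy
// with doubling copy sizes. Names are illustrative; the real
// LZDecoder also tracks a per-call output limit, omitted here.
final class CyclicDictSketch {
    final byte[] buf;
    int pos = 0;  // next write position
    int full = 0; // number of valid bytes in the dictionary

    CyclicDictSketch(int size) { buf = new byte[size]; }

    void putByte(byte b) {
        buf[pos++] = b;
        if (full < pos)
            full = pos;
    }

    void repeat(int dist, int len) {
        if (dist < 0 || dist >= full)
            throw new IllegalArgumentException("invalid distance");

        int left = len;
        int back = pos - dist - 1;
        if (back < 0) {
            // The distance wraps around to the end of the cyclic
            // buffer. At most dist + 1 bytes are copied here, so the
            // copy never overlaps its own output and arraycopy is
            // always safe.
            back += buf.length;
            int copySize = Math.min(buf.length - back, left);
            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            back = 0;
            left -= copySize;
        }

        while (left > 0) {
            // The source range [back, back + copySize) ends exactly
            // where the destination starts, so the ranges never
            // overlap. Because "back" is not advanced, the available
            // source, and thus the copy size, doubles each iteration.
            int copySize = Math.min(left, pos - back);
            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            left -= copySize;
        }

        if (full < pos)
            full = pos;
    }
}
```

For example, writing "ab" and then calling repeat(1, 10) extends the buffer to "abababababab" in three arraycopy calls of 2, 4, and 4 bytes.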
Re: [xz-devel] xz-java and newer java
I quickly tried these with "XZEncDemo 2". I used the preset 2 because that uses LZMAEncoderFast instead of LZMAEncoderNormal where the negative lengths result in a crash. The performance was about the same or worse than the original code. I don't know why. I didn't spend much time on this and it's possible that I messed up something. One thing that may be worth checking out is how in HC4.java (and BT4.java too) the patch doesn't try to quickly skip too short matches like the original code does. I suppose the first set of patches should be such that they only replace the byte-by-byte loops with a function call to make comparison as fair as possible. These patches won't get into XZ for Java 1.9 but might be in a later version if I see them being/becoming good. The only remaining patch that might get into 1.9 is LZDecoder.repeat improvements. When you post a patch or other code, please make sure that word-wrapping is disabled in the email client or use attachments. Thanks! -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java array cache fill
On 2021-02-16 Brett Okken wrote:
> We found in LZDecoder that using System.arraycopy with doubling size
> is faster than Arrays.fill (especially for larger arrays). We can
> apply that knowledge in the BasicArrayCache, where there are some use
> cases which require clearing out the array prior to returning it.

A simple micro-benchmark gives me a very different result. The alternative method is roughly 70 % slower than Arrays.fill on my system with a big array. If Arrays.fill were so terrible, it should be improved instead. Even if the alternative method were faster, it would need to be a lot faster to be worth the extra complexity.

If the Arrays.fill version (uncomment/comment the code) is slower for you, it must depend on the Java runtime or operating system or such things.

    import java.util.Arrays;

    public class Foo {
        public static void main(String[] args) throws Exception {
            byte[] buf = new byte[10 << 20];

            for (int i = 0; i < 4000; ++i) {
                //Arrays.fill(buf, (byte)0);
                buf[0] = (byte)0;
                buf[1] = (byte)0;
                buf[2] = (byte)0;
                buf[3] = (byte)0;
                int toCopy = 4;
                int remaining = buf.length - toCopy;
                do {
                    System.arraycopy(buf, 0, buf, toCopy, toCopy);
                    remaining -= toCopy;
                    toCopy <<= 1;
                } while (remaining >= toCopy);
                if (remaining != 0) {
                    System.arraycopy(buf, 0, buf, toCopy, remaining);
                }
            }
        }
    }

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] jdk9+ CRC64
On 2021-02-13 Brett Okken wrote: > We can make it look even more like liblzma :) It can be done but I'm not sure yet if it should be done. Your implementation looks very neat though. :-) > In my benchmark I observe no negative impact of using the functions. > Which is to say that this is still 5-7% faster than the byte-by-byte > approach. With a dumb test with XZDecDemo, it seems faster than the current code (8.5 s vs. 7.9 s). However, if I misalign the buffer in XZDecDemo.java like this int size; while ((size = in.read(buf, 1, 8191)) != -1) System.out.write(buf, 1, size); then both versions are about as fast (7.9 s). The weird behavior with misaligned buffers was discussed earlier. My point is that if tiny things like buffer alignment can make as big a difference as supposedly better code, perhaps the explanation for the speed difference isn't the code being better but some side-effect that I don't understand. On your systems the results might differ significantly and more information is welcome. With the current information I think the possible benefit of the fancier code isn't worth it (bigger xz.jar, more code to maintain). In any case, any further CRC64 improvements will need to wait past the 1.9 release. The test file I used contains a repeating 257-byte pattern where each 8-bit value occurs at least once. It is extremely compressible and thus makes the differences in CRC64 speed as big as they can be with LZMA2. With real files the differences are smaller. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
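The repeating 257-byte test pattern mentioned above is easy to reproduce. The email does not give the exact pattern, so the construction below is an assumption for illustration: cycling i % 257 yields a 257-byte period (257 being prime, it never lines up with 256-value byte cycles) in which every 8-bit value occurs at least once.

```java
// One plausible construction of a repeating 257-byte test pattern in
// which every 8-bit value occurs at least once. This is a guess at
// the pattern described in the email, not the actual test file.
final class Pattern257 {
    static byte[] generate(int totalLen) {
        byte[] out = new byte[totalLen];
        for (int i = 0; i < totalLen; ++i)
            out[i] = (byte) (i % 257); // the value 256 wraps to 0
        return out;
    }
}
```

Because the data is a short repeating pattern, it is extremely compressible, so (as noted above) the CRC64 share of the total runtime is as large as it can get with LZMA2.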
Re: [xz-devel] Compatibility between CMake config file and FindLibLZMA.cmake
> I think the CMake build files also were not yet included in any
> official release.

CMakeLists.txt and friends were included in XZ Utils 5.2.5 (with the bug that the shared library doesn't build on Windows). It's described as experimental so in that sense it could be OK to change things.

> You can add an alias for target "liblzma" to target "LibLZMA" in the
> CMakeLists.txt file (after the target definition in add_library, line
> 193) for users that embed the xz project as a subdirectory:
> add_library(LibLZMA::LibLZMA ALIAS LibLZMA)
> add_library(liblzma ALIAS LibLZMA::LibLZMA)
> add_library(liblzma::liblzma ALIAS LibLZMA::LibLZMA)

If I change the main add_library(liblzma ) to add_library(LibLZMA ) then the filename will be LibLZMA.something too. That isn't good because then one cannot replace a CMake-built shared liblzma with an Autotools-built one on operating systems where file and library names are case sensitive.

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-05 Brett Okken wrote:
> I worked this out last night. We need to double how much we copy each
> time by not advancing "back". This actually works even better than
> Arrays.fill for the single byte case also.

This clearly is a good idea in a Java implementation. :-)

I still worry about short copies. If the file is full of tiny matches/repeats of 1-3 bytes or so, arraycopy can be slower. Such files aren't typical at all but I don't want to add a corner case where the performance drops too much.

I came up with the following. I haven't decided yet if I like it.

    public void repeat(int dist, int len) throws IOException {
        if (dist < 0 || dist >= full)
            throw new CorruptedInputException();

        int left = Math.min(limit - pos, len);
        pendingLen = len - left;
        pendingDist = dist;

        int back = pos - dist - 1;
        if (back < 0) {
            // We won't get here if the dictionary isn't full.
            assert full == bufSize;

            // The distance wraps around to the end of the cyclic dictionary
            // buffer. Here we will never copy more than dist + 1 bytes
            // and so the copying won't repeat from its own output. Thus,
            // we can always use arraycopy safely.
            back += bufSize;
            int copySize = Math.min(bufSize - back, left);
            assert copySize <= dist + 1;

            System.arraycopy(buf, back, buf, pos, copySize);
            pos += copySize;
            back = 0;
            left -= copySize;

            if (left == 0)
                return;
        }

        assert back < pos;
        assert left > 0;

        do {
            // Determine the number of bytes to copy on this loop iteration:
            // copySize is set so that the source and destination ranges
            // don't overlap. If "left" is large enough, the destination
            // range will start right after the last byte of the source
            // range. This way we don't need to advance "back" which
            // allows the next iteration of this loop to copy (up to)
            // twice the number of bytes.
            int copySize = Math.min(left, pos - back);

            // With tiny copy sizes arraycopy is slower than a byte-by-byte
            // loop. With typical files the difference is tiny but with
            // unusual files this can matter more.
            if (copySize < 4) {
                int i = 0;
                do {
                    buf[pos + i] = buf[back + i];
                } while (++i < copySize);
            } else {
                System.arraycopy(buf, back, buf, pos, copySize);
            }

            pos += copySize;
            left -= copySize;
        } while (left > 0);

        if (full < pos)
            full = pos;
    }

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] jdk9+ CRC64
On 2021-02-06 Brett Okken wrote: > Since it is quite easy to read an int from a byte[] in jdk 9, the > CRC64 implementation can be optimized to operate on an int rather than > byte by byte as part of a multi-release jar. This shows to be 5-7% > faster in a microbenchmark of just the crc64 calculation. In jdk 11 it > speeds up the decompression of the repeating single byte by ~1%. To avoid byte swapping in the main loop on big endian systems, the lookup table would need to be big endian and operations need to be bitwise-mirrored too just like in liblzma. I'm not convinced yet that it's worth the extra effort and complexity for such a small speed gain. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
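Reading an int at a time implies a defined byte order. A minimal hand-rolled sketch of the little-endian read such a loop would consume is below (the class and method names are invented for the example; on JDK 9+ a byte-array view VarHandle with an explicit ByteOrder can do the same):

```java
// Little-endian 32-bit read from a byte[], written out by hand so the
// byte order is explicit. On big-endian hardware, a byte-order-aware
// (bitwise-mirrored) table as in liblzma would avoid swapping the
// bytes in the main loop.
final class ByteReader {
    static int readIntLE(byte[] b, int off) {
        return (b[off] & 0xFF)
                | ((b[off + 1] & 0xFF) << 8)
                | ((b[off + 2] & 0xFF) << 16)
                | (b[off + 3] << 24);
    }
}
```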
Re: [xz-devel] java LZMA2OutputStream changes
On 2021-02-05 Brett Okken wrote:
> > > Now that there is a 6 byte chunkHeader, could the 1 byte tempBuf
> > > be removed?
> >
> > It's better to keep it. It would be confusing to use the same
> > buffer in write(int) and writeChunk(). At a glance it would look
> > like writeChunk() could be overwriting the input.
>
> I assumed that lz.fillWindow(buf, off, len); would always process the
> 1 byte.

Yes, but it's not immediately obvious to a new reader. Also, many other classes have tempBuf for identical use so it's good to keep that pattern consistent.

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java crc64 implementation
On 2021-02-05 Brett Okken wrote: > On Fri, Feb 5, 2021 at 11:07 AM Lasse Collin > wrote: > > Also, does it really help to unroll the loop? With 8191-byte > > buffers I see no significant difference (in a quick > > not-very-accurate test) if the switch-statement is replaced with a > > while-loop. > > The differences are pretty minimal. My observation was switch a bit > faster than for loop, which was a bit faster than a while loop. But > the differences in averages were less than the confidence interval for > the given tests. OK, smaller code wins then. > > With these two changes the code becomes functionally identical to > > the version I posted with the name "Modified slicing-by-4". Is that > > an OK version to commit? > > Yes. OK. > > Is the following fine to you as the file header? Your email address > > can be omitted if you prefer that. I will mention in the commit > > message that you adapted the code from XZ Utils and benchmarked it. > > > > /* > > * CRC64 > > * > > * Authors: Brett Okken > > * Lasse Collin > > * > > * This file has been put into the public domain. > > * You can do whatever you want with this file. > > */ > > That is fine. You can include my e-mail. OK. :-) I have committed it. Thank you! The LZDecoder changes I may still look at before the next release. Then I will go back to the XZ Utils code. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java LZMA2OutputStream changes
On 2021-02-05 Brett Okken wrote:
> After recent changes, the LZMA2OutputStream class no longer uses
> DataOutputStream, but the import statement is still present.

Fixed. Thanks!

> Now that there is a 6 byte chunkHeader, could the 1 byte tempBuf be
> removed?

It's better to keep it. It would be confusing to use the same buffer in write(int) and writeChunk(). At a glance it would look like writeChunk() could be overwriting the input.

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] java crc64 implementation
On 2021-02-02 Brett Okken wrote:
> Thus far I have only tested on jdk 11 64bit windows, but the fairly
> clear winner is:
>
>     public void update(byte[] buf, int off, int len) {
>         final int end = off + len;
>         int i = off;
>         if (len > 3) {
>             switch (i & 3) {
>             case 3:
>                 crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
>             case 2:
>                 crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
>             case 1:
>                 crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
>             }

To ensure (i & 3) == 0 when entering the main loop, the case-labels should be 1-2-3, not 3-2-1. This may have messed up your tests. :-( With a very quick test I didn't see much difference if I changed the case-label order.

On 2021-02-02 Brett Okken wrote:
> I tested jdk 15 64bit and jdk 11 32bit, client and server and the
> above implementation is consistently quite good. The alternate in
> running does not do the leading alignment. This version is really
> close in 64 bit testing and slightly faster for 32 bit. The
> differences are pretty small, and both are noticeably better than my
> original proposal (and all 3 are significantly faster than current).
> I think I would lean towards the simplicity of not doing the leading
> alignment, but I do not have a strong opinion.

Let's go with the simpler option.

>         switch (len & 3) {
>         case 3:
>             crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);

I suppose this should use the same (faster) array indexing style as the main loop:

    crc = TABLE[0][(buf[off++] & 0xFF) ^ ((int) crc & 0xFF)] ^ (crc >>> 8);

Also, does it really help to unroll the loop? With 8191-byte buffers I see no significant difference (in a quick not-very-accurate test) if the switch-statement is replaced with a while-loop.

With these two changes the code becomes functionally identical to the version I posted with the name "Modified slicing-by-4". Is that an OK version to commit?

Is the following fine to you as the file header? Your email address can be omitted if you prefer that. I will mention in the commit message that you adapted the code from XZ Utils and benchmarked it.

    /*
     * CRC64
     *
     * Authors: Brett Okken
     *          Lasse Collin
     *
     * This file has been put into the public domain.
     * You can do whatever you want with this file.
     */

Thanks!

-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
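For readers following along, the overall shape of a slicing-by-4 CRC64 can be sketched as a self-contained class. This is illustrative code reconstructed from the discussion, not the committed XZ for Java file: it assumes the standard CRC-64/XZ parameters (reflected polynomial 0xC96C5795D7870F42, initial value and final XOR of all ones) and the common slicing-by-4 table construction.

```java
// Illustrative slicing-by-4 CRC-64/XZ sketch, reconstructed from the
// discussion above; not the committed XZ for Java code.
final class Crc64Sketch {
    private static final long POLY = 0xC96C5795D7870F42L; // reflected
    private static final long[][] TABLE = new long[4][256];

    static {
        for (int i = 0; i < 256; ++i) {
            long r = i;
            for (int j = 0; j < 8; ++j)
                r = (r >>> 1) ^ (POLY & -(r & 1));
            TABLE[0][i] = r;
        }
        // TABLE[n][b] is the CRC state after byte b and n zero bytes.
        for (int n = 1; n < 4; ++n)
            for (int i = 0; i < 256; ++i)
                TABLE[n][i] = (TABLE[n - 1][i] >>> 8)
                        ^ TABLE[0][(int) TABLE[n - 1][i] & 0xFF];
    }

    private long crc = -1L; // initial value: all ones

    void update(byte[] buf, int off, int len) {
        int i = off;
        final int end = off + len;

        // Main loop: fold four input bytes at a time into the CRC.
        while (end - i >= 4) {
            int w = ((buf[i] & 0xFF)
                    | ((buf[i + 1] & 0xFF) << 8)
                    | ((buf[i + 2] & 0xFF) << 16)
                    | (buf[i + 3] << 24)) ^ (int) crc;
            crc = TABLE[3][w & 0xFF]
                    ^ TABLE[2][(w >>> 8) & 0xFF]
                    ^ TABLE[1][(w >>> 16) & 0xFF]
                    ^ TABLE[0][w >>> 24]
                    ^ (crc >>> 32);
            i += 4;
        }

        // Trailing 0-3 bytes, byte by byte (a plain loop rather than
        // the unrolled switch, as chosen in the thread).
        while (i < end)
            crc = TABLE[0][(buf[i++] ^ (int) crc) & 0xFF] ^ (crc >>> 8);
    }

    long getValue() {
        return ~crc; // final XOR: all ones
    }
}
```

With these parameters the standard check input "123456789" should give 0x995DC9BBDF1939FA, which makes the sketch easy to verify against other CRC-64/XZ implementations.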
Re: [xz-devel] xz-java minor read improvements
On 2021-02-03 Brett Okken wrote: > I have not done any testing of xz specifically, but was motivated by > https://github.com/openjdk/jdk/pull/542, which showed pretty > noticeable slowdown when biased locking is removed. The specific > example there was writing 1 byte at a time being transitioned to > writing the 2-8 bytes to a byte[] first, then writing that buffer to > the OutputStream. I suspect that reading would have similar impact. I don't doubt that. However, in XZ the uses of ByteArrayInputStream and ByteArrayOutputStream are in places where the performance could be absolutely horrible and it would still make little difference in overall speed. The amounts of data being read or written are so small. LZMAInputStream reads the whole file one byte at a time (via RangeDecoderFromStream.normalize()) and performance suffers compared to XZInputStream even if one uses BufferedInputStream. BufferedInputStream has synchronized read(). I don't know how much locking matters in this case. I'm not curious enough to try with a non-synchronized buffered input stream now. There are related comments in the "java buffer writes" thread. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
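For anyone who does want to try the experiment, an unsynchronized buffered wrapper is only a few lines. This is a hypothetical sketch, not anything in XZ for Java; it supports only the single-byte read() that RangeDecoderFromStream-style decoding needs (no mark/reset, no bulk read).

```java
import java.io.IOException;
import java.io.InputStream;

// Hypothetical minimal unsynchronized buffered InputStream, sketched
// for the "non-synchronized buffered input stream" experiment
// mentioned above. Only read() and close() are implemented.
final class UnsyncBufferedInputStream extends InputStream {
    private final InputStream in;
    private final byte[] buf;
    private int pos = 0;
    private int end = 0;

    UnsyncBufferedInputStream(InputStream in, int bufSize) {
        this.in = in;
        this.buf = new byte[bufSize];
    }

    @Override
    public int read() throws IOException {
        if (pos >= end) {
            int n = in.read(buf, 0, buf.length);
            if (n <= 0)
                return -1; // end of stream; state left so we retry
            pos = 0;
            end = n;
        }
        return buf[pos++] & 0xFF;
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}
```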
Re: [xz-devel] java buffer writes
On 2021-01-29 Brett Okken wrote: > There are several places where single byte writes are being done > during compression. Often this is going to an OutputStream with > synchronized write methods. Historically that has not mattered much > because of biased locking. However, biased locking is being > removed[1]. These changes will batch those writes up to a small > buffer. LZMA2OutputStream: I have committed a functionally similar patch. Thanks! BlockOutputStream: The ByteBuffer code replacing ByteArrayOutputStream is more complex than the original code. For example, manually resizing a buffer may be useful when performance is important but in this class performance doesn't matter. IndexEncoder: If there were a huge number of Blocks and thus Records, it would allocate memory to hold them all. It could be nicer to use something similar to BufferedOutputStream which would always use the same small amount of memory. java.io.BufferedOutputStream cannot be used because its close() and flush() methods call flush() on the underlying output stream and here it's counter-productive. The reading side in IndexDecoder and IndexHash could be similarly optimized to use a buffered input class that takes an argument to limit how many bytes it may read from the underlying InputStream. If the Index* classes are optimized, then the CRC32 writing in XZOutputStream, IndexEncoder, and BlockOutputStream may be worth optimizing too. It's important to keep in mind that these make no real difference if the application buffers the input or output with BufferedInputStream or BufferedOutputStream. In some use cases it may be impractical though, and then the small reads and writes may hurt if each read/write results in a syscall or even sending packets over network; such overheads can be much larger than locking. I put these optimizations in the "nice to have" category. Something could be done to make the code better but it's not urgent and so these won't be in the next release. 
-- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
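The buffered output class described in the email above, one whose flush() empties its own buffer without propagating flush() to the underlying stream, could look roughly like this. It is a hypothetical sketch, not part of XZ for Java, and only the single-byte write path is shown.

```java
import java.io.IOException;
import java.io.OutputStream;

// Hypothetical buffered OutputStream whose flush() passes buffered
// bytes downstream but never calls flush() on the underlying stream,
// matching the behavior wished for in the email above.
final class NonFlushingBufferedOutputStream extends OutputStream {
    private final OutputStream out;
    private final byte[] buf;
    private int count = 0;

    NonFlushingBufferedOutputStream(OutputStream out, int bufSize) {
        this.out = out;
        this.buf = new byte[bufSize];
    }

    @Override
    public void write(int b) throws IOException {
        if (count == buf.length)
            writeBuffered();
        buf[count++] = (byte) b;
    }

    private void writeBuffered() throws IOException {
        if (count > 0) {
            out.write(buf, 0, count);
            count = 0;
        }
    }

    @Override
    public void flush() throws IOException {
        // Empty our own buffer but do NOT call out.flush(); that is
        // the difference from java.io.BufferedOutputStream.
        writeBuffered();
    }

    @Override
    public void close() throws IOException {
        writeBuffered();
        out.close();
    }
}
```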
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-03 Brett Okken wrote: > On Wed, Feb 3, 2021 at 2:56 PM Lasse Collin > wrote: > > It seems to regress horribly if dist is zero. A file with a very > > long sequence of the same byte is good for testing. > > Would this be a valid test of what you are describing? [...] > The source is effectively 160MB of the same byte value. Yes, it's fine. > I found a strange bit of behavior with this case in the compression. > In LZMAEncoderNormal.calcLongRepPrices, I am seeing a case where > > int len2Limit = Math.min(niceLen, avail - len - 1); > > results in -1, (avail and len are both 8). This results in calling > LZEncoder.getMatchLen with a lenLimit of -1. Is that expected? I didn't check in detail now, but I think it's expected. There are two such places. A speed optimization was forgotten in liblzma from these two places because of this detail. I finally remembered to add the optimization in 5.2.5. On 2021-02-03 Brett Okken wrote: > I still need to do more testing across jdk 8 and 15, but initial > returns on this are pretty positive. The repeating byte file is > meaningfully faster than baseline. One of my test files (image1.dcm) > does not improve much from baseline, but the other 2 files do. The repeating byte is indeed much faster than the baseline. With normal files the speed seems to be about the same as the version I posted, so a minor improvement over the baseline. With a file with two-byte repeat ("ababababababab"...) it's 50 % slower than the baseline. Calling arraycopy in a loop, copying two bytes at a time, is not efficient. I didn't try look how big the copy needs to be to make the overhead of arraycopy smaller than the benefit but clearly it needs to be bigger than two bytes. The use of Arrays.fill to optimize the case of one repeating byte looks useful especially if it won't hurt performance in other situations. Still, I'm not sure yet if the LZDecoder optimizations should go in 1.9. -- Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Re: java LZDecoder small improvement
On 2021-02-01 Brett Okken wrote:
> I have played with this quite a bit and have come up with a slightly
> modified change which does not regress for the smallest of the sample
> objects and shows a nice improvement for the 2 larger files.

It seems to regress horribly if dist is zero. A file with a very long
sequence of the same byte is good for testing. The problem is that tmp
is almost always 1 and then each arraycopy call will copy exactly one
byte. The overhead is very high compared to doing the copying in a loop
like in the original code.

Below is a different version which is a little faster with Java 15 but
worse than the current simple code on Java 8 (tested on the same
computer and OS). The improvement over the current code is like 3-5 %
with Java 15, so not a lot but not insignificant either (such
optimizations add up). However, if the change is neutral or clearly
negative on Java 8, maybe this patch isn't worth the complexity yet.
Java 8 is still supported by its upstream.

Maybe you get different results. Make sure the uncompressed size of the
test files is several times larger than the dictionary size.

With the current knowledge I think this patch will need to wait past
XZ for Java 1.9.

diff --git a/src/org/tukaani/xz/lz/LZDecoder.java b/src/org/tukaani/xz/lz/LZDecoder.java
index 85b2ca1..8b3564c 100644
--- a/src/org/tukaani/xz/lz/LZDecoder.java
+++ b/src/org/tukaani/xz/lz/LZDecoder.java
@@ -92,14 +92,39 @@ public final class LZDecoder {
         pendingDist = dist;
 
         int back = pos - dist - 1;
-        if (dist >= pos)
+        if (dist >= pos) {
+            // We won't get here if the dictionary isn't full.
+            assert full == bufSize;
+
+            // The distance wraps around to the end of the cyclic
+            // dictionary buffer. Here we will never copy more than
+            // dist + 1 bytes and so the copying won't repeat from its
+            // own output. Thus, we can always use arraycopy safely.
             back += bufSize;
+            int copySize = Math.min(bufSize - back, left);
+            assert copySize <= dist + 1;
+
+            System.arraycopy(buf, back, buf, pos, copySize);
+            pos += copySize;
+            back = 0;
+            left -= copySize;
 
-        do {
-            buf[pos++] = buf[back++];
-            if (back == bufSize)
-                back = 0;
-        } while (--left > 0);
+            if (left == 0)
+                return;
+        }
+
+        assert left > 0;
+
+        if (left > dist + 1) {
+            // We are copying more than dist + 1 bytes and thus will
+            // partly copy from our own output.
+            do {
+                buf[pos++] = buf[back++];
+            } while (--left > 0);
+        } else {
+            System.arraycopy(buf, back, buf, pos, left);
+            pos += left;
+        }
 
         if (full < pos)
             full = pos;

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
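[Editor's note: the patch above is careful to use System.arraycopy only when the copy cannot read its own output. The reason is that arraycopy on overlapping ranges behaves as if the source were first copied to a temporary buffer, which is not the repeating behavior an LZ77 decoder needs. A small self-contained demonstration (the helper names are illustrative):]

```java
public class OverlapDemo {
    // Byte-by-byte copy with LZ77 semantics: may read its own output,
    // so a short distance repeats the recent bytes.
    public static String lzCopy(String s, int from, int to, int len) {
        byte[] b = s.getBytes();
        for (int i = 0; i < len; ++i)
            b[to + i] = b[from + i];
        return new String(b);
    }

    // The same copy via System.arraycopy, which on overlapping ranges
    // behaves as if the source were copied to a temporary buffer first.
    public static String arrayCopy(String s, int from, int to, int len) {
        byte[] b = s.getBytes();
        System.arraycopy(b, from, b, to, len);
        return new String(b);
    }

    public static void main(String[] args) {
        // LZ77: copying 6 bytes from distance 1 repeats the last byte.
        System.out.println(lzCopy("ab______", 1, 2, 6));    // abbbbbbb
        // arraycopy gives a different (non-repeating) result.
        System.out.println(arrayCopy("ab______", 1, 2, 6)); // abb_____
    }
}
```

This is why the patch only calls arraycopy on the branches where at most dist + 1 bytes are copied, and keeps the byte-by-byte loop for the self-overlapping case.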
Re: [xz-devel] java crc64 implementation
I assume you accidentally didn't post to the list so I'm quoting your
email in full.

On 2021-02-02 Brett Okken wrote:
> > while ((i & 3) != 1 && i < end)
>
> Shouldn't that be (i & 3) != 0?
> An offset of 0 should not enter this loop, but 0 & 3 does not equal 1.

The idea really is that an offset of 1 doesn't enter the loop, thus the
main slicing-by-4 loop is misaligned. I don't know why it makes a
difference and I'm no longer even sure why I decided to try it. You can
try the different (i & 3) != { 0, 1, 2, 3 } combinations.

> > If I change the buffer size from 8192 to 8191 in XZDecDemo.java,
> > then "Modified slicing-by-4" somehow becomes as fast as the
> > "Misaligned slicing-by-4". On the surface it sounds weird because
> > the buffer still has the same alignment, it's just one byte smaller
> > at the end.
>
> My guess is that this has to do with how many while loops need to be
> executed/optimized.
> Making it one byte smaller guarantees one of the additional while
> loops actually has to execute. Depending on the initial offset,
> potentially both need to execute.

Maybe you are right, but the confusing thing is that those while-loops
are supposedly slower than the for-loop. :-)

> > It would be nice if you could compare these too and suggest what
> > should be committed. Maybe you can figure out an even better
> > version. Different CPU or 32-bit Java or other things may give
> > quite different results.
>
> Truncating the crc to an int 1 time in the loop seems like a clear
> winner. I will play with this in my benchmark.
> My benchmark is calculating the crc64 of 8k of random bytes. I will
> change it to include misaligned read as well.

Thanks.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] xz-java minor read improvements
On 2021-01-29 Brett Okken wrote:
> Here are some small improvements when creating new BlockInputStream
> instances. This reduces the size of the byte[] for the block header to
> the actual size

I committed this part. Thanks!

> and replaces use of ByteArrayInputStream, which has synchronized
> methods, with a ByteBuffer, which provides the same functionality
> without synchronization.

Hmm, it sounds good but I don't like that decodeVLI needs to be
duplicated. The performance of header decoding in BlockInputStream is
fairly unimportant; the performance bottlenecks are elsewhere. Keeping
the code tidy matters more.

Obviously one could wrap a ByteBuffer into an InputStream, or one could
change IndexHash.java and IndexDecoder.java to work with something
else. Those Index* classes might be reading from an InputStream that
has a high read()-call overhead for reasons other than locking
(although in such cases the application could then be using
BufferedInputStream). Unless you have a practical situation in mind
where these optimizations make a measurable difference, it's best to
not make them more complex than they are.

By the way, I committed module-info.java support as a multi-release
JAR, so multi-release can be used for other things too.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
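[Editor's note: a hedged sketch of the idea in the quoted patch. Wrapping a byte[] in a ByteBuffer gives positional reads without the per-call locking of ByteArrayInputStream's synchronized methods. The decodeVLI below follows the .xz variable-length integer encoding (7 bits per byte, high bit set means another byte follows); it is an illustration, not the library's actual code.]

```java
import java.nio.ByteBuffer;

public class VLISketch {
    // Decode one .xz-style VLI from the buffer's current position.
    public static long decodeVLI(ByteBuffer in) {
        long value = 0;
        for (int shift = 0; shift < 63; shift += 7) {
            int b = in.get() & 0xFF;          // unsynchronized read
            value |= (long)(b & 0x7F) << shift;
            if ((b & 0x80) == 0)              // high bit clear: done
                return value;
        }
        throw new IllegalArgumentException("VLI too long");
    }

    public static void main(String[] args) {
        ByteBuffer buf = ByteBuffer.wrap(new byte[] {
                0x05,                       // one-byte VLI: 5
                (byte)0x80, 0x01,           // two-byte VLI: 128
                (byte)0xFF, 0x7F });        // two-byte VLI: 16383
        System.out.println(decodeVLI(buf)); // 5
        System.out.println(decodeVLI(buf)); // 128
        System.out.println(decodeVLI(buf)); // 16383
    }
}
```

As the message notes, the gain is small for header decoding; the sketch only shows why the ByteBuffer variant avoids synchronization, not that it is worth the duplication.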
Re: [xz-devel] java crc64 implementation
Hello!

I need to make a new release in the near future so that a minor problem
in .7z support in Apache Commons Compress can be fixed. I thought I
could include the simpler and safer changes from your long list of
patches, and the CRC64 improvement might be such a change.

On 2021-01-21 Brett Okken wrote:
> Here is a slice by 4 implementation. It goes byte by byte to easily be
> compatible with older jdks. Performance wise, it is pretty comparable
> to the java port of Adler's stackoverflow implementation:
>
> Benchmark                Mode  Cnt      Score     Error  Units
> Hash64Benchmark.adler    avgt    5   6850.172 ± 251.528  ns/op
> Hash64Benchmark.crc64    avgt    5  16347.986 ±  53.702  ns/op
> Hash64Benchmark.slice4   avgt    5   6842.010 ± 393.149  ns/op

Thank you! I played around a bit. Seems that the code is *really*
sensitive to tiny changes. It's possible that it depends on the
computer and such things too; I only tried on one machine.

I timed decompression of a gigabyte of null bytes using XZDecDemo and
OpenJDK 15 on x86-64. This isn't very accurate but it's enough to sort
them:

    Original                 6.8 s
    Modified original        6.2 s
    Your slicing-by-4        5.8 s
    Modified slicing-by-4    5.6 s
    Misaligned slicing-by-4  5.2 s
    xz -t                    3.6 s

Modified original:

--- a/src/org/tukaani/xz/check/CRC64.java
+++ b/src/org/tukaani/xz/check/CRC64.java
@@ -38,7 +38,8 @@ public class CRC64 extends Check {
         int end = off + len;
 
         while (off < end)
-            crc = crcTable[(buf[off++] ^ (int)crc) & 0xFF] ^ (crc >>> 8);
+            crc = crcTable[(buf[off++] & 0xFF) ^ ((int)crc & 0xFF)]
+                  ^ (crc >>> 8);
     }
 
     public byte[] finish() {

Modified slicing-by-4:

    public void update(byte[] buf, int off, int len) {
        final int end = off + len;
        int i = off;

        for (int end4 = end - 3; i < end4; i += 4) {
            final int tmp = (int)crc;
            crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^
                  TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0xFF)] ^
                  (crc >>> 32) ^
                  TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0xFF)] ^
                  TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0xFF)];
        }

        while (i < end)
            crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)]
                  ^ (crc >>> 8);
    }

Misaligned slicing-by-4 adds an extra while-loop to the beginning:

    public void update(byte[] buf, int off, int len) {
        final int end = off + len;
        int i = off;

        while ((i & 3) != 1 && i < end)
            crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)]
                  ^ (crc >>> 8);

        for (int end4 = end - 3; i < end4; i += 4) {
            final int tmp = (int)crc;
            crc = TABLE[3][(tmp & 0xFF) ^ (buf[i] & 0xFF)] ^
                  TABLE[2][((tmp >>> 8) & 0xFF) ^ (buf[i + 1] & 0xFF)] ^
                  (crc >>> 32) ^
                  TABLE[1][((tmp >>> 16) & 0xFF) ^ (buf[i + 2] & 0xFF)] ^
                  TABLE[0][((tmp >>> 24) & 0xFF) ^ (buf[i + 3] & 0xFF)];
        }

        while (i < end)
            crc = TABLE[0][(buf[i++] & 0xFF) ^ ((int)crc & 0xFF)]
                  ^ (crc >>> 8);
    }

If I change the buffer size from 8192 to 8191 in XZDecDemo.java, then
"Modified slicing-by-4" somehow becomes as fast as the "Misaligned
slicing-by-4". On the surface it sounds weird because the buffer still
has the same alignment, it's just one byte smaller at the end. The same
thing happens too if the buffer size is kept at 8192 but the first byte
isn't used (making the beginning of the buffer misaligned). Moving the
"(crc >>> 32)" to a different position in the xor sequence can affect
things too... it's almost spooky. ;-)

It would be nice if you could compare these too and suggest what should
be committed. Maybe you can figure out an even better version.
Different CPU or 32-bit Java or other things may give quite different
results.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] Compatibility between CMake config file and FindLibLZMA.cmake
On 2021-01-23 Markus Rickert wrote:
> This could be solved by adding an alias to the config file:
> add_library(LibLZMA::LibLZMA ALIAS liblzma::liblzma)
>
> An additional improvement would be to enable this on case-sensitive
> file systems as well. For this, the config file would need to be
> renamed from liblzmaConfig.cmake to liblzma-config.cmake (and the
> version file to liblzma-config-version.cmake), see [2].

I have committed both of your suggestions (hopefully correctly).
Thanks!

Some extra thoughts: There are some differences between FindLibLZMA and
the config file:

  - FindLibLZMA doesn't #define LZMA_API_STATIC when building against
    static liblzma. LZMA_API_STATIC omits __declspec(dllimport) from
    liblzma function declarations on Windows.

  - FindLibLZMA sets a few CMake cache variables that the config file
    doesn't, for example, LIBLZMA_HAS_EASY_ENCODER. I have no idea if
    there are packages that care about this.

  - The config file has find_dependency(Threads) while FindLibLZMA
    doesn't. This can affect the linker flags.

Perhaps there are other details affecting compatibility. I just wonder
how big a mistake it was to use liblzma::liblzma in the config file. I
guess it's too late to change it now.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] [RFC 2/2] Add xxHash, XX3 (128bit) for hashing.
On 2021-01-20 Sebastian Andrzej Siewior wrote:
> On 2021-01-20 00:37:06 [+0100], Sebastian Andrzej Siewior wrote:
> > So this is better than crc64 and close to none while doing
> > something ;)
>
> xz -tv -T0 with crc64 reports:
>   100 %      10,2 GiB / 40,0 GiB = 0,255   1,1 GiB/s       0:35
> and the same archive with xxh3:
>   100 %      10,2 GiB / 40,0 GiB = 0,255   1,1 GiB/s       0:34
>
> which looks like it is not worth the trouble.

If there were a fast algorithm in .xz, then it would be worth the
trouble. Having such an algorithm was in the early plans, but so were a
few other nice things, and many never materialized.

I will look at the SHA-256 patch later. There are unusually many things
in the queue of XZ-related things.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
Re: [xz-devel] [PATCH v2] liblzma: Add multi-threaded decoder
Hello!

I haven't made much progress with this still, I'm sorry. :-( Below are
comments about a few small details. It's not much but I will (slowly)
keep reading and testing. I applied the outq patch too. The performance
numbers you posted looked promising.

(1) Segfault due to thr->outbuf == NULL

I changed CHUNK_SIZE to 1 to test corner cases. I used
good-1-block_header-1.xz as the test file. It can segfault in
worker_decoder() on the line calling thr->block_decoder.code(...)
because thr->outbuf is NULL (so the problem was introduced in the outq
patch). This happens because of "thr->outbuf = NULL;" later in the
function. It looks like it marks the outbuf finished and returns the
thread to the pool too early, or forgets to set thr->state = THR_IDLE.
As a temporary workaround, I added "thr->state = THR_IDLE;" after
"thr->outbuf = NULL;".

(2) Block decoder must return LZMA_STREAM_END on success

Because of the end marker and integrity check, the output buffer will
be full before the last bytes of input have been processed by the Block
decoder. Thus it is not enough to look at the input and output
positions to determine when decoding has finished; only
LZMA_STREAM_END should be used to determine that decoding was
successful. In theory it is OK to mark the outbuf as finished once the
output is full, but for simplicity I suggest doing so (and returning
the thread to the pool) only after LZMA_STREAM_END.

I committed a new test file bad-1-check-crc32-2.xz. The last byte in
the Block (the last byte of Check) is wrong. Change CHUNK_SIZE to 1 and
try "xz -t -T2 bad-1-check-crc32-2.xz". The file must be detected to be
corrupt (LZMA_DATA_ERROR).

(3) Bad input where the whole input or output buffer cannot be used

In the old single-threaded decoding, lzma_code() will eventually return
LZMA_BUF_ERROR if the calls to lzma_code() cannot make any progress,
that is, no more input is consumed and no more output is produced.

This condition can happen with correct code if the input file is
corrupt in a certain way, for example, a truncated .xz file. Since the
no-progress detection is centralized in lzma_code(), the internal
decoders including the Block decoder don't try to detect this
situation. Currently this means that worker_decoder() should detect it
to catch bad input and prevent hanging on certain malformed Blocks.
However, since the Block decoder knows both Compressed Size and
Uncompressed Size, I think I will improve the Block decoder instead, so
don't do anything about this for now.

I committed two test files, bad-1-lzma2-9.xz and bad-1-lzma2-10.xz. The
-9 may make worker_decoder() not notice that the Block is invalid. The
-10 makes the decoder hang. Like I said, I might fix these by changing
the Block decoder.

(4) Usage of partial_update in worker_decoder()

Terminology: the main mutex means coder->mutex, alias
thr->coder->mutex.

In worker_decoder(), the main mutex is locked every time there is new
output available in the worker thread. partial_update is only used to
determine when to signal thr->coder->cond. To reduce contention on the
main mutex, worker_decoder() could lock it only when

  - decoding of the Block has been finished (successfully or
    unsuccessfully, that is, ret != LZMA_OK), or

  - there is new output available and partial_update is true; if
    partial_update is false, thr->outbuf->pos is not touched.

This way only one worker will be frequently locking the main mutex.
However, I haven't tried it and thus don't know how much this affects
performance in practice. One possible problem might be that it may
introduce a small delay in output availability when the main thread
switches to reading from the next outbuf in the list.

(5) Use of mythread_condtime_set()

In the encoder the absolute time is calculated once per lzma_code()
call. The comment in wait_for_work() in stream_encoder_mt.c was wrong.
The reason the absolute time is calculated once per lzma_code() call is
to ensure that blocking multiple times won't make the timeout
ineffective if each blocking takes less than timeout milliseconds. So
it should be done similarly in the decoder.

(6) Use of lzma_outq_enable_partial_output()

It should be safe to call it unconditionally:

    if (thr->outbuf == coder->outq.head)
        lzma_outq_enable_partial_output(&coder->outq,
                                        thr_do_partial_update);

If outq.head is something else, it is either already finished or
partial output has already been enabled. In both cases
lzma_outq_enable_partial_output() will do nothing.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
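[Editor's note: a small sketch of the timeout rule described in point (5). Computing one absolute deadline per call means that several successive blocking waits share a single budget, so they cannot together exceed the timeout even if each individual wait is short. The helper name is illustrative; the real code uses mythread_condtime_set() in C.]

```java
public class DeadlineDemo {
    // Given one absolute deadline, each wait may only use what is
    // left of the overall budget (never a fresh full timeout).
    public static long remainingMillis(long deadline, long now) {
        return Math.max(0, deadline - now);
    }

    public static void main(String[] args) {
        long deadline = 1000;   // absolute time, computed once per call
        // Successive waits at t=0, t=300, t=800 get shrinking budgets
        // instead of 1000 ms each:
        System.out.println(remainingMillis(deadline, 0));    // 1000
        System.out.println(remainingMillis(deadline, 300));  // 700
        System.out.println(remainingMillis(deadline, 800));  // 200
        System.out.println(remainingMillis(deadline, 1200)); // 0
    }
}
```

Recomputing a relative timeout before every wait would instead allow the total blocking time to grow without bound, which is the bug the message describes.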
Re: [xz-devel] java crc64 implementation
On 2021-01-13 Brett Okken wrote:
> Mark Adler has posted an optimized crc64 implementation on
> stackoverflow[1]. This can be reasonably easily ported to java (that
> post has a link to a java impl on github[2] which warrants a little
> clean up, but gives a decent idea).
>
> I did a quick benchmark calculating the crc64 over 8KB and the results
> were impressive:
>
> Benchmark              Mode  Cnt      Score   Error  Units
> Hash64Benchmark.adler  avgt    5   6908.677 ± 47.790  ns/op
> Hash64Benchmark.crc64  avgt    5  16343.091 ± 64.089  ns/op

The CRC64 implementation in XZ for Java is indeed a basic version. I
wanted to keep things simple in the beginning and didn't think about it
much later, since the Java version of XZ is slower than the C version
for other reasons anyway.

In XZ Utils, the slicing-by-4 method is used for CRC64 and slicing-by-8
for CRC32. A reason for not using by-8 for CRC64 is to reduce CPU L1
cache usage: by-4 with CRC64 needs an 8 KiB lookup table, by-8 needs
16 KiB. Micro-benchmarking with a big table can look good, but when the
CRC is just a small part of the application the results are more
complicated (more cache misses to load the bigger table, more other
data pushed out of the cache). It is essential to note that the
decisions about table sizes were made over a decade ago with 32-bit
CPUs, and it's very much possible that different decisions would be
better nowadays.

The version by Mark Adler [1] uses slicing-by-8 with CRC64. It also
includes a method to combine the CRC values of two blocks, which is
great if one uses threads to compute a CRC. A threaded CRC doesn't
sound useful with XZ since LZMA isn't that fast anyway.

A side note: GNU gzip uses the basic method for CRC32 [3] while zlib
uses slicing-by-8. Since Deflate is fast to decode, replacing the CRC32
in GNU gzip would make a clear difference in decompression speed.

[3] http://git.savannah.gnu.org/cgit/gzip.git/tree/util.c#n126

> [1] -
> https://stackoverflow.com/questions/20562546/how-to-get-crc64-distributed-calculation-use-its-linearity-property/20579405#20579405
>
> [2] -
> https://github.com/MrBuddyCasino/crc-64/blob/master/crc-64/src/main/java/net/boeckling/crc/CRC64.java

I didn't find license information in the [2] repository. XZ for Java is
public domain so the license likely wouldn't match anyway. Porting from
XZ Utils shouldn't be too hard, depending on how much one wishes to
optimize it:

  - src/liblzma/check/crc64_fast.c
  - src/liblzma/check/crc_macros.h
  - src/liblzma/check/crc64_tablegen.c (or should it just include
    pre-computed tables like liblzma and zlib do?)

Unlike the C version in [1], the Java version in [2] reads the input
byte[] array byte by byte. Using a fast method to read 8 *aligned*
bytes at a time in native byte order should give more speed; after all,
it's one of the benefits of this method that one can read multiple
input bytes at a time.

A public domain patch for a faster CRC64 for XZ for Java is welcome.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
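[Editor's note: for reference, a minimal self-contained sketch of the "basic version" of CRC64 the messages above start from: the table-driven byte-at-a-time method with the reflected ECMA-182 polynomial, all-ones initial value and final XOR, as used by .xz. The slicing-by-N variants build larger tables from the same recurrence. This is an illustration, not the library's code.]

```java
public class CRC64Basic {
    // Reflected form of the ECMA-182 polynomial used by .xz.
    static final long POLY = 0xC96C5795D7870F42L;
    static final long[] TABLE = new long[256];

    static {
        // One table entry per input byte value: eight shift/XOR rounds.
        for (int b = 0; b < 256; ++b) {
            long r = b;
            for (int i = 0; i < 8; ++i)
                r = (r >>> 1) ^ (POLY & -(r & 1));
            TABLE[b] = r;
        }
    }

    public static long crc64(byte[] buf, int off, int len) {
        long crc = ~0L;                       // initial value: all ones
        for (int i = off; i < off + len; ++i)
            crc = TABLE[(buf[i] ^ (int)crc) & 0xFF] ^ (crc >>> 8);
        return ~crc;                          // final XOR: all ones
    }

    public static void main(String[] args) {
        // Standard check value for CRC-64/XZ: "123456789".
        long crc = crc64("123456789".getBytes(), 0, 9);
        System.out.println(Long.toHexString(crc)); // 995dc9bbdf1939fa
    }
}
```

The inner update line is exactly the one the "Modified original" patch earlier in the thread rewrites for speed; slicing-by-4 then processes four such steps per loop iteration from four precomputed tables.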
Re: [xz-devel] xz-java and newer java
On 2021-01-11 Brett Okken wrote:
> I threw together a quick jmh test, and there is no value in the
> changes to Hash234.

OK, let's forget that then.

On 2021-01-16 Brett Okken wrote:
> I have found a way to use VarHandle byte array access at runtime in
> code which is compile time compatible with jdk 7. So here is an
> updated ArrayUtil class which will use a VarHandle to read long values
> in jdk 9+. If that is not available, it will attempt to use
> sun.misc.Unsafe. If that cannot be found, it falls back to standard
> byte by byte comparison.

Sounds promising. :-) You have already done quite a bit of work in both
writing code and benchmarking. Thank you!

The method you ended up with is similar to
src/liblzma/common/memcmplen.h in XZ Utils. There the 8-byte version is
used on 64-bit systems and the 4-byte version on 32-bit systems. In XZ
Utils, the SSE2 version (16-byte comparison) is faster than the 4-byte
compare on 32-bit x86, but on x86-64 the 8-byte version has similar
speed or is faster than the SSE2 version (it depends on the CPU).

Have you tested with 32-bit Java too? It's quite possible that it's
better to use ints than longs on a 32-bit system. If so, that should be
detected at runtime too, I guess.

In XZ Utils the arrays have extra room at the end so that memcmplen.h
can always read 4/8/16 bytes at a time. Since this is easy to do, I
think it should be done in XZ for Java too to avoid special handling of
the last bytes.

> I did add an index bounds check for the unsafe implementation and
> found it had minimal impact on over all performance.

Since Java in general is memory safe, having bounds checks with Unsafe
is nice as long as it doesn't hurt performance too much. This

    if (aFromIndex < 0 || aFromIndex + length > a.length
            || bFromIndex < 0 || bFromIndex + length > b.length) {

is a bit relaxed though since it doesn't catch integer overflows.
Something like this would be more strict:

    if (length < 0
            || aFromIndex < 0 || aFromIndex > a.length - length
            || bFromIndex < 0 || bFromIndex > b.length - length) {

> Using VarHandle (at least on jdk 11) offers very similar performance
> to Unsafe across all 3 files I used for benchmarking.

OK. I cannot comment on the details much because I'm not familiar with
either API for now.

Comparing byte arrays as ints or longs results in unaligned/misaligned
memory access. The MethodHandles.byteArrayViewVarHandle docs say that
this is OK. A quick web search gave me the impression that it might not
be safe with Unsafe though. Can you verify how it is with Unsafe? If it
isn't allowed, dropping support for Unsafe may be fine. It's just the
older Java versions that would use it anyway.

It is *essential* that the code works well also on archs that don't
have fast unaligned access. Even if the VarHandle method is safe, it's
not clear how the performance is on archs that don't support fast
unaligned access. It would be bad to add an optimization that is good
on x86-64 but counter-productive on some other archs. One may need
arch-specific code just like there is in XZ Utils, although on the
other hand it would be nice to keep the Java code less complicated. Do
you have a way to check how these methods behave on Android and ARM? (I
understand that this might be too much work to check. This may be
skipped.)

I wish to add module-info.java in the next release. Do these new
methods affect what should be in module-info.java? With the current
code this seems to be enough:

    module org.tukaani.xz {
        exports org.tukaani.xz;
    }

> final int leadingZeros = (int)LEADING_ZEROS.invokeExact(diff);
> return i + (leadingZeros / Byte.SIZE);

Seems that Java might not optimize that division to a right shift. It
could be better to use "leadingZeros >>> 3".

> I know you said you were not going to be able to work on xz-java for
> awhile, but given these benchmark results, which really exceeded my
> expectations, could this get some priority to release?

I understood that it's 9-18 % faster. That is significant, but it's
still a performance optimization only, not an important bug fix, and to
me the code doesn't feel completely ready yet (for example, the
unaligned access is important to get right). (Compare to the threaded
decompression support that is coming to XZ Utils. It will speed things
up a few hundred percent.)

Can you provide a complete patch to make testing easier (or if that's
not possible, complete copies of the modified files)? Also, please try
to wrap the lines so that they stay within 80 columns (with some long
unbreakable strings this may not be possible; then those lines can be
overlong instead of messing up the indentation).

I think your patch will find its way into XZ for Java in some form,
but once again I repeat that it will take some time. These XZ projects
are only a hobby for me and currently I don't even turn on my computer
every day.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
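[Editor's note: a hedged sketch of the 8-bytes-at-a-time match length idea discussed above, in the spirit of memcmplen.h, using the VarHandle byte-array view available since Java 9. Little-endian order is assumed here so the first differing byte maps to the low bits of the XOR (Long.numberOfTrailingZeros); a big-endian variant would use leading zeros instead, as in the quoted LEADING_ZEROS snippet. Names are illustrative, not the actual ArrayUtil code.]

```java
import java.lang.invoke.MethodHandles;
import java.lang.invoke.VarHandle;
import java.nio.ByteOrder;

public class MatchLen {
    static final VarHandle LONGS = MethodHandles.byteArrayViewVarHandle(
            long[].class, ByteOrder.LITTLE_ENDIAN);

    // Length of the common prefix of a[aOff..] and b[bOff..], at most
    // limit. Compares 8 bytes at a time, then falls back to bytes.
    public static int matchLen(byte[] a, int aOff,
                               byte[] b, int bOff, int limit) {
        int len = 0;
        while (len + 8 <= limit) {
            long x = (long)LONGS.get(a, aOff + len);
            long y = (long)LONGS.get(b, bOff + len);
            if (x != y)
                // Low byte is the first byte in little-endian order.
                return len + (Long.numberOfTrailingZeros(x ^ y) >>> 3);
            len += 8;
        }
        while (len < limit && a[aOff + len] == b[bOff + len])
            ++len;
        return len;
    }

    public static void main(String[] args) {
        byte[] a = "abcdefghijXlmnop".getBytes();
        byte[] b = "abcdefghijklmnop".getBytes();
        System.out.println(matchLen(a, 0, b, 0, 16)); // 10
    }
}
```

Note the `>>> 3` instead of `/ Byte.SIZE`, per the remark above, and that padding the arrays (as XZ Utils does) would let the 8-byte loop run to the end without the byte-by-byte tail.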
Re: [xz-devel] [PATCH] xz: Fix setting memory limit on 32-bit systems
On 2021-01-10 Sebastian Andrzej Siewior wrote:
> I hope for sane defaults :)

I hope so too. So far I have felt that the suggested solutions have
significant flaws or downsides, and I'm not able to see what is a good
enough compromise. As a result the discussion hasn't progressed much
and I feel it's partly my fault, sorry. I will try again:

I have understood that *in practice* the problem with the xz command
line tool is limited to "xz -T0" usage, so fixing this use case is
enough for most people. Please correct me if I missed something.

The change in XZ Utils 5.2.5 helps a little with 32-bit xz running
under a 64-bit kernel, but only if one specifies a memory usage limit
like -M90% together with -T0. To make plain -T0 work too, in an earlier
email I suggested that -T0 could also imply a memory usage limit if no
limit was otherwise specified (a preliminary patch was included too). I
have been hesitant to make changes to the defaults of the memory usage
limiter, but this solution would only affect a very specific situation
and thus I feel it would be fine. Comments would be appreciated.

The problem with applications using liblzma and running out of address
space sounds harder to fix. As I explained in another email, making
liblzma more robust against memory allocation failures is not a perfect
fix and can still result in severe problems depending on how the
application as a whole works (with some apps it could be enough).

An alternative "fix" for the liblzma case could be adding a simple API
function that would scale down the number of threads in a lzma_mt
structure based on a memory usage limit and on whether the application
is 32 bits. Currently the thread count and LZMA2 settings adjusting
code is in xz, not in liblzma.

> Anyway. Not to overcomplicate things: On Linux you can obtain the
> available system memory which I would cap to 2 or 2.5 GiB by default.
> Nobody should be hurt by that.

If the full 4 GiB of address space is available, capping to 2 or
2.5 GiB when the available memory isn't known would mean fewer threads
than with the 4020 MiB limit. Obviously this is less bad than failing
due to running out of address space, but it still makes me feel that if
available memory is used on Linux, the idea should be ported to other
OSes too.

The idea for the current 4020 MiB special limit is based on a patch
that was in use in FreeBSD to solve the problem of 32-bit xz on a
64-bit kernel. So at least FreeBSD should be supported to not make
32-bit xz worse under a 64-bit FreeBSD kernel.

In liblzma, if a new function is added to reduce the thread count based
on a memory usage limit, capping the limit to 2-3 GiB for 32-bit
applications could be fine even if there is more available memory.
Being conservative means fewer threads, but it would make it more
likely that things keep working if the application allocates memory
after liblzma has already done so.

Oh well. :-( I think I still made this sound like a mess. In any case,
let's at least try to find some solution to the "xz -T0" case. It would
be nice to hear if my suggestion makes any sense. Thanks.

--
Lasse Collin | IRC: Larhzu @ IRCnet & Freenode
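[Editor's note: a hedged sketch of the "-T0 implies a memory limit" idea from the message above: start from the CPU count, then scale the worker count down so that threads times per-thread memory stays under the limit, never dropping below one thread. The numbers and names are illustrative, not xz's actual thread-count logic.]

```java
public class ThreadCap {
    // Pick a thread count from the CPU count, capped by the memory
    // usage limit; at least one thread is always allowed.
    public static int threads(int cpus, long memLimit, long memPerThread) {
        return Math.max(1, (int)Math.min(cpus, memLimit / memPerThread));
    }

    public static void main(String[] args) {
        long mib = 1024 * 1024;
        // 8 CPUs, a 4020 MiB cap, a hypothetical ~700 MiB per thread:
        // the memory limit, not the CPU count, decides (5 threads).
        System.out.println(threads(8, 4020 * mib, 700 * mib));
        // A tiny limit still leaves one (possibly over-limit) thread,
        // mirroring how xz degrades settings rather than refusing.
        System.out.println(threads(8, 100 * mib, 700 * mib));
    }
}
```

The real adjustment also shrinks LZMA2 settings when even one thread exceeds the limit, which this sketch deliberately leaves out.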