Re: SMP Version of tar

2012-10-10 Thread Kurt Lidl
On Tue, Oct 09, 2012 at 09:54:03PM -0700, Tim Kientzle wrote:
 
 On Oct 8, 2012, at 3:21 AM, Wojciech Puchar wrote:
 
  Not necessarily.  If I understand correctly what Tim means, he's talking
  about an in-memory compression of several blocks by several separate
  threads, and then - after all the threads have compressed their
  
  but gzip format is single stream. dictionary IMHO is not reset every X 
  kilobytes.
  
  parallel gzip is possible but not with same data format.
 
 Yes, it is.
 
 The following creates a compressed file that
 is completely compatible with the standard
 gzip/gunzip tools:
 
* Break file into blocks
* Compress each block into a gzip file (with gzip header and trailer 
 information)
* Concatenate the result.
 
 This can be correctly decoded by gunzip.
 
 In theory, you get slightly worse compression.  In practice, if your blocks 
 are reasonably large (a megabyte or so each), the difference is negligible.

I am not sure, but I think this conversation might involve a slight
misunderstanding due to imprecisely specified language, while the
technical points are actually in agreement.

Tim is correct that the gzip data stream allows concatenation of
compressed blocks of data: you can break the input stream into
a bunch of blocks [A, B, C, etc.], compress them into
[A.gz, B.gz, C.gz, etc.], concatenate those, and when uncompressed
you will get the original input stream.

I think that Wojciech's point is that the compressed data stream
for the single data stream is different from the compressed data
stream of [A.gz, B.gz, C.gz, etc.].  Both will decompress to the same
thing, but the intermediate compressed representation will be different.
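
For illustration, a minimal C sketch of exactly that property, using zlib's
gz* interface (the file name and sample data are made up): it writes two
independent gzip members into one file and then reads the whole file back
as one stream, just as gunzip would.

/*
 * Sketch: two gzip members concatenated in one file still decompress
 * to the original data.  Build with: cc concat.c -lz
 */
#include <stdio.h>
#include <string.h>
#include <zlib.h>

int
main(void)
{
    const char *blocks[] = { "hello, ", "world\n" };
    char buf[64];
    gzFile f;
    int i, n;

    /* Compress each block as its own gzip member ("wb" first, "ab" after),
     * so the file ends up as [A.gz][B.gz]. */
    for (i = 0; i < 2; i++) {
        f = gzopen("concat.gz", i == 0 ? "wb" : "ab");
        if (f == NULL)
            return 1;
        gzwrite(f, blocks[i], (unsigned)strlen(blocks[i]));
        gzclose(f);             /* each close emits a full gzip trailer */
    }

    /* zlib, like gunzip, keeps reading across member boundaries. */
    f = gzopen("concat.gz", "rb");
    if (f == NULL)
        return 1;
    n = gzread(f, buf, sizeof(buf) - 1);
    gzclose(f);
    if (n < 0)
        return 1;
    buf[n] = '\0';
    printf("%s", buf);          /* prints the reassembled "hello, world" */
    return 0;
}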

-Kurt


Re: SMP Version of tar

2012-10-10 Thread Wojciech Puchar


Tim is correct that the gzip data stream allows concatenation of
compressed blocks of data: you can break the input stream into
a bunch of blocks [A, B, C, etc.], compress them into
[A.gz, B.gz, C.gz, etc.], concatenate those, and when uncompressed
you will get the original input stream.
I think that Wojciech's point is that the compressed data stream
for the single data stream is different from the compressed data
stream of [A.gz, B.gz, C.gz, etc.].  Both will decompress to the same
thing, but the intermediate compressed representation will be different.


So, after your response, it is clear that a parallel-generated tar.gz will
be different and have slightly worse compression (which can be ignored), and
WILL be compatible with standard gzip, since gzip can decompress multiple
concatenated streams - which I wasn't aware of.


That's good. At the same time, a parallel tar will fall back to a single
thread when unpacking a standard .tar.gz - not a big deal, as gzip
decompression is ultrafast and I/O is usually the limit.





Re: SMP Version of tar

2012-10-09 Thread Wojciech Puchar

Not necessarily.  If I understand correctly what Tim means, he's talking
about an in-memory compression of several blocks by several separate
threads, and then - after all the threads have compressed their
respective blocks - writing out the result to the output file in order.
Of course, this would incur a small penalty in that the dictionary would
not be reused between blocks, but it might still be worth it.
All fine. I just wanted to point out that ungzipping a normal, standard
gzip file cannot be multithreaded, and that multithreaded-compressed gzip
output would be different.



Re: SMP Version of tar

2012-10-09 Thread Tim Kientzle

On Oct 8, 2012, at 3:21 AM, Wojciech Puchar wrote:

 Not necessarily.  If I understand correctly what Tim means, he's talking
 about an in-memory compression of several blocks by several separate
 threads, and then - after all the threads have compressed their
 
 but gzip format is single stream. dictionary IMHO is not reset every X 
 kilobytes.
 
 parallel gzip is possible but not with same data format.

Yes, it is.

The following creates a compressed file that
is completely compatible with the standard
gzip/gunzip tools:

   * Break file into blocks
   * Compress each block into a gzip file (with gzip header and trailer 
information)
   * Concatenate the result.

This can be correctly decoded by gunzip.

In theory, you get slightly worse compression.  In practice, if your blocks are 
reasonably large (a megabyte or so each), the difference is negligible.
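
For reference, the per-block step maps directly onto zlib's deflate API:
requesting windowBits = 15 + 16 makes deflate() emit a complete gzip member
(header and trailer) for that block.  A minimal sketch, with an invented
helper name and most error handling trimmed:

/* Compress one block into a self-contained gzip member.
 * Returns the member's size, or 0 on error.  The caller should size
 * the output buffer with deflateBound(). */
#include <string.h>
#include <zlib.h>

size_t
gzip_member(const unsigned char *in, size_t inlen,
    unsigned char *out, size_t outcap)
{
    z_stream zs;

    memset(&zs, 0, sizeof(zs));
    /* windowBits = 15 + 16 requests a gzip (not zlib) wrapper. */
    if (deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED,
        15 + 16, 8, Z_DEFAULT_STRATEGY) != Z_OK)
        return 0;
    zs.next_in = (Bytef *)in;
    zs.avail_in = (uInt)inlen;
    zs.next_out = out;
    zs.avail_out = (uInt)outcap;
    /* Z_FINISH compresses the whole block and appends the gzip trailer. */
    if (deflate(&zs, Z_FINISH) != Z_STREAM_END) {
        deflateEnd(&zs);
        return 0;
    }
    deflateEnd(&zs);
    return (size_t)(outcap - zs.avail_out);
}

Writing out buffers produced this way, in input order, yields exactly the
kind of file described above.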

Tim



Re: SMP Version of tar

2012-10-08 Thread Trond Endrestøl
On Sun, 7 Oct 2012 19:00+0200, Wojciech Puchar wrote:

  I would be willing to work on a SMP version of tar (initially just gzip or
  something).
  
  I don't have the best experience in compression, and how to multi-thread it,
  but I think I would be able to learn and help out.
 
 gzip cannot - it is single stream.
 
 bzip2 - no idea

Check out archivers/pbzip2.

 grzip (from ports, i use it) - can be multithreaded as it packs using fixed
 large chunks.

-- 
+----------------------------+------------------------------------+
| Vennlig hilsen,            | Best regards,                      |
| Trond Endrestøl,           | Trond Endrestøl,                   |
| IT-ansvarlig,              | System administrator,              |
| Fagskolen Innlandet,       | Gjøvik Technical College, Norway,  |
| tlf. mob.   952 62 567,    | Cellular...: +47 952 62 567,       |
| sentralbord 61 14 54 00.   | Switchboard: +47 61 14 54 00.      |
+----------------------------+------------------------------------+

Re: SMP Version of tar

2012-10-08 Thread Wojciech Puchar

gzip cannot - it is single stream.


gunzip commutes with cat, so gzip
compression can be multi-threaded
by compressing separate blocks and
concatenating the result.


Right. But the resulting file format must be different.


Re: SMP Version of tar

2012-10-08 Thread Peter Pentchev
On Mon, Oct 08, 2012 at 08:38:33AM +0200, Wojciech Puchar wrote:
 gzip cannot - it is single stream.
 
 gunzip commutes with cat, so gzip
 compression can be multi-threaded
 by compressing separate blocks and
 concatenating the result.
 
 right. but resulting file format must be different.

Not necessarily.  If I understand correctly what Tim means, he's talking
about an in-memory compression of several blocks by several separate
threads, and then - after all the threads have compressed their
respective blocks - writing out the result to the output file in order.
Of course, this would incur a small penalty in that the dictionary would
not be reused between blocks, but it might still be worth it.
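
A minimal sketch of that scheme, using one worker thread per block purely
for brevity (a real tool such as pigz uses a bounded thread pool); the
block size, file names and structure names here are illustrative only:

/* Compress blocks in parallel as independent gzip members, then write
 * the members to the output in input order.  Build with: cc -lz -lpthread */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <zlib.h>

struct job {
    const unsigned char *in;    /* one input block */
    size_t inlen;
    unsigned char *out;         /* its compressed gzip member */
    size_t outlen;
};

static void *
worker(void *arg)
{
    struct job *j = arg;
    z_stream zs;

    memset(&zs, 0, sizeof(zs));
    /* 15 + 16: deflate writes a gzip header/trailer for this block. */
    deflateInit2(&zs, Z_DEFAULT_COMPRESSION, Z_DEFLATED, 15 + 16, 8,
        Z_DEFAULT_STRATEGY);
    size_t cap = deflateBound(&zs, j->inlen);
    j->out = malloc(cap);
    zs.next_in = (Bytef *)j->in;
    zs.avail_in = (uInt)j->inlen;
    zs.next_out = j->out;
    zs.avail_out = (uInt)cap;
    deflate(&zs, Z_FINISH);             /* one shot; error checks omitted */
    j->outlen = cap - zs.avail_out;
    deflateEnd(&zs);
    return NULL;
}

int
main(void)
{
    /* Toy input; real blocks would be around a megabyte each. */
    static const char data[] = "some input that stands in for a tar stream";
    const size_t total = sizeof(data) - 1, blocksize = 16;
    size_t nblocks = (total + blocksize - 1) / blocksize;
    struct job *jobs = calloc(nblocks, sizeof(*jobs));
    pthread_t *tids = calloc(nblocks, sizeof(*tids));
    FILE *fp = fopen("out.gz", "wb");

    if (jobs == NULL || tids == NULL || fp == NULL)
        return 1;
    for (size_t i = 0; i < nblocks; i++) {
        jobs[i].in = (const unsigned char *)data + i * blocksize;
        jobs[i].inlen = (i == nblocks - 1) ? total - i * blocksize : blocksize;
        pthread_create(&tids[i], NULL, worker, &jobs[i]);
    }
    for (size_t i = 0; i < nblocks; i++) {  /* join and write in order */
        pthread_join(tids[i], NULL);
        fwrite(jobs[i].out, 1, jobs[i].outlen, fp);
        free(jobs[i].out);
    }
    fclose(fp);
    free(jobs);
    free(tids);
    return 0;   /* out.gz decompresses with plain gunzip */
}

The dictionary is reset at every block boundary, which is the small
compression penalty mentioned above.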

G'luck,
Peter

-- 
Peter Pentchev  r...@ringlet.net r...@freebsd.org pe...@packetscale.com
PGP key:http://people.FreeBSD.org/~roam/roam.key.asc
Key fingerprint FDBA FD79 C26F 3C51 C95E  DF9E ED18 B68D 1619 4553
Hey, out there - is it *you* reading me, or is it someone else?




Re: SMP Version of tar

2012-10-08 Thread Wojciech Puchar

Not necessarily.  If I understand correctly what Tim means, he's talking
about an in-memory compression of several blocks by several separate
threads, and then - after all the threads have compressed their


But the gzip format is a single stream; the dictionary, IMHO, is not reset
every X kilobytes.


Parallel gzip is possible, but not with the same data format.


By the way, in my opinion grzip (in ports/archivers/grzip) is some of the
most undervalued software - almost unknown. It compresses faster than bzip2,
with better results; it is very simple, and making a parallel version is
trivial, since there is just a procedure to compress a single block.


Personally, I use gzip for fast compression and grzip for strong
compression, and don't use bzip2 at all.



Re: SMP Version of tar

2012-10-07 Thread Wojciech Puchar
I would be willing to work on a SMP version of tar (initially just gzip or 
something).


I don't have the best experience in compression, and how to multi-thread it, 
but I think I would be able to learn and help out.


gzip cannot - it is a single stream.

bzip2 - no idea

grzip (from ports; I use it) can be multithreaded, as it packs using
fixed large chunks.



Re: SMP Version of tar

2012-10-07 Thread Tim Kientzle
On Oct 7, 2012, at 10:00 AM, Wojciech Puchar wrote:

 I would be willing to work on a SMP version of tar (initially just gzip or 
 something).
 
 I don't have the best experience in compression, and how to multi-thread it, 
 but I think I would be able to learn and help out.
 
 gzip cannot - it is single stream.

gunzip commutes with cat, so gzip
compression can be multi-threaded
by compressing separate blocks and
concatenating the result.

For proof, look at Mark Adler's pigz
program, which does exactly this.

GZip decompression is admittedly trickier.


 bzip2 - no idea

bzip2 is block oriented and can be multi-threaded for both compression and 
decompression.


Tim



Re: SMP Version of tar

2012-10-03 Thread John Nielsen
On Oct 2, 2012, at 12:36 AM, Yamagi Burmeister li...@yamagi.org wrote:

 On Mon, 1 Oct 2012 22:16:53 -0700
 Tim Kientzle t...@kientzle.com wrote:
 
 There are a few different parallel command-line compressors and 
 decompressors in ports; experiment a lot (with large files being read from 
 and/or written to disk) and see what the real effect is.  In particular, 
 some decompression algorithms are actually faster than memcpy() when run on 
 a single processor.  Parallelizing such algorithms is not likely to help 
 much in the real world.
 
 The two popular algorithms I would expect to benefit most are bzip2 
 compression and lzma compression (targeting xz or lzip format).  For 
 decompression, bzip2 is block-oriented so fits SMP pretty naturally.  Other 
 popular algorithms are stream-oriented and less amenable to parallelization.
 
 Take a careful look at pbzip2, which is a parallelized bzip2/bunzip2 
 implementation that's already under a BSD license.  You should be able to 
 get a lot of ideas about how to implement a parallel compression algorithm.  
 Better yet, you might be able to reuse a lot of the existing pbzip2 code.
 
 Mark Adler's pigz is also worth studying.  It's also license-friendly, and 
 is built on top of regular zlib, which is a nice technique when it's 
 feasible.
 
 Just a small note: There's a parallel implementation of xz called
 pixz. It's built atop liblzma and libarchive and is under a
 BSD-style license. See: https://github.com/vasi/pixz Maybe it's
 possible to reuse most of the code.


See also below, which has some bugfixes/improvements that AFAIK were never 
committed in the original project (though they were submitted).
https://github.com/jlrobins/pixz

JN



Re: SMP Version of tar

2012-10-03 Thread Richard Yao
On 10/02/2012 03:06 AM, Adrian Chadd wrote:
 .. please keep in mind that embedded platforms (a) don't necessarily
 benefit from it, and (b) have a very small footprint. Bloating out the
 compression/archival tools for the sake of possible SMP support will
 make me very, very sad.
 
 
 
 Adrian

Someone might want to ask if parallelizing tar is even possible. tar is
meant to serially write to tape. It should not be possible to
parallelize that operation.

I can imagine parallelizing compression algorithms, but I cannot imagine
parallelizing tar.





Re: SMP Version of tar

2012-10-03 Thread Tim Kientzle
 Someone might want to ask if parallelizing tar is even possible.

Answer:  Yes.  Here's a simple parallel version of tar:

   find . | cpio -o -H ustar | gzip > outfile.tgz

There are definitely other approaches.


Tim



Re: SMP Version of tar

2012-10-02 Thread Yamagi Burmeister
On Mon, 1 Oct 2012 22:16:53 -0700
Tim Kientzle t...@kientzle.com wrote:

 There are a few different parallel command-line compressors and decompressors 
 in ports; experiment a lot (with large files being read from and/or written 
 to disk) and see what the real effect is.  In particular, some decompression 
 algorithms are actually faster than memcpy() when run on a single processor.  
 Parallelizing such algorithms is not likely to help much in the real world.
 
 The two popular algorithms I would expect to benefit most are bzip2 
 compression and lzma compression (targeting xz or lzip format).  For 
 decompression, bzip2 is block-oriented so fits SMP pretty naturally.  Other 
 popular algorithms are stream-oriented and less amenable to parallelization.
 
 Take a careful look at pbzip2, which is a parallelized bzip2/bunzip2 
 implementation that's already under a BSD license.  You should be able to get 
 a lot of ideas about how to implement a parallel compression algorithm.  
 Better yet, you might be able to reuse a lot of the existing pbzip2 code.
 
 Mark Adler's pigz is also worth studying.  It's also license-friendly, and is 
 built on top of regular zlib, which is a nice technique when it's feasible.

Just a small note: There's a parallel implementation of xz called
pixz. It's built atop liblzma and libarchive and is under a
BSD-style license. See: https://github.com/vasi/pixz Maybe it's
possible to reuse most of the code.

-- 
Homepage:  www.yamagi.org
XMPP:  yam...@yamagi.org
GnuPG/GPG: 0xEFBCCBCB




Re: SMP Version of tar

2012-10-02 Thread Adrian Chadd
.. please keep in mind that embedded platforms (a) don't necessarily
benefit from it, and (b) have a very small footprint. Bloating out the
compression/archival tools for the sake of possible SMP support will
make me very, very sad.



Adrian


Re: SMP Version of tar

2012-10-02 Thread Brandon Falk
Don't worry. I'm well known to over-optimize for both size and speed. I 
have an old Pentium 3 800MHz single core that I can use to simulate an 
embedded device (well, a decently powered one), to verify that I'm not 
killing the single-core performance (I could add CPU capability 
detection to help with that). Also, I still need to find time to do all 
this research and development.


-Brandon

On 10/2/2012 3:06 AM, Adrian Chadd wrote:

.. please keep in mind that embedded platforms (a) don't necessarily
benefit from it, and (b) have a very small footprint. Bloating out the
compression/archival tools for the sake of possible SMP support will
make me very, very sad.



Adrian




Re: SMP Version of tar

2012-10-02 Thread Joerg Sonnenberger
On Mon, Oct 01, 2012 at 10:16:53PM -0700, Tim Kientzle wrote:
   * Implement within libarchive directly.  This would benefit tar and
 a handful of other programs that use libarchive, but may not be
 worth the complexity.

The complexity shouldn't actually be that bad. Basically, use a large
internal buffer in the compression filter and make the queueing for
distribution to threads an implementation detail. Alternatively, just
provide a compatible implementation of libz's event interface.

Joerg


Re: SMP Version of tar

2012-10-01 Thread Eitan Adler
On 1 October 2012 12:51, Brandon Falk bfalk_...@brandonfa.lk wrote:
 I would be willing to work on a SMP version of tar (initially just gzip or
 something).

 I don't have the best experience in compression, and how to multi-thread it,
 but I think I would be able to learn and help out.

 Note: I would like to make this for *BSD under the BSD license. I am aware
 that there are already tools to do this (under GPL), but I would really like
 to see this exist in the FreeBSD base.

 Anyone interested?

Tim Kientzle (cced) is the person to talk with. Try to keep questions
on the mailing list with him CCed so we all learn. :)



-- 
Eitan Adler


Re: SMP Version of tar

2012-10-01 Thread Tim Kientzle

On Oct 1, 2012, at 9:51 AM, Brandon Falk wrote:

 I would be willing to work on a SMP version of tar (initially just gzip or 
 something).
 
 I don't have the best experience in compression, and how to multi-thread it, 
 but I think I would be able to learn and help out.
 
 Note: I would like to make this for *BSD under the BSD license. I am aware 
 that there are already tools to do this (under GPL), but I would really like 
 to see this exist in the FreeBSD base.
 
 Anyone interested?

Great!

First rule:  be skeptical.  In particular, tar is so entirely disk-bound that 
many performance optimizations have no impact whatsoever.  If you don't do a 
lot of testing, you can end up wasting a lot of time.

There are a few different parallel command-line compressors and decompressors 
in ports; experiment a lot (with large files being read from and/or written to 
disk) and see what the real effect is.  In particular, some decompression 
algorithms are actually faster than memcpy() when run on a single processor.  
Parallelizing such algorithms is not likely to help much in the real world.

The two popular algorithms I would expect to benefit most are bzip2 compression 
and lzma compression (targeting xz or lzip format).  For decompression, bzip2 
is block-oriented so fits SMP pretty naturally.  Other popular algorithms are 
stream-oriented and less amenable to parallelization.

Take a careful look at pbzip2, which is a parallelized bzip2/bunzip2 
implementation that's already under a BSD license.  You should be able to get a 
lot of ideas about how to implement a parallel compression algorithm.  Better 
yet, you might be able to reuse a lot of the existing pbzip2 code.
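
As a rough illustration of the per-block idea (the general shape of what
pbzip2 does, not its actual code): each chunk is compressed independently
with libbz2's one-shot call, and the resulting streams are concatenated in
input order, which stock bunzip2 accepts.  The helper name below is invented.

/* Compress one chunk into its own bzip2 stream.  Returns a malloc'd
 * buffer and sets *outlen, or returns NULL on error.  Each call is
 * independent, so one such call can run per worker thread. */
#include <stdlib.h>
#include <bzlib.h>

unsigned char *
bzip2_block(const unsigned char *in, unsigned int inlen, unsigned int *outlen)
{
    /* libbz2 documents the worst case as roughly inlen * 1.01 + 600. */
    unsigned int cap = inlen + inlen / 100 + 600;
    unsigned char *out = malloc(cap);

    if (out == NULL)
        return NULL;
    *outlen = cap;
    if (BZ2_bzBuffToBuffCompress((char *)out, outlen, (char *)in, inlen,
        9 /* blockSize100k */, 0 /* verbosity */, 0 /* workFactor */) != BZ_OK) {
        free(out);
        return NULL;
    }
    return out;
}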

Mark Adler's pigz is also worth studying.  It's also license-friendly, and is 
built on top of regular zlib, which is a nice technique when it's feasible.

There are three fundamentally different implementation approaches with 
different complexity/performance issues:

  * Implement as a stand-alone executable similar to pbzip2.  This makes your 
code a lot simpler and makes it reasonably easy for people to reuse your work.  
This could work with tar, though it could be slightly slower than the 
in-process version due to the additional data-copying and process-switch 
overhead.

  * Implement within libarchive directly.  This would benefit tar and a handful 
of other programs that use libarchive, but may not be worth the complexity.

  * Implement as a standalone library with an interface similar to zlib or 
libbz2 or liblzma.

The last would be my personal preference, though it's probably the most complex 
of all.   That would easily support libarchive and you could create a simple 
stand-alone wrapper around it as well, giving you the best of all worlds.

If you could extend the pigz technique, you might be able to build a 
multi-threaded compression library where the actual compression was handled by 
an existing single-threaded library.  Since zlib, bzlib, and liblzma already 
have similar interfaces, your layer might require only a thin adapter to handle 
any of those three.  *That* would be very interesting, indeed.
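
To sketch what such a thin adapter could look like (every name here is
invented for illustration, not an existing API), the multi-threaded driver
would only need a small per-block codec interface, with one instance each
backed by zlib, libbz2 and liblzma:

/* Hypothetical per-block codec interface for a multi-threaded driver. */
#include <stddef.h>

struct block_codec {
    /* Worst-case compressed size for a block of inlen bytes. */
    size_t (*bound)(size_t inlen);
    /* Compress one block into a self-contained stream or member.
     * Returns the number of bytes written to out, or 0 on error. */
    size_t (*compress)(const unsigned char *in, size_t inlen,
        unsigned char *out, size_t outcap, int level);
};

/* The driver never touches the libraries directly: it splits the input,
 * hands (codec, block) pairs to worker threads, and concatenates the
 * outputs in input order.  Each instance would be a thin wrapper, e.g.
 * around deflate() with a gzip wrapper, BZ2_bzBuffToBuffCompress(), or
 * lzma_easy_buffer_encode(). */
extern const struct block_codec gzip_block_codec;
extern const struct block_codec bzip2_block_codec;
extern const struct block_codec xz_block_codec;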

Sounds like a fun project.  I wish I had time to work on something like this.

Cheers,

Tim
