Bug#494169: [Fwd: FW: Bug#494169: libarchive-dev: Please add a way to precompute (non-compressed) archive size]
Thibaut,

John Goerzen forwarded your idea to me. You can actually implement this on top of the current libarchive code quite efficiently: use the low-level archive_write_open() call and provide your own callbacks that just count the write requests. Then go through and write the archive as usual, except skip the write_data() part (for tar and cpio formats, libarchive will automatically pad the entry with NUL bytes).

This may sound slow, but it's really not. One of the libarchive unit tests uses this approach to write 1TB archives in just a couple of seconds. (This test checks libarchive's handling of very large archives with very large entries.) Look at test_tar_large.c for the details of how this particular test works. (test_tar_large.c actually does more than just count the data, but it should give you the general idea.)

This will work very well with all of the tar and cpio formats. It won't work well with some other formats where the length does actually depend on the data.

Cheers,

Tim Kientzle

----- Original Message -----
Date: Thu, 7 Aug 2008 21:31:27 -0500
From: John Goerzen [EMAIL PROTECTED]
To: [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED], [EMAIL PROTECTED]
Subject: FW: Bug#494169: libarchive-dev: Please add a way to precompute (non-compressed) archive size

Hi Tim,

We received the below feature request at Debian. Not sure if it is something you would be interested in implementing, but thought I'd pass it along.
-- John

----- Forwarded message from Thibaut VARENE [EMAIL PROTECTED] -----
From: Thibaut VARENE [EMAIL PROTECTED]
Date: Thu, 07 Aug 2008 17:37:10 +0200
Reply-To: Thibaut VARENE [EMAIL PROTECTED], [EMAIL PROTECTED]
To: Debian Bug Tracking System [EMAIL PROTECTED]
Subject: Bug#494169: libarchive-dev: Please add a way to precompute (non-compressed) archive size

Package: libarchive-dev
Severity: wishlist

Hi,

I thought I had already reported this, but apparently I didn't, so here's the idea: I'm the author of mod_musicindex, in which I use libarchive to send on-the-fly tar archives to remote clients. Right now, the remote client's browser cannot display any ETA / %complete for the current download, since I cannot tell beforehand what the exact size of the archive I'm sending them will be.

It would be very nice if there were some API allowing for the precomputation of the final size of a non-compressed archive, so I could do something like:

    archive_size = archive_size_header(a);
    for (filename in file list) {
        archive_size += archive_size_addfile(filename);
        /* or using stat() and e.g. archive_size_addstat() */
    }
    archive_size += archive_size_footer(a);

(brainfart pseudo code, I hope you get the idea) so that in the end archive_size will be exactly the size of the output archive (header/padding included), without having to actually read files or write the archive itself. I could thus send the remote client the actual size of the data they're going to be sent beforehand. The trick is, this size cannot be approximate: the browser will cut the transfer, even if I'm still sending data, once it has received as many bytes as it was told to expect.

I'm under the impression that since this is about a non-compressed archive, and considering the structure of a tar archive, my goal should be feasible without even having to read any input file. Am I wrong?
Hope I'm quite clear, thanks for your help

T-Bone

-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: hppa (parisc64)
Kernel: Linux 2.6.22.14 (SMP w/4 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /bin/bash

----- End forwarded message -----

--
To UNSUBSCRIBE, email to [EMAIL PROTECTED] with a subject of "unsubscribe". Trouble? Contact [EMAIL PROTECTED]
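Tim's counting-callback approach above is short enough to sketch in full. This is an illustrative sketch, not code from the thread: the counter struct, the callback names, and precompute_size() are invented for the example, and archive_write_free() is the modern spelling of what older libarchive releases call archive_write_finish().

```c
#include <archive.h>
#include <archive_entry.h>
#include <stdint.h>
#include <sys/types.h>

struct counter { uint64_t bytes; };

static int count_open(struct archive *a, void *client)
{ (void)a; (void)client; return ARCHIVE_OK; }

static int count_close(struct archive *a, void *client)
{ (void)a; (void)client; return ARCHIVE_OK; }

/* Pretend the write succeeded, but only tally the lengths. */
static ssize_t count_write(struct archive *a, void *client,
                           const void *buf, size_t len)
{
    (void)a; (void)buf;
    ((struct counter *)client)->bytes += len;
    return (ssize_t)len;
}

/* Exact size of the ustar archive that would hold the given entries,
 * computed without reading or writing any file data.  Each entry only
 * needs its metadata (pathname, size, filetype) filled in. */
uint64_t precompute_size(struct archive_entry **entries, size_t n)
{
    struct counter c = { 0 };
    struct archive *a = archive_write_new();

    archive_write_set_format_ustar(a);
    archive_write_set_bytes_per_block(a, 512); /* avoid 10240-byte trailing pad */
    archive_write_open(a, &c, count_open, count_write, count_close);

    for (size_t i = 0; i < n; i++)
        archive_write_header(a, entries[i]); /* note: no archive_write_data() */

    archive_write_close(a);  /* c.bytes now holds the final archive size */
    archive_write_free(a);
    return c.bytes;
}
```

Because write_data() is skipped, libarchive pads each tar entry with NUL bytes, so the count matches what a real write of the same entries would produce byte for byte.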
Bug#494169: [Fwd: FW: Bug#494169: libarchive-dev: Please add a way to precompute (non-compressed) archive size]
On Fri, Aug 8, 2008 at 8:42 AM, Tim Kientzle [EMAIL PROTECTED] wrote:

> Thibaut, John Goerzen forwarded your idea to me. You can actually
> implement this on top of the current libarchive code quite efficiently.
> Use the low-level archive_write_open() call and provide your own
> callbacks that just count the write requests. Then go through and write
> the archive as usual, except skip the write_data() part (for tar and
> cpio formats, libarchive will automatically pad the entry with NUL
> bytes).

Hum, I'm not quite sure I get this right... By "count the write requests" and "skip the write_data() part", do you mean count the number of bytes that should have been written, without writing them?

> This may sound slow, but it's really not. One of the libarchive unit
> tests uses this approach to write 1TB archives in just a couple of
> seconds. (This test checks libarchive's handling of very large archives
> with very large entries.) Look at test_tar_large.c for the details of
> how this particular test works. (test_tar_large.c actually does more
> than just count the data, but it should give you the general idea.)

I will have to look into that code indeed. If I get this right though, you're basically suggesting that I read the input files twice: once without writing the data, and a second time writing it? Arguably the second read would come from the VFS cache, but that's only assuming the server isn't too busy serving hundreds of other files, which is why I'm a bit concerned about optimality...

My limited understanding of the tar format made me believe that it was possible to know the space taken by a given file in a tar archive just by looking at its size and adding the necessary padding bytes. Was I wrong?

For reference, here's the (relatively short) code I use:
http://www.parisc-linux.org/~varenet/musicindex/doc/html/output-tarball_8c-source.html

> This will work very well with all of the tar and cpio formats.
> It won't work well with some other formats where the length does
> actually depend on the data.

Yep, that was quite clear indeed ;) Thanks for your input!

--
Thibaut VARENE
http://www.parisc-linux.org/~varenet/
Bug#494169: [Fwd: FW: Bug#494169: libarchive-dev: Please add a way to precompute (non-compressed) archive size]
Thibaut VARENE wrote:
> On Fri, Aug 8, 2008 at 8:42 AM, Tim Kientzle [EMAIL PROTECTED] wrote:
>> Thibaut, John Goerzen forwarded your idea to me. You can actually
>> implement this on top of the current libarchive code quite
>> efficiently. Use the low-level archive_write_open() call and provide
>> your own callbacks that just count the write requests. Then go through
>> and write the archive as usual, except skip the write_data() part (for
>> tar and cpio formats, libarchive will automatically pad the entry with
>> NUL bytes).
>
> Hum, I'm not quite sure I get this right... By "count the write
> requests" and "skip the write_data() part", do you mean count the
> number of bytes that should have been written, without writing them?

Yes.

>> This may sound slow, but it's really not. One of the libarchive unit
>> tests uses this approach to write 1TB archives in just a couple of
>> seconds. (This test checks libarchive's handling of very large
>> archives with very large entries.) Look at test_tar_large.c for the
>> details of how this particular test works. (test_tar_large.c actually
>> does more than just count the data, but it should give you the general
>> idea.)
>
> I will have to look into that code indeed. If I get this right though,
> you're basically suggesting that I read the input files twice: once
> without writing the data, and a second time writing it?

No. I'm suggesting you use three passes:

1) Get the information for all of the files and create archive_entry
   objects.

2) Create a fake archive using the technique above. You don't need to
   read the file data here! After you call archive_write_close(), you'll
   know the size of the complete archive. (This is really just your
   original idea.)

3) Write the real archive as usual, including reading the actual file
   data and writing it to the archive.

> Arguably the second read would come from the VFS cache, but that's only
> assuming the server isn't too busy serving hundreds of other files,
> which is why I'm a bit concerned about optimality...
> My limited understanding of the tar format made me believe that it was
> possible to know the space taken by a given file in a tar archive just
> by looking at its size and adding the necessary padding bytes. Was I
> wrong?

You could make this work. If you're using plain ustar (no tar extensions!), then each file has its data padded to a multiple of 512 bytes, and there is a 512-byte header for each file. Then you need to round the total result up to a multiple of the block size. (The default is 10240 bytes; you probably should set the block size to 512 bytes.)

> For reference, here's the (relatively short) code I use:
> http://www.parisc-linux.org/~varenet/musicindex/doc/html/output-tarball_8c-source.html
>
>> This will work very well with all of the tar and cpio formats. It
>> won't work well with some other formats where the length does actually
>> depend on the data.
>
> Yep, that was quite clear indeed ;) Thanks for your input!
Bug#494169: libarchive-dev: Please add a way to precompute (non-compressed) archive size
Package: libarchive-dev
Severity: wishlist

Hi,

I thought I had already reported this, but apparently I didn't, so here's the idea: I'm the author of mod_musicindex, in which I use libarchive to send on-the-fly tar archives to remote clients. Right now, the remote client's browser cannot display any ETA / %complete for the current download, since I cannot tell beforehand what the exact size of the archive I'm sending them will be.

It would be very nice if there were some API allowing for the precomputation of the final size of a non-compressed archive, so I could do something like:

    archive_size = archive_size_header(a);
    for (filename in file list) {
        archive_size += archive_size_addfile(filename);
        /* or using stat() and e.g. archive_size_addstat() */
    }
    archive_size += archive_size_footer(a);

(brainfart pseudo code, I hope you get the idea) so that in the end archive_size will be exactly the size of the output archive (header/padding included), without having to actually read files or write the archive itself. I could thus send the remote client the actual size of the data they're going to be sent beforehand. The trick is, this size cannot be approximate: the browser will cut the transfer, even if I'm still sending data, once it has received as many bytes as it was told to expect.

I'm under the impression that since this is about a non-compressed archive, and considering the structure of a tar archive, my goal should be feasible without even having to read any input file. Am I wrong?

Hope I'm quite clear, thanks for your help

T-Bone

-- System Information:
Debian Release: lenny/sid
  APT prefers unstable
  APT policy: (500, 'unstable')
Architecture: hppa (parisc64)
Kernel: Linux 2.6.22.14 (SMP w/4 CPU cores)
Locale: LANG=C, LC_CTYPE=C (charmap=ANSI_X3.4-1968)
Shell: /bin/sh linked to /bin/bash
Bug#494169: libarchive-dev: Please add a way to precompute (non-compressed) archive size
Thibaut VARENE wrote:
> Package: libarchive-dev
> Severity: wishlist
>
> Hi,
>
> I thought I already reported this, but apparently I didn't so here's
> the idea: I'm the author of mod_musicindex, in which I use libarchive
> to send on-the-fly tar archives to remote clients.

Hi Thibaut,

This is really an upstream question, but I would suspect that Tim would say it is out of scope of what libarchive is all about. Could you ask him directly, or do you want me to forward this to him?

> Right now, the remote client's browser cannot display any ETA /
> %complete for the current download since I cannot tell beforehand what
> will be the exact size of the archive I'm sending them. [...]