Add sparse file support to cpio [PSARC/2008/727 Self Review]

Joerg Schilling Mon, 24 Nov 2008 16:17:06 +0100

Don Cragun <don.cragun at sun.com> wrote:

> >>    PSARC case 2006/331 (Add holey file support to pax) created
> >
> >I never saw such a case, and the Sun pax man page includes neither
> >"hole" nor "sparse".
>
> The case was approved before PSARC cases were handled in cases open to
> people who are not Sun employees.  But, all of the important
> information is included on Sun's current pax(1) man page.  The pax
> utility added an extended header to ustar and pax archive format
> archives as described in the USAGE section, where it says:


I cannot speak for _very_ recent Solaris versions, but for a case from 2006
I would expect to see results in the latest build I checked (Build 89).

Build 89 comes with a pax man page from 2004.

> and in the EXTENDED DESCRIPTION, where it says:
>
>     "SUN.holesdata    A Solaris extension to pax extended  header
>                       keywords. Specifies the data and hole pairs
>                       for a sparse file.
>
>                      "In write or copy modes and when the  xustar
>                       or pax format (see -x format) is specified,
>                       pax  includes  a   SUN.holesdata   extended
>                       header record if the underlying file system
>                       supports the detection of files with  holes
>                       (see  fpathconf(2))  and reports that there
>                       is at least one  hole  in  the  file  being
>                       archived.  value  consists  of  two or more
>                       consecutive entries of the following form:
>
>                         SPACEdata_offsetSPACEhole_offset
>
>
>                      "where the data and  hole  offsets  are  the
>                       long  values  returned by passing SEEK_DATA
>                       and SEEK_HOLE  to  lseek(2),  respectively.
>                       For  example,  the  following  entry  is an
>                       example of the SUN.holesdata entry  in  the
>                       extended   header  for  a  file  with  data
>                       offsets at bytes 0, 24576, and  49152,  and
>                       hole  offsets  at  bytes  8192,  32768, and
>                       49159: 49 SUN.holesdata= 0 8192 24576 32768
>                       49152 49159:
>
>                         49 SUN.holesdata= 0 8192 24576 32768 49152 49159

It looks like this introduces the same problem as the cpio case.
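
To make the quoted record layout concrete, here is a minimal sketch of how a reader could decode the example record from the man page text above. It assumes the standard pax extended header record form "LEN KEY=VALUE\n", where LEN counts the whole record including its own digits and the trailing newline. The function names are made up for illustration and are not taken from any existing archiver.

```python
def parse_pax_record(record: bytes):
    """Split one pax extended header record "LEN KEY=VALUE\\n" into (key, value)."""
    length_text, _, rest = record.partition(b" ")
    length = int(length_text)
    # The length field counts the whole record: digits, space, key, "=",
    # value and the trailing newline.
    if length != len(record) or not record.endswith(b"\n"):
        raise ValueError("record length does not match its length field")
    key, _, value = rest[:-1].partition(b"=")
    return key.decode("ascii"), value.decode("ascii")


def parse_holesdata(value: str):
    """Turn " d0 h0 d1 h1 ..." into a list of (data_offset, hole_offset) pairs."""
    numbers = [int(token) for token in value.split()]
    if not numbers or len(numbers) % 2:
        raise ValueError("expected data/hole offset pairs")
    return list(zip(numbers[0::2], numbers[1::2]))


# The example record quoted from the man page, with its newline terminator.
record = b"49 SUN.holesdata= 0 8192 24576 32768 49152 49159\n"
key, value = parse_pax_record(record)
pairs = parse_holesdata(value)
```

With the quoted example, the length field of 49 matches the record only when the trailing newline is counted, and the value decodes to three data/hole pairs.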


> >How do you intend to switch between the sparse support mode and the
> >non-sparse mode in "copy mode"?
>
> There is no switch in copy mode.  If the source filesystem reports
> holes in a file, the holes will be duplicated in the destination file
> as long as the destination file is seekable.

This should be documented as a deficiency in the man page.
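
For reference, the copy mode behavior described above can be sketched roughly as follows. This is only an illustration of the SEEK_DATA/SEEK_HOLE technique, not the pax implementation. It assumes a platform where Python exposes os.SEEK_DATA and os.SEEK_HOLE, and the function names are hypothetical.

```python
import errno
import os


def data_hole_pairs(fd: int):
    """Return [(data_offset, hole_offset), ...] for an open file descriptor.

    Note: this moves the file position of fd as a side effect.
    """
    pairs = []
    size = os.fstat(fd).st_size
    offset = 0
    while offset < size:
        try:
            data = os.lseek(fd, offset, os.SEEK_DATA)
        except OSError as exc:
            if exc.errno == errno.ENXIO:  # no more data: the rest is a hole
                break
            raise
        # End of file counts as a hole, so this always succeeds inside data.
        hole = os.lseek(fd, data, os.SEEK_HOLE)
        pairs.append((data, hole))
        offset = hole
    return pairs


def copy_with_holes(src_fd: int, dst_fd: int, chunk: int = 1 << 16):
    """Copy only the data regions; skipped ranges stay holes in a seekable target."""
    for data, hole in data_hole_pairs(src_fd):
        pos = data
        while pos < hole:
            buf = os.pread(src_fd, min(chunk, hole - pos), pos)
            if not buf:  # source shrank while copying
                break
            # A short write is handled by re-reading from the new position.
            pos += os.pwrite(dst_fd, buf, pos)
    # Recreate a trailing hole (if any) by extending the target to full size.
    os.ftruncate(dst_fd, os.fstat(src_fd).st_size)
```

On a file system that does not report holes, the whole file shows up as a single data region, so the copy degrades to a plain copy. There is no place in this flow where a user could ask for the non-sparse behavior, which is the point raised above.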


> >>    In copy out mode (-o) the following new option arguments to the
> >>    cpio -H option will be added to provide sparse file support:
> >>            ascii_sparse    - assumes -c is specified.  Only available
> >>                            in copy out (-o) mode.
> >>            odc_sparse      - assumes -H odc is specified.  Only available
> >>                            in copy out (-o) mode.
> >
> >Adding sparse file support does not introduce a new archive format unless you
> >create a new archive format that may be detected by reading the first archive
> >header from a random archive.
>
> Correct.  When using -H ascii_sparse and -H odc_sparse, cpio uses ascii
> and odc format archives, respectively; but it uses a different file
> type when adding a sparse file to the archive.  If an archiver
> understands cpio ascii and odc format archives, it will understand the
> archives.  If an archiver doesn't recognize the extended file types,
> the standards require that it extract the file data as a regular file
> (which will contain the data needed to recreate the file contents with
> holes in the proper positions).

Does the code set bit 17 and clear the file type bits, or does it set bit 17
in addition to them?

> >
> >If you want to avoid introducing a new option, you would need to document
> >this as a dirty hack. BTW: Where is the new man page?
>
> Quoting from the references section of this case:
>     5.4 PSARC/2008/727/materials/cpio.1: Updated cpio.1 man page

OK, but I see no description of the archive format in this man page.


> >>    The following will apply when either '-H ascii_sparse' or
> >>    '-H odc_sparse' is specified with -o: 
> >>            - The c_mode field in the archive header will
> >>              indicate that the file is a sparse file. In the old
> >>              stat structure, the mode field is an unsigned short
> >>              (16 bit) field.  To avoid conflicts with other file
> >>              types, a high order bit (17) in the c_mode field of
> >>              the header will be set.
> >
> >This is beyond the cpio specs. How do you plan to mark the archives
> >as "Sun cpio" specific, so that incorrect behavior can be avoided for
> >non-Sun archives?
>
> It is indicated by the file type.

Because the archive is not marked, archivers that carefully implement add-on
features depending on the archive format will not unpack the sparse files.

star and AT&T pax will ignore bit 17; other archivers may include this
bit in the file type, with unknown results.

Vendor-unique extensions that do not use explicit vendor-specific tags
are something we had in the 1980s.
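
The difference between the two kinds of readers can be illustrated with a small sketch. It assumes that "bit 17" means the first bit above the 16-bit mode, i.e. 1 << 16. The proposal text does not say whether bits are counted from 0 or from 1, so SPARSE_FLAG below is a hypothetical value. The two reader functions are likewise made up for illustration.

```python
import stat

# ASSUMPTION: "bit 17" is read as the 17th bit, the first one above the
# 16-bit mode. This value is hypothetical, not taken from the proposal.
SPARSE_FLAG = 1 << 16  # 0o200000


def file_type_masked(c_mode: int) -> int:
    """Reader that only looks at the S_IFMT bits (0o170000)."""
    return stat.S_IFMT(c_mode)


def file_type_unmasked(c_mode: int) -> int:
    """Reader that treats everything above the permission bits as the type."""
    return c_mode & ~0o7777


plain = stat.S_IFREG | 0o644
flagged = plain | SPARSE_FLAG
# The odc header stores c_mode as six octal digits, so the extra bit fits.
header_field = "%06o" % flagged
```

A reader of the first kind still sees a regular file and silently extracts the prefixed data as file content. A reader of the second kind sees a file type it has never heard of. Neither can tell from the archive header that the entry came from Sun cpio.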


> >>            - A string of the following format will be prepended to
> >>              the compressed file data:
> >>                    "%lu %llu%s", prepended_info_size,
> >>                            expanded_file_size, data/hole_offsets
> >
> >Is this data _inside_ the file data area or is it in conflict with the 
> >cpio extensions from David Korn and Glenn Fowler?
>
> It is inside the file data area as indicated above.  (The file size
> field is the size of this header plus the size of the file contents
> after removing the holes.)

OK; how about marking the archive in the header area past the filename?



> >>            where data/hole_offsets contains 2 or more entries of the
> >>            following format:
> >>                    " %llu %llu", data_offset, hole_offset
> >
> >If you ever want to debug this, I would recommend using:
> >
> >                     " %llu,%llu", data_offset, hole_offset
> >
> >to make the data parsable by human eyes.
>
> Maybe to European human eyes.  In the U.S., some possible data offset,
> hole offset pairs could look like a single number with the "," being
> a thousands separator instead of a pair separator.  Besides that, it
> matches the string given as the data in a ustar/pax SUN.holesdata
> extended header.

It seems that you are too US-centric and thus do not see the problem of
being unable to pick out number pairs in a possibly extremely long data
stream.
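
To show what a reader has to do with the proposed in-data prefix, here is a rough sketch of building and parsing the "%lu %llu%s" string. One important point is not stated in the proposal and is ASSUMED here: that prepended_info_size counts the bytes of the whole prefix, its own digits included. The function names are hypothetical as well.

```python
def build_prefix(expanded_size: int, pairs) -> bytes:
    """Build "<prefix_size> <expanded_size> d0 h0 d1 h1 ..." as ASCII bytes."""
    tail = " %d" % expanded_size + "".join(" %d %d" % p for p in pairs)
    # ASSUMPTION: the size field counts the whole prefix, including its own
    # digits, so the value has to be found by iterating to a fixed point.
    size = len(tail) + 1
    while len(str(size)) + len(tail) != size:
        size = len(str(size)) + len(tail)
    return (str(size) + tail).encode("ascii")


def parse_prefix(file_data: bytes):
    """Return (prefix_size, expanded_size, pairs) from the start of the file data."""
    size = int(file_data.partition(b" ")[0])
    # Only the first `size` bytes belong to the prefix; the compressed file
    # data that follows may itself start with digits.
    fields = file_data[:size].split()
    expanded_size = int(fields[1])
    offsets = [int(field) for field in fields[2:]]
    pairs = list(zip(offsets[0::2], offsets[1::2]))
    return size, expanded_size, pairs


pairs = [(0, 8192), (24576, 32768), (49152, 49159)]
prefix = build_prefix(49159, pairs)
size, expanded_size, parsed = parse_prefix(prefix + b"7 payload")
```

Without the size field a reader could not tell where the offset list ends and the file data begins, since both may consist of digits and spaces. The same sketch works unchanged in structure if the pair separator were a comma; only the format string and the split would differ.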


> >But why don't you follow existing other implementations that use 
> >offset/numbytes pairs for data chunks? This results in a lower archive size.
>
> I'm not going to argue decisions that were agreed upon for PSARC
> 2006/361.  But, it follows naturally from the data provided by the
> lseek(2) SEEK_HOLE and SEEK_DATA operations.

I offered my help, specifically for tar/cpio archive format questions,
even before this case was approved. So why didn't you ask me then?
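
On the size question, a quick sketch with the example offsets from the man page shows the effect. It converts data/hole offset pairs into offset/numbytes pairs and compares the encoded lengths, using the same " %llu %llu" style of encoding for both. The helper names are made up, and this says nothing about the exact on-disk format of other implementations.

```python
def to_offset_length(pairs):
    """Convert (data_offset, hole_offset) pairs to (data_offset, numbytes)."""
    return [(data, hole - data) for data, hole in pairs]


def encode(pairs) -> str:
    """Encode pairs in the " %llu %llu" style used in the proposal."""
    return "".join(" %d %d" % pair for pair in pairs)


pairs = [(0, 8192), (24576, 32768), (49152, 49159)]
as_holes = encode(pairs)
as_lengths = encode(to_offset_length(pairs))
```

For this small example the hole offset form needs 31 bytes and the length form 26 bytes. The lengths are bounded by the size of a data chunk, while hole offsets grow with the position in the file.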

Jörg

-- 
 EMail:joerg at schily.isdn.cs.tu-berlin.de (home) Jörg Schilling D-13353 Berlin
       js at cs.tu-berlin.de                (uni)  
       schilling at fokus.fraunhofer.de     (work) Blog: http://schily.blogspot.com/
 URL:  http://cdrecord.berlios.de/private/ ftp://ftp.berlios.de/pub/schily
