New submission from Valentin Samir <valentin.sa...@crans.org>:

When tarfile opens a tar archive containing a sparse file whose actual data is 
larger than 0o77777777777 bytes (~8GB), it fails to list the members that 
follow this file. As a result, it is impossible to access or extract files 
located after such a file inside the archive.

A tar file exhibiting the issue is available at https://genua.fr/sample.tar.xz
Uncompressed, the file is ~16GB.
It contains two files:
* disk.img, a 50GB sparse file containing ~16GB of data
* README.txt, a simple text file containing "This last file is not properly 
listed"

disk.img was generated using the following Python script:

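GB = 1024**3
buf = b"\xFF" * 1024**2  # 1 MiB of non-zero data
with open('disk.img', 'wb') as f:
    f.seek(10 * GB)  # leave a 10GB hole at the start of the file
    written = 0
    # write data until it exceeds the largest size a ustar header can hold
    while written < 0o77777777777:
        written += f.write(buf)
        f.flush()
        print(written/0o77777777777 * 100, '%')
    # extend the file to 50GB, leaving a second hole before the last byte
    f.seek(50 * GB - 1)
    f.write(b'\0')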

sample.tar was generated using GNU tar 1.30 on Debian 10 with the following 
command:

tar --format pax -cvSf sample.tar disk.img README.txt

The following script exposes the issue:

import tarfile
t = tarfile.open('sample.tar')
print('members', t.getmembers())
print('offset', t.offset)

Its output is:

members [<TarInfo 'disk.img' at 0x7f5b14242b38>]
offset 17179806208

members should also list README.txt.


I think I have found the root cause of the bug: because the file is bigger than 
0o77777777777, its size cannot be specified inside the tar ustar header, so a 
"size" pax extended header is generated. This header contains the full size of 
the file block in the tar.
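
To make the limit concrete, here is a minimal sketch of the arithmetic, 
assuming the usual ustar layout in which the size field holds 11 octal digits. 
The names USTAR_SIZE_DIGITS, USTAR_MAX_SIZE and needs_pax_size are illustrative 
only and are not part of tarfile.

```python
# Assumption: the ustar "size" field stores at most 11 octal digits.
USTAR_SIZE_DIGITS = 11
USTAR_MAX_SIZE = 8 ** USTAR_SIZE_DIGITS - 1  # largest representable size


def needs_pax_size(size):
    """Return True if `size` does not fit in the octal ustar size field."""
    return size > USTAR_MAX_SIZE


print(oct(USTAR_MAX_SIZE))              # the limit quoted in this report
print(needs_pax_size(8 * 1024**3 - 1))  # one byte below 8 GiB still fits
print(needs_pax_size(8 * 1024**3))      # 8 GiB needs a pax "size" record
```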

Because the file is sparse, and per sparse format 1.0, the file block first 
contains a sparse map, then the file data. The size of this block is therefore 
the size of the map plus the size of the data.
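
The layout described above can be sketched as follows. This is a simplified 
illustration, not the tarfile code: it assumes the map is a series of 
newline-terminated decimal numbers (an entry count, then offset/length pairs) 
padded with NULs to a 512-byte boundary. read_sparse_map is a hypothetical 
helper name.

```python
import io

BLOCKSIZE = 512


def read_sparse_map(fileobj):
    """Parse a sparse 1.0 style map located at the start of a data block.

    Returns (entries, map_size): entries is a list of (offset, numbytes)
    pairs and map_size is the number of bytes consumed by the padded map.
    """
    start = fileobj.tell()
    buf = fileobj.read(BLOCKSIZE)
    field, buf = buf.split(b"\n", 1)
    remaining = int(field) * 2  # two numbers per map entry
    numbers = []
    while remaining > 0:
        if b"\n" not in buf:
            # the next number continues in the following block
            buf += fileobj.read(BLOCKSIZE)
        field, buf = buf.split(b"\n", 1)
        numbers.append(int(field))
        remaining -= 1
    map_size = fileobj.tell() - start
    entries = list(zip(numbers[::2], numbers[1::2]))
    return entries, map_size


# Made-up example: two data regions, map padded to one 512-byte block.
raw_map = b"2\n0\n1024\n4096\n512\n".ljust(BLOCKSIZE, b"\0")
entries, map_size = read_sparse_map(io.BytesIO(raw_map))
stored_size = map_size + sum(numbytes for _, numbytes in entries)
print(entries, map_size, stored_size)
```

With these made-up values the stored block size is the map size plus the sum 
of the region lengths, which is the quantity the pax "size" record describes.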

Because the file is sparse, a GNU.sparse.realsize header is also added 
containing the full expanded file size (here 50GB).

Here 
https://github.com/python/cpython/blob/4dee92b0ad9f4e3ea2fbbbb5253340801bb92dc7/Lib/tarfile.py#L1350
 tarfile sets the tarinfo size to GNU.sparse.realsize (50GB). Then, in this 
block 
https://github.com/python/cpython/blob/4dee92b0ad9f4e3ea2fbbbb5253340801bb92dc7/Lib/tarfile.py#L1297
 the file offset is moved forward by GNU.sparse.realsize (50GB) instead of 
pax_headers["size"]. Moreover, the move starts from next.offset_data, which is 
set at https://github.com/python/cpython/blob/master/Lib/tarfile.py#L1338 to 
the position just after the sparse map.
For a sparse file, the move forward should instead start from next.offset + 
BLOCKSIZE.
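
The difference between the two computations can be sketched with simple 
arithmetic. This is an illustration of the reasoning in this report, not the 
tarfile source; the helper names and the example offsets are hypothetical.

```python
BLOCKSIZE = 512


def round_up(n, block=BLOCKSIZE):
    """Round n up to the next multiple of block."""
    return -(-n // block) * block


def next_header_expected(header_offset, stored_size):
    """Skip the member header, then the stored block (map + data)."""
    return header_offset + BLOCKSIZE + round_up(stored_size)


def next_header_as_reported(data_offset, real_size):
    """Computation described in this report: start after the sparse map
    and skip the expanded (real) size instead of the stored size."""
    return data_offset + round_up(real_size)


# Hypothetical values: member header at 1024, a 512-byte sparse map,
# 8 GiB of stored data, and a 50 GiB expanded size.
GB = 1024**3
header_offset = 1024
map_size = 512
stored_size = map_size + 8 * GB            # value of the pax "size" record
real_size = 50 * GB                        # value of GNU.sparse.realsize
data_offset = header_offset + BLOCKSIZE + map_size

print(next_header_expected(header_offset, stored_size))
print(next_header_as_reported(data_offset, real_size))
```

With these values the second result lies far beyond the end of the stored 
block, which would be consistent with README.txt never being found.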

----------
components: Library (Lib)
messages: 362275
nosy: Nit
priority: normal
severity: normal
status: open
title: tarfile: GNU sparse 1.0 pax tar header offset not properly computed
type: behavior
versions: Python 2.7, Python 3.5, Python 3.6, Python 3.7, Python 3.8, Python 3.9

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39688>
_______________________________________