New submission from Jussi Judin:
I created a tarball that triggers quite nasty behavior in
tarfile.TarFile.extract() and tarfile.TarFile.extractall() when the tarball
contains hard links that point to themselves, that is, to a file of the same
name that is already included in the tarball. In Python 2.7 this raises an
exception, and Python 3.4-3.6 extracts the same file from the tarball multiple
times.
First we create a tarball that causes this behavior:
$ mkdir -p tardata/1/2/3/4/5/6/7/8/9
$ dd if=/dev/zero of=tardata/1/2/3/4/5/6/7/8/9/zeros.data bs=1000000 count=500
# tar by default adds all directories recursively multiple times to the
# archive, but duplicates are created as hard links:
$ find tardata | xargs tar cvfz tardata.tar.gz
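For a quick look at the archive layout without writing 500 MB to disk, a small archive of the same shape can be built with the tarfile module itself. This is only a sketch: the helper name build_self_link_tar and its parameters are hypothetical, and it assumes that the duplicate entries written by tar are hard-link members whose link name equals their own name, as described above.

```python
import io
import tarfile


def build_self_link_tar(name="tardata/zeros.data", payload=b"\0" * 1024, repeats=3):
    """Return the bytes of a tar archive holding one regular file followed by
    `repeats` hard-link entries that point at that same name."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        # The real file data is stored once.
        info = tarfile.TarInfo(name)
        info.size = len(payload)
        tar.addfile(info, io.BytesIO(payload))
        # Each duplicate is a header-only hard-link member with name == linkname.
        for _ in range(repeats):
            link = tarfile.TarInfo(name)
            link.type = tarfile.LNKTYPE
            link.linkname = name
            tar.addfile(link)
    return buf.getvalue()
```

Listing the members of the resulting archive with TarFile.getmembers() shows the regular file once and then the self-referencing links.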
Then let's extract the tarball with the tarfile module. The following commands
demonstrate what happens with the attached tartest.py file:
$ python2.7.13 tartest.py noskip tardata.tar.gz /tmp/tardata-python-2.7.13
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data
Traceback (most recent call last):
  File "tartest.py", line 17, in <module>
    unarchive(skip, archive, dest)
  File "tartest.py", line 12, in unarchive
    tar_fd.extract(info, dest)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2118, in extract
    self._extract_member(tarinfo, os.path.join(path, tarinfo.name))
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2202, in _extract_member
    self.makelink(tarinfo, targetpath)
  File "python/2.7.13/lib/python2.7/tarfile.py", line 2286, in makelink
    os.link(tarinfo._link_target, targetpath)
OSError: [Errno 2] No such file or directory
And with Python 3.6.0 (and the earlier Python 3 releases that I have tested):
$ time python3.6.0 tartest.py noskip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted 11 times
...
real 0m42.747s
user 0m17.564s
sys 0m6.144s
If we then make the script skip extraction of hard links that point to
themselves:
$ time python3.6.0 tartest.py skip tardata.tar.gz /tmp/tardata-python-3.6.0
...
tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- this is extracted once
...
Skipping tardata/1/2/3/4/5/6/7/8/9/zeros.data <-- skipped hard links 10 times
...
real 0m2.688s
user 0m1.816s
sys 0m0.532s
From the user CPU time it's obvious that a lot of unneeded decompression
happens in the Python 3.6 noskip run compared to the skip run. If I use
TarFile.extractall(), it behaves the same way as calling TarFile.extract()
individually on each TarInfo object. GNU tar seems to skip over the extraction
of the actual file data when it encounters this situation.
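The skip mode can be sketched roughly as below. This is not the attached tartest.py, only an illustration of the idea: unarchive(skip, archive, dest) and tar_fd.extract(info, dest) follow the names visible in the traceback, while the is_self_link helper is hypothetical and assumes a self-referencing hard link is a member whose link name equals its own name.

```python
import tarfile


def is_self_link(info):
    """True for a hard-link member whose target is its own name (assumed test)."""
    return info.islnk() and info.linkname == info.name


def unarchive(skip, archive, dest):
    """Extract every member of `archive` into `dest`, optionally skipping
    hard links that point to themselves."""
    with tarfile.open(archive) as tar_fd:
        for info in tar_fd:
            if skip and is_self_link(info):
                # The file data was already extracted from the first entry.
                print("Skipping", info.name)
                continue
            print(info.name)
            tar_fd.extract(info, dest)
```

With skip enabled, the file data is read from the archive only once, which matches the difference in user CPU time shown above.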
----------
components: Library (Lib)
files: tartest.py
messages: 288284
nosy: Jussi Judin
priority: normal
severity: normal
status: open
title: TarFile.extract() suffers from hard links inside tarball
type: behavior
versions: Python 2.7, Python 3.4, Python 3.5, Python 3.6
Added file: http://bugs.python.org/file46658/tartest.py
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue29612>
_______________________________________