David Wilson added the comment:
Per my comment on issue16569, the overhead of performing one seek before each
(raw file data) read is quite minimal. I have attached a new (but incomplete)
patch, on which the following microbenchmarks are based. The patch is
essentially identical to Stepan's 2012 patch, except I haven't yet decided how
best to preserve the semantics of ZipFile.close().
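For illustration only, here is a minimal sketch (not the attached patch) of the
seek-before-read pattern whose overhead is being measured: each reader keeps
its own offset into the shared file object and restores it before every raw
read, so several readers can share one handle without trampling each other's
position. The class and attribute names below are made up for this example:

import threading

class SharedSeekingReader:
    """Illustrative only: reads from a shared file object, seeking to its
    own private offset before every raw read."""

    def __init__(self, fileobj, start, lock=None):
        self._fileobj = fileobj                 # handle shared between readers
        self._pos = start                       # this reader's private offset
        self._lock = lock or threading.Lock()   # shared with the other readers

    def read(self, n):
        with self._lock:
            # The one extra seek per raw read is the overhead discussed above.
            self._fileobj.seek(self._pos)
            data = self._fileobj.read(n)
            self._pos = self._fileobj.tell()
        return data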
"my.zip" is the same my.zip from issue22842. It contains 10,000 files each
containing 10 bytes over 2 lines.
"my2.zip" contains 8,000 files each containing the same copy of 64kb of
/dev/urandom output. The resulting ZIP is 500mb.
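The exact archives are the ones attached to the referenced issues; for anyone
wanting to reproduce the shape of the data, something along these lines should
produce comparable archives (member names and file contents here are
arbitrary):

import os
import zipfile

def make_small_zip(path='my.zip', count=10000):
    # 10,000 members, each 10 bytes split over 2 lines.
    with zipfile.ZipFile(path, 'w') as zf:
        for i in range(count):
            zf.writestr('member%05d.txt' % i, 'abcd\nefgh\n')

def make_large_zip(path='my2.zip', count=8000):
    # 8,000 members, each the same 64KB chunk of random bytes (the originals
    # used /dev/urandom output). Stored uncompressed by default, this gives
    # an archive of roughly 500MB.
    chunk = os.urandom(64 * 1024)
    with zipfile.ZipFile(path, 'w') as zf:
        for i in range(count):
            zf.writestr('member%05d.bin' % i, chunk)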
For each test, the first run uses the existing zipfile module and the second
run uses the patched module. In summary (the arithmetic behind these
percentages is shown after the list):
* There is a 35% performance increase in str mode when handling many small
files (on OS X, at least).
* There is a 6% performance decrease in file mode when handling small
sequential reads.
* There is a 2.4% performance decrease in file mode when handling large
sequential reads.
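For reference, these percentages follow directly from the best-of-3 timings
listed below (stock run first, patched run second); the exact figure shifts
slightly depending on which run is taken as the baseline, but it rounds to
the same value either way:

# my.zip, a(): str mode, many small files
(1.47 - 0.950) / 1.47    # ~0.35  -> time reduced by ~35%
# my.zip, d(): file mode, small sequential reads
(0.851 - 0.800) / 0.851  # ~0.06  -> time up ~6%
# my2.zip, e(): file mode, large sequential reads
(1.69 - 1.65) / 1.69     # ~0.024 -> time up ~2.4%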
From my reading of zipfile.py, it is clear there are _many_ ways to improve
its performance (probably starting with readline()), and rejection of a
functional fix should almost certainly be at the bottom of that list.
For each of the tests below, the functions used were:
import zipfile

def a():
    """
    Test concurrent line reads to a str mode ZipFile.
    """
    zf = zipfile.ZipFile('my2.zip')
    members = [zf.open(n) for n in zf.namelist()]
    for m in members:
        m.readline()
    for m in members:
        m.readline()

def c():
    """
    Test sequential small reads to a str mode ZipFile.
    """
    zf = zipfile.ZipFile('my2.zip')
    for name in zf.namelist():
        with zf.open(name) as zfp:
            zfp.read(1000)

def d():
    """
    Test sequential small reads to a file mode ZipFile.
    """
    fp = open('my2.zip', 'rb')
    zf = zipfile.ZipFile(fp)
    for name in zf.namelist():
        with zf.open(name) as zfp:
            zfp.read(1000)

def e():
    """
    Test sequential large reads to a file mode ZipFile.
    """
    fp = open('my2.zip', 'rb')
    zf = zipfile.ZipFile(fp)
    for name in zf.namelist():
        with zf.open(name) as zfp:
            zfp.read()
---- my.zip ----
$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 1.47 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 950 msec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 1.3 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 865 msec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 800 msec per loop
$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 851 msec per loop
---- my2.zip ----
$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 1.46 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.a()'
10 loops, best of 3: 1.16 sec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 1.13 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.c()'
10 loops, best of 3: 892 msec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 842 msec per loop
$ python3.4 -m timeit -s 'import my' 'my.d()'
10 loops, best of 3: 882 msec per loop
---
$ python3.4 -m timeit -s 'import my' 'my.e()'
10 loops, best of 3: 1.65 sec per loop
$ python3.4 -m timeit -s 'import my' 'my.e()'
10 loops, best of 3: 1.69 sec per loop
----------
nosy: +dw
Added file: http://bugs.python.org/file37191/zipfile_concurrent_read_1.diff
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue14099>
_______________________________________