Tobi995 opened a new pull request, #1847:
URL: https://github.com/apache/libcloud/pull/1847
## Optimize read_in_chunks
### Description
We noticed that `utils.files.read_in_chunks` has significant memory and CPU cost
when uploading or downloading large files if the size of the incoming chunks does
not match the requested `chunk_size`. This PR attempts to fix that by yielding
multiple chunks per read where possible and by avoiding slicing operations on
large buffers. This seems to remove worse-than-quadratic scaling, bringing
massive speedups.
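To illustrate the idea (this is not the actual diff of this PR), here is a minimal sketch of chunking that emits every full chunk available after each read and advances an offset instead of repeatedly re-slicing the remaining buffer. The function name `chunked` is hypothetical, and the sketch assumes behaviour comparable to `fill_size=True`, where only the final chunk may be shorter than `chunk_size`.

```python
from typing import Iterable, Iterator


def chunked(pieces: Iterable[bytes], chunk_size: int) -> Iterator[bytes]:
    """Yield chunk_size-byte chunks from pieces; only the last may be shorter.

    Hypothetical helper for illustration, not libcloud's implementation.
    """
    pending = b""  # leftover bytes from the previous read, shorter than chunk_size
    for piece in pieces:
        data = pending + piece if pending else piece
        offset = 0
        # Emit all full chunks contained in this read by moving an offset,
        # instead of copying the remainder (data = data[chunk_size:]) each time.
        while len(data) - offset >= chunk_size:
            yield data[offset:offset + chunk_size]
            offset += chunk_size
        # At most one small copy per read: fewer than chunk_size bytes remain.
        pending = data[offset:]
    if pending:
        yield pending
```

With this shape, the work per read is proportional to the size of that read, so the total cost should grow roughly linearly with the amount of data.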
```python
import time
import tracemalloc

import numpy as np

from libcloud.utils.files import read_in_chunks


def run_scenario_1(data: bytes):
    # similar to calling _upload_multipart_chunks with one large array of bytes
    for a in read_in_chunks(iter([data]), chunk_size=5 * 1024 * 1024, fill_size=True):
        ...


def run_scenario_2(data: bytes):
    # as in download_object_as_stream
    response_chunk = 5 * 1024 * 1024
    chunks = [data[i:i + response_chunk] for i in range(0, len(data), response_chunk)]
    for a in read_in_chunks(iter(chunks), chunk_size=8096, fill_size=True):
        ...


if __name__ == "__main__":
    tracemalloc.start()
    data = "c".encode("utf-8") * (40 * 1024 * 1024)

    # median wall-clock time over 10 runs of each scenario
    times = []
    for i in range(10):
        start_time = time.time()
        run_scenario_1(data)
        times.append(time.time() - start_time)
    print("scenario 1", np.median(times))

    times = []
    for i in range(10):
        start_time = time.time()
        run_scenario_2(data)
        times.append(time.time() - start_time)
    print("scenario 2", np.median(times))

    current_size, peak = tracemalloc.get_traced_memory()
    print(f'Current consumption: {current_size / 1024 / 1024}mb, '
          f'peak since last log: {peak / 1024 / 1024}mb.')
```
Gives the following stats:
Without this PR:
```
scenario 1 0.08405923843383789
scenario 2 30.04505753517151
Current consumption: 40.009538650512695mb, peak since last log: 159.9015874862671mb.
```
With this PR:
```
scenario 1 0.013306617736816406
scenario 2 0.060915589332580566
Current consumption: 40.01009464263916mb, peak since last log: 85.03302574157715mb.
```
Scenario 2 with different data sizes:
- 20 MB: Before 3.97s, After 0.028s
- 30 MB: Before 8.4s, After 0.04s
- 40 MB: Before 30.0s, After 0.05s
- 50 MB: Before 54s, After 0.07s
- 60 MB: Before 102s, After 0.096s
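As a rough sanity check on the scaling claim, the "before" timings listed above can be fitted to `time ~ size**k` relative to the 20 MB run. This is only a back-of-the-envelope estimate from five data points; the helper name `growth_exponents` is made up for this example.

```python
import math

# (size in MB, seconds before this PR), copied from the list above
before = [(20, 3.97), (30, 8.4), (40, 30.0), (50, 54.0), (60, 102.0)]


def growth_exponents(samples):
    """Estimate k in time ~ size**k for each sample relative to the first."""
    base_size, base_time = samples[0]
    return [
        (size, math.log(t / base_time) / math.log(size / base_size))
        for size, t in samples[1:]
    ]


exponents = growth_exponents(before)
for size, k in exponents:
    print(f"{size} MB vs {before[0][0]} MB: k = {k:.2f}")
```

For the larger sizes the estimated exponent comes out above 2, which is consistent with the worse-than-quadratic behaviour described in the description.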
### Status
done, ready for review
### Checklist (tick everything that applies)
- [X] [Code linting](http://libcloud.readthedocs.org/en/latest/development.html#code-style-guide) (required, can be done after the PR checks)
- [ ] Documentation
- [X] [Tests](http://libcloud.readthedocs.org/en/latest/testing.html)
- [ ] [ICLA](http://libcloud.readthedocs.org/en/latest/development.html#contributing-bigger-changes) (required for bigger changes)