Tobi995 opened a new pull request, #1847: URL: https://github.com/apache/libcloud/pull/1847
## Optimize read_in_chunks

### Description

We noticed that `utils.files.read_in_chunks` has significant memory and CPU cost when uploading or downloading large files with mismatched chunk sizes. This PR attempts to fix that by potentially yielding multiple chunks per read and by avoiding large slicing operations. This seems to remove some worse-than-quadratic scaling, bringing massive speedups.

```python
import time
import tracemalloc

import numpy as np

from libcloud.utils.files import read_in_chunks


def run_scenario_1(data: bytes):
    # similar to calling _upload_multipart_chunks with one large array of bytes
    for a in read_in_chunks(iter([data]), chunk_size=5 * 1024 * 1024, fill_size=True):
        ...


def run_scenario_2(data: bytes):
    # as in download_object_as_stream: large response chunks, small output chunks
    response_chunk = 5 * 1024 * 1024
    responses = [data[i:i + response_chunk] for i in range(0, len(data), response_chunk)]
    for a in read_in_chunks(iter(responses), chunk_size=8096, fill_size=True):
        ...


if __name__ == "__main__":
    tracemalloc.start()
    data = "c".encode("utf-8") * (40 * 1024 * 1024)

    # median wall-clock time over 10 runs of each scenario
    times = []
    for i in range(10):
        start_time = time.time()
        run_scenario_1(data)
        times.append(time.time() - start_time)
    print("scenario 1", np.median(times))

    times = []
    for i in range(10):
        start_time = time.time()
        run_scenario_2(data)
        times.append(time.time() - start_time)
    print("scenario 2", np.median(times))

    current_size, peak = tracemalloc.get_traced_memory()
    print(f'Current consumption: {current_size / 1024 / 1024}mb, '
          f'peak since last log: {peak / 1024 / 1024}mb.')
```

Gives the following stats (times in seconds):

Without this PR:

```
scenario 1 0.08405923843383789
scenario 2 30.04505753517151
Current consumption: 40.009538650512695mb, peak since last log: 159.9015874862671mb.
```

With this PR:

```
scenario 1 0.013306617736816406
scenario 2 0.060915589332580566
Current consumption: 40.01009464263916mb, peak since last log: 85.03302574157715mb.
```

Scenario 2 with different data sizes:

- 20 MB: Before 3.97s, After 0.028s
- 30 MB: Before 8.4s, After 0.04s
- 40 MB: Before 30.0s, After 0.05s
- 50 MB: Before 54s, After 0.07s
- 60 MB: Before 102s, After 0.096s

### Status

done, ready for review

### Checklist (tick everything that applies)

- [X] [Code linting](http://libcloud.readthedocs.org/en/latest/development.html#code-style-guide) (required, can be done after the PR checks)
- [ ] Documentation
- [X] [Tests](http://libcloud.readthedocs.org/en/latest/testing.html)
- [ ] [ICLA](http://libcloud.readthedocs.org/en/latest/development.html#contributing-bigger-changes) (required for bigger changes)
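For readers who want the idea from the description in isolation, here is a minimal sketch of the two techniques it names: yielding several chunks per read, and avoiding repeated slicing of a large buffer. It is an illustration under assumptions, not the code in this PR and not libcloud's API. The name `rechunk` is hypothetical, and the sketch always behaves like the `fill_size=True` case used in the benchmark.

```python
from typing import Iterable, Iterator, List


def rechunk(parts: Iterable[bytes], chunk_size: int) -> Iterator[bytes]:
    """Re-slice byte strings into pieces of exactly chunk_size bytes.

    Hypothetical helper for illustration only. The final piece may be
    shorter than chunk_size.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")

    pending: List[bytes] = []  # fragments not yet large enough to emit
    pending_len = 0

    for part in parts:
        if not part:
            continue
        pending.append(part)
        pending_len += len(part)
        if pending_len < chunk_size:
            continue  # keep collecting; no copying happens here

        # Join once per read, then walk the buffer through a memoryview so
        # each emitted chunk copies only chunk_size bytes. Re-slicing the
        # remaining tail after every chunk would copy it again and again.
        buf = memoryview(b"".join(pending))
        offset = 0
        while len(buf) - offset >= chunk_size:
            yield bytes(buf[offset:offset + chunk_size])
            offset += chunk_size

        # Carry the short remainder over to the next read.
        rest = bytes(buf[offset:])
        pending = [rest] if rest else []
        pending_len = len(rest)

    if pending_len:
        yield b"".join(pending)
```

With large inputs and a small `chunk_size` (as in scenario 2), a single read produces many chunks in the inner loop, and only the short remainder is copied forward, so the total work stays roughly linear in the input size.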