Tobi995 opened a new pull request, #1847:
URL: https://github.com/apache/libcloud/pull/1847

   ## Optimize read_in_chunks
   
   ### Description
   
   We noticed that `libcloud.utils.files.read_in_chunks` has significant memory and CPU cost when uploading or downloading large files with mismatched chunk sizes. This PR addresses that by yielding multiple chunks per read where possible and by avoiding large slicing operations. This removes some worse-than-quadratic scaling, giving massive speed-ups.
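   A minimal sketch of the idea (the `read_in_chunks_sketch` name and details are hypothetical, not the actual patch): buffer incoming blobs, yield every full chunk available per read, and trim the consumed prefix once per read instead of re-slicing the buffer for each chunk.

   ```python
   def read_in_chunks_sketch(iterator, chunk_size=8096, fill_size=False):
       # Illustrative only: names and exact semantics are assumptions,
       # not libcloud's API.
       buf = bytearray()
       for blob in iterator:
           buf += blob
           if fill_size:
               # Yield as many full chunks as this read allows, tracking an
               # offset so the buffer is trimmed only once per incoming blob.
               offset = 0
               while len(buf) - offset >= chunk_size:
                   yield bytes(buf[offset:offset + chunk_size])
                   offset += chunk_size
               del buf[:offset]
           else:
               yield bytes(buf)
               buf.clear()
       if buf:
           # Flush the partial final chunk.
           yield bytes(buf)
   ```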
   
   ```python
   import time
   import tracemalloc

   import numpy as np

   from libcloud.utils.files import read_in_chunks
   
   def run_scenario_1(data: bytes):
       # similar to calling _upload_multipart_chunks with one large array of bytes
       for a in read_in_chunks(iter([data]), chunk_size=5 * 1024 * 1024, fill_size=True):
           ...
   
   def run_scenario_2(data: bytes):
       # as in download_object_as_stream
       response_chunk = 5 * 1024 * 1024
       for a in read_in_chunks(iter([data[i:i + response_chunk] for i in range(0, len(data), response_chunk)]), chunk_size=8096, fill_size=True):
           ...
   
   if __name__ == "__main__":
       tracemalloc.start()
       data = "c".encode("utf-8") * (40 * 1024 * 1024)
       times = []
       for i in range(10):
           start_time = time.time()
           run_scenario_1(data)
           times.append(time.time() - start_time)
       print("scenario 1", np.median(times))
   
       times = []
       for i in range(10):
           start_time = time.time()
           run_scenario_2(data)
           times.append(time.time() - start_time)
       print("scenario 2", np.median(times))
   
       current_size, peak = tracemalloc.get_traced_memory()
       print(f'Current consumption: {current_size / 1024 / 1024}mb, '
             f'peak since last log: {peak / 1024 / 1024}mb.')
   ``` 
   This gives the following results.

   Without this PR:
   ```
   scenario 1 0.08405923843383789
   scenario 2 30.04505753517151
   Current consumption: 40.009538650512695mb, peak since last log: 159.9015874862671mb.
   ```
   With this PR:
   ```
   scenario 1 0.013306617736816406
   scenario 2 0.060915589332580566
   Current consumption: 40.01009464263916mb, peak since last log: 85.03302574157715mb.
   ```
   Scenario 2 with different data sizes:
   
   - 20 MB: before 3.97 s, after 0.028 s
   - 30 MB: before 8.4 s, after 0.04 s
   - 40 MB: before 30.0 s, after 0.05 s
   - 50 MB: before 54 s, after 0.07 s
   - 60 MB: before 102 s, after 0.096 s
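   For context on the superlinear scaling: when each chunk is produced by slicing off the front of the remaining buffer, every iteration copies the whole tail, so draining an N-byte buffer in n chunks costs on the order of N·n byte copies. A rough illustration of the pattern (not libcloud's exact code):

   ```python
   def drain_by_slicing(data: bytes, chunk_size: int):
       # Anti-pattern sketch: data[chunk_size:] copies the entire remaining
       # tail on every iteration, making total work quadratic in len(data).
       while data:
           chunk, data = data[:chunk_size], data[chunk_size:]
           yield chunk
   ```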
   
   ### Status
   Done, ready for review.
   
   ### Checklist (tick everything that applies)
   
   - [X] [Code linting](http://libcloud.readthedocs.org/en/latest/development.html#code-style-guide) (required, can be done after the PR checks)
   - [ ] Documentation
   - [X] [Tests](http://libcloud.readthedocs.org/en/latest/testing.html)
   - [ ] [ICLA](http://libcloud.readthedocs.org/en/latest/development.html#contributing-bigger-changes) (required for bigger changes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@libcloud.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org
