[jira] [Work logged] (BEAM-6027) Slow DownloaderStream when reading from GCS

ASF GitHub Bot (JIRA) Thu, 23 May 2019 07:16:09 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-6027?focusedWorklogId=247425&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-247425
 ]


ASF GitHub Bot logged work on BEAM-6027:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 23/May/19 14:15
            Start Date: 23/May/19 14:15
    Worklog Time Spent: 10m 
      Work Description: fabito commented on issue #8553: [BEAM-6027] Fix slow 
downloads when reading from GCS
URL: https://github.com/apache/beam/pull/8553#issuecomment-495237471
 
 
   Hi @udim ,
   
   Using this snippet:
   
   ```python
   import tempfile
   import timeit
   
   from apache_beam.io.filesystems import FileSystems
   from apache_beam.io.gcp import gcsio
   from apache_beam.io.filesystemio import DownloaderStream
   
   
   # https://issues.apache.org/jira/browse/BEAM-6027
   def downloader_stream_readall(self):
       res = []
       while True:
           data = self.read(gcsio.DEFAULT_READ_BUFFER_SIZE)
           if not data:
               break
           res.append(data)
       return b''.join(res)
   
   
   original_read_all = DownloaderStream.readall
   
   
   if __name__ == '__main__':
       test_file = 'gs://cloud-samples-tests/vision/saigon.mp4'
       num_executions = 1
   
       def test_original():
           DownloaderStream.readall = original_read_all
           with FileSystems.open(test_file) as audio_file:
               with tempfile.NamedTemporaryFile(mode='w+b') as temp:
                   temp.write(audio_file.read())
   
       def test_refactored():
           DownloaderStream.readall = downloader_stream_readall
           with FileSystems.open(test_file) as audio_file:
               with tempfile.NamedTemporaryFile(mode='w+b') as temp:
                   temp.write(audio_file.read())
   
       print(timeit.timeit("test_original()", setup="from __main__ import 
test_original", number=num_executions))
       print(timeit.timeit("test_refactored()", setup="from __main__ import 
test_refactored", number=num_executions))
   ```
   
   I got the following output:
   
   ```
   120.99772120200214
   1.0684915780002484
   ```
   
   Hope that helps
   
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 247425)
    Time Spent: 40m  (was: 0.5h)

> Slow DownloaderStream when reading from GCS
> -------------------------------------------
>
>                 Key: BEAM-6027
>                 URL: https://issues.apache.org/jira/browse/BEAM-6027
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Andreas Jansson
>            Priority: Minor
>          Time Spent: 40m
>  Remaining Estimate: 0h
>
> DownloaderStream inherits io.RawIOBase, which by defaults reads 
> io.DEFAULT_BUFFER_SIZE chunks in .readall(). This is causing extremely slow 
> performance when invoking read() on handles returned by GcsIO().open().
> The following code can take ~60 seconds to download a single 2MB file:
> {code:python}
> gcs = GcsIO()
> t = time.time()
> path = 'gs://my-bucket/my-2MB-file'
> with gcs.open(path) as f:
>     f.read()
> duration = time.time() - t
> {code}
> This monkey patch makes the same download code take <1 second:
> {code:python}
> from apache_beam.io.gcp import gcsio
> from apache_beam.io.filesystemio import DownloaderStream
> def downloader_stream_readall(self):
>     """Read until EOF, using multiple read() call."""
>     res = bytearray()
>     while True:
>         data = self.read(gcsio.DEFAULT_READ_BUFFER_SIZE)
>         if not data:
>             break
>         res += data
>     if res:
>         return bytes(res)
>     else:
>         return data
> DownloaderStream.readall = downloader_stream_readall
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

[jira] [Work logged] (BEAM-6027) Slow DownloaderStream when reading from GCS

Reply via email to