[
https://issues.apache.org/jira/browse/BEAM-6027?focusedWorklogId=246312&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-246312
]
ASF GitHub Bot logged work on BEAM-6027:
----------------------------------------
Author: ASF GitHub Bot
Created on: 21/May/19 19:00
Start Date: 21/May/19 19:00
Worklog Time Spent: 10m
Work Description: udim commented on pull request #8553: [BEAM-6027] Fix
slow downloads when reading from GCS
URL: https://github.com/apache/beam/pull/8553#discussion_r286177674
##########
File path: sdks/python/apache_beam/io/filesystemio.py
##########
@@ -80,16 +80,18 @@ def finish(self):
 class DownloaderStream(io.RawIOBase):
   """Provides a stream interface for Downloader objects."""
-  def __init__(self, downloader, mode='rb'):
+  def __init__(self, downloader, read_buffer_size=io.DEFAULT_BUFFER_SIZE, mode='rb'):
Review comment:
Line too long. You should be able to run lint checks locally using
`./gradlew lint` in the root of the git repo.
----------------------------------------------------------------
Issue Time Tracking
-------------------
Worklog Id: (was: 246312)
Time Spent: 10m
Remaining Estimate: 0h
> Slow DownloaderStream when reading from GCS
> -------------------------------------------
>
> Key: BEAM-6027
> URL: https://issues.apache.org/jira/browse/BEAM-6027
> Project: Beam
> Issue Type: Bug
> Components: sdk-py-core
> Reporter: Andreas Jansson
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> DownloaderStream inherits from io.RawIOBase, whose readall() by default reads
> in io.DEFAULT_BUFFER_SIZE chunks. This causes extremely slow performance
> when invoking read() on handles returned by GcsIO().open().
> The following code can take ~60 seconds to download a single 2MB file:
> {code:python}
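> import time
> from apache_beam.io.gcp.gcsio import GcsIO
>
> gcs = GcsIO()
> t = time.time()
> path = 'gs://my-bucket/my-2MB-file'
> with gcs.open(path) as f:
>     f.read()  # read() with no size delegates to readall()
> duration = time.time() - t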
> {code}
> This monkey patch makes the same download code take <1 second:
> {code:python}
> from apache_beam.io.gcp import gcsio
> from apache_beam.io.filesystemio import DownloaderStream
>
> def downloader_stream_readall(self):
>     """Read until EOF, using multiple read() calls."""
>     res = bytearray()
>     while True:
>         # Fetch in large chunks instead of io.DEFAULT_BUFFER_SIZE ones.
>         data = self.read(gcsio.DEFAULT_READ_BUFFER_SIZE)
>         if not data:
>             break
>         res += data
>     if res:
>         return bytes(res)
>     else:
>         # Nothing was read: pass through whatever read() returned.
>         return data
>
> DownloaderStream.readall = downloader_stream_readall
> {code}
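
The effect described in the issue can be reproduced without GCS. The sketch below is only an illustration under stated assumptions: CountingRawStream and LargeChunkStream are hypothetical stand-ins for DownloaderStream, each readinto() call stands in for one remote request, and the 16 MiB CHUNK_SIZE is an assumed value, not necessarily Beam's gcsio.DEFAULT_READ_BUFFER_SIZE.

```python
import io


class CountingRawStream(io.RawIOBase):
    """Hypothetical in-memory raw stream that counts readinto() calls."""

    def __init__(self, payload):
        super().__init__()
        self._payload = payload
        self._pos = 0
        self.calls = 0

    def readable(self):
        return True

    def readinto(self, b):
        # Each call stands in for one remote range request.
        self.calls += 1
        chunk = self._payload[self._pos:self._pos + len(b)]
        b[:len(chunk)] = chunk
        self._pos += len(chunk)
        return len(chunk)


class LargeChunkStream(CountingRawStream):
    """Same stream, with readall() overridden to request large chunks."""

    CHUNK_SIZE = 16 * 1024 * 1024  # assumed value, for illustration only

    def readall(self):
        res = bytearray()
        while True:
            data = self.read(self.CHUNK_SIZE)
            if not data:
                break
            res += data
        return bytes(res)


payload = b'x' * (2 * 1024 * 1024)  # 2 MiB, like the file in the issue

default_stream = CountingRawStream(payload)
patched_stream = LargeChunkStream(payload)

# read() with no size argument delegates to readall() in both cases.
default_data = default_stream.read()
patched_data = patched_stream.read()

print('default readall():', default_stream.calls, 'readinto() calls')
print('large-chunk readall():', patched_stream.calls, 'readinto() calls')
```

Both streams return the same bytes. The inherited readall() issues one small read per io.DEFAULT_BUFFER_SIZE chunk, while the override needs one read for the data and one more to detect EOF. When every read is a network round trip, that difference in call count is a plausible source of the slowdown reported above.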