Mark Liu created BEAM-6154:
------------------------------

             Summary: Gcsio batch delete broken in Python 3
                 Key: BEAM-6154
                 URL: https://issues.apache.org/jira/browse/BEAM-6154
             Project: Beam
          Issue Type: Bug
          Components: sdk-py-core
            Reporter: Mark Liu
            Assignee: Ahmet Altay


I'm running the Python SDK against GCP on Python 3.5 and got the following
gcsio error while deleting files:

{code}
  File "/usr/local/lib/python3.5/site-packages/apache_beam/io/iobase.py", line 1077, in <genexpr>
    window.TimestampedValue(v, timestamp.MAX_TIMESTAMP) for v in outputs)
  File "/usr/local/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", line 315, in finalize_write
    num_threads)
  File "/usr/local/lib/python3.5/site-packages/apache_beam/internal/util.py", line 145, in run_using_threadpool
    return pool.map(fn_to_execute, inputs)
  File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 266, in map
    return self._map_async(func, iterable, mapstar, chunksize).get()
  File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 644, in get
    raise self._value
  File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 119, in worker
    result = (True, func(*args, **kwds))
  File "/usr/local/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
    return list(map(*args))
  File "/usr/local/lib/python3.5/site-packages/apache_beam/io/filebasedsink.py", line 299, in _rename_batch
    FileSystems.rename(source_files, destination_files)
  File "/usr/local/lib/python3.5/site-packages/apache_beam/io/filesystems.py", line 252, in rename
    return filesystem.rename(source_file_names, destination_file_names)
  File "/usr/local/lib/python3.5/site-packages/apache_beam/io/gcp/gcsfilesystem.py", line 229, in rename
    copy_statuses = gcsio.GcsIO().copy_batch(batch)
  File "/usr/local/lib/python3.5/site-packages/apache_beam/io/gcp/gcsio.py", line 322, in copy_batch
    api_calls = batch_request.Execute(self.client._http)  # pylint: disable=protected-access
  File "/usr/local/lib/python3.5/site-packages/apitools/base/py/batch.py", line 222, in Execute
    batch_http_request.Execute(http)
  File "/usr/local/lib/python3.5/site-packages/apitools/base/py/batch.py", line 480, in Execute
    self._Execute(http)
  File "/usr/local/lib/python3.5/site-packages/apitools/base/py/batch.py", line 450, in _Execute
    mime_response = parser.parsestr(header + response.content)
TypeError: Can't convert 'bytes' object to str implicitly
{code} 

After looking into the related code in the apitools library, I found that the
response.content returned by the HTTP request to GCS is bytes, and apitools
doesn't handle this case: it concatenates the bytes with a str header (the
parser.parsestr(header + response.content) call in the traceback). This can
block any pipeline that depends on gcsio, and it apparently blocks all
Dataflow jobs on Python 3.
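
For illustration, here is a minimal sketch of the failure and one possible
workaround. It is not the actual apitools code or a proposed patch: the header
and body values are made up, parse_batch_response is a hypothetical helper, and
UTF-8 is an assumed encoding for the response body.

```python
import email.parser

# Made-up stand-ins for the values apitools builds in batch.py:
# `header` is a str, while the HTTP response body arrives as bytes on Python 3.
header = "content-type: multipart/mixed; boundary=batch_abc\r\n\r\n"
content = (
    b"--batch_abc\r\n"
    b"Content-Type: application/http\r\n"
    b"\r\n"
    b"HTTP/1.1 204 No Content\r\n"
    b"\r\n"
    b"--batch_abc--\r\n"
)


def parse_batch_response(header, content, encoding="utf-8"):
    """Hypothetical helper: decode a bytes body before str concatenation."""
    if isinstance(content, bytes):
        # Assumption: the batch response body is UTF-8 text.
        content = content.decode(encoding)
    return email.parser.Parser().parsestr(header + content)


# Reproduces the failing pattern: str + bytes raises TypeError on Python 3.
failure = None
try:
    email.parser.Parser().parsestr(header + content)
except TypeError as exc:
    failure = exc

# Decoding first lets the MIME parser see a single str.
message = parse_batch_response(header, content)
parts = message.get_payload()
```

On Python 2 the concatenation works because str is a byte string there, which
is presumably why this only shows up on Python 3.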

This could be another reason to move off the apitools dependency, as tracked in
[BEAM-4850|https://issues.apache.org/jira/browse/BEAM-4850].



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
