[ 
https://issues.apache.org/jira/browse/BEAM-9029?focusedWorklogId=371343&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-371343
 ]

ASF GitHub Bot logged work on BEAM-9029:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jan/20 04:16
            Start Date: 14/Jan/20 04:16
    Worklog Time Spent: 10m 
      Work Description: icemoon1987 commented on pull request #10459: 
[BEAM-9029]Fix two bugs in Python SDK S3 filesystem support
URL: https://github.com/apache/beam/pull/10459#discussion_r366145805
 
 

 ##########
 File path: sdks/python/apache_beam/io/aws/s3io.py
 ##########
 @@ -118,7 +118,18 @@ def list_prefix(self, path):
     logging.info("Starting the size estimation of the input")
 
     while True:
-      response = self.client.list(request)
+
+      # The list operation will raise an exception when trying to list a
+      # nonexistent S3 path. A missing path is not an error here, so treat
+      # a 404 as "no matches" and exit the loop; re-raise anything else.
+      try:
+        response = self.client.list(request)
+      except messages.S3ClientError as e:
+        if e.code == 404:
+          break
+        raise
 
 Review comment:
   Hi pabloem,
   
   Thank you for reviewing :)
   
   This "break" exits the "while" loop that starts at line 120. The
"response.items" access at line 133 is also inside that loop, so once the loop
breaks it is never executed and cannot raise an exception. The variable
"response" is only used inside that loop.
   
   The "break" stops the loop, and list_prefix then returns the variable
"file_sizes" at line 147. Since "file_sizes" is initialized as an empty dict at
line 114, the function returns an empty dict. It should be safe.
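   The behavior described above can be sketched in isolation (a minimal,
illustrative model: "FakeClient", "S3ClientError", and this trimmed
"list_prefix" are stand-ins, not Beam's actual classes):

```python
# Minimal sketch of the control flow discussed above. FakeClient and
# S3ClientError are illustrative stand-ins, not Beam's real classes.
class S3ClientError(Exception):
  def __init__(self, message, code):
    super().__init__(message)
    self.code = code


class FakeClient(object):
  """Simulates a client whose list() raises a 404 for a nonexistent prefix."""
  def list(self, request):
    raise S3ClientError('Tried to list nonexistent S3 path', 404)


def list_prefix(client, request):
  file_sizes = {}  # initialized empty, like line 114
  while True:
    try:
      response = client.list(request)
    except S3ClientError as e:
      if e.code == 404:
        break  # exits the while loop; "response" is never touched again
      raise
    for item in response.items:  # inside the loop, so never reached after break
      file_sizes[item.key] = item.size
  return file_sizes  # an empty dict when the path did not exist


print(list_prefix(FakeClient(), 'nonexistent/prefix'))  # -> {}
```

   The break is taken on the first iteration, so the items loop never runs and
the caller simply sees "no files under this prefix".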
 
----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

            Worklog Id:     (was: 371343)
    Remaining Estimate: 22h 20m  (was: 22.5h)
            Time Spent: 1h 40m  (was: 1.5h)

> Two bugs in Python SDK S3 filesystem support
> --------------------------------------------
>
>                 Key: BEAM-9029
>                 URL: https://issues.apache.org/jira/browse/BEAM-9029
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Wenhai Pan
>            Assignee: Wenhai Pan
>            Priority: Major
>              Labels: pull-request-available
>   Original Estimate: 24h
>          Time Spent: 1h 40m
>  Remaining Estimate: 22h 20m
>
> Hi :)
> There seem to be 2 bugs in the S3 filesystem support.
> I tried to use S3 storage for a simple wordcount demo with DirectRunner.
> The demo script:
> {code:python}
> def main():
>     options = PipelineOptions().view_as(StandardOptions)
>     options.runner = 'DirectRunner'
> 
>     pipeline = beam.Pipeline(options=options)
>     (
>         pipeline
>         | ReadFromText("s3://mx-machine-learning/panwenhai/beam_test/test_data")
>         | "extract_words" >> beam.FlatMap(lambda x: re.findall(r"[A-Za-z\']+", x))
>         | beam.combiners.Count.PerElement()
>         | beam.MapTuple(lambda word, count: "%s: %s" % (word, count))
>         | WriteToText("s3://mx-machine-learning/panwenhai/beam_test/output")
>     )
> 
>     result = pipeline.run()
>     result.wait_until_finish()
>     return
> {code}
>  
> Error message 1:
> {noformat}
> apache_beam.io.filesystem.BeamIOError: Match operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-*-of-00001': 
> BeamIOError("List operation failed with exceptions 
> {'s3://mx-machine-learning/panwenhai/beam_test/output-': S3ClientError('Tried 
> to list nonexistent S3 path: 
> s3://mx-machine-learning/panwenhai/beam_test/output-', 404)}")} [while 
> running 'WriteToText/Write/WriteImpl/PreFinalize'] with exceptions 
> None{noformat}
>  
> After digging into the code, it seems the Boto3 client's list function raises 
> an exception when asked to list a nonexistent S3 path 
> (beam/sdks/python/apache_beam/io/aws/clients/s3/boto3_client.py, line 111), 
> and the S3IO class does not handle this exception in its list_prefix function 
> (beam/sdks/python/apache_beam/io/aws/s3io.py, line 121).
> When the runner tries to list and delete any existing output files and there 
> are none, it ends up listing a nonexistent S3 path, which triggers the 
> exception.
> A nonexistent path is not an error in this situation, so I think we can 
> safely ignore this exception in the S3IO list_prefix function.
> Error Message 2:
> {noformat}
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in delete
> exceptions = {path: error for (path, error) in results
> File 
> "/Users/wenhai.pan/venvs/tfx/lib/python3.7/site-packages/apache_beam-2.19.0.dev0-py3.7.egg/apache_beam/io/aws/s3filesystem.py",
>  line 272, in <dictcomp>
> exceptions = {path: error for (path, error) in results
> ValueError: too many values to unpack (expected 2) [while running 
> 'WriteToText/Write/WriteImpl/FinalizeWrite']{noformat}
>  
> When the runner tries to delete the temporary output directory, it triggers 
> this exception. The exception is caused by unpacking (path, error) directly 
> from "results", which is a dict 
> (beam/sdks/python/apache_beam/io/aws/s3filesystem.py, line 272): iterating a 
> dict yields only its keys. I think we should iterate results.items() here.
> I have submitted a patch for these 2 bugs: 
> https://github.com/apache/beam/pull/10459
>  
> Thank you.
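
The second bug can be reproduced in isolation: iterating a dict directly
yields only its keys, so unpacking each key (a long path string) into
(path, error) raises ValueError. A minimal sketch, with a made-up "results"
dict standing in for the one built in s3filesystem.py:

```python
# "results" stands in for the dict built in s3filesystem.py delete();
# the key and value below are made up for illustration.
results = {'s3://bucket/tmp/output-00000': IOError('delete failed')}

# Buggy version: iterating the dict yields key strings, and unpacking a
# long path string into two names raises ValueError.
try:
  exceptions = {path: error for (path, error) in results}
except ValueError as e:
  print(e)  # too many values to unpack (expected 2)

# Fixed version: .items() yields (key, value) pairs.
exceptions = {path: error for (path, error) in results.items()}
print(exceptions)
```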



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
