[ 
https://issues.apache.org/jira/browse/BEAM-4011?focusedWorklogId=124874&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-124874
 ]

ASF GitHub Bot logged work on BEAM-4011:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 19/Jul/18 09:18
            Start Date: 19/Jul/18 09:18
    Worklog Time Spent: 10m 
      Work Description: knub commented on issue #5024: [BEAM-4011] Unify Python IO glob implementation.
URL: https://github.com/apache/beam/pull/5024#issuecomment-406211816
 
 
   Hi udim,
   
   Thanks for your quick response! For future reference, here is a full replacement that keeps the same interface as before:
   
   ```python
   from apache_beam.io.filesystems import FileSystems
   
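   def glob(path):
       """Return the paths of all files matching the glob pattern `path`."""
       # match() takes a list of patterns and returns one MatchResult per pattern.
       match_result = FileSystems.match([path])[0]
       file_metadata_objects = match_result.metadata_list
       return [fm.path for fm in file_metadata_objects]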
   ```
   
   ---
   
   Side note: I think there is a bug in the `FileSystems.match` implementation for certain globs, such as a wildcard directly under the bucket root. See the following snippet:
   
   ```ipynb
   In [1]: from apache_beam.io.filesystems import FileSystems
   
   In [2]: FileSystems.match(["gs://<bucket name>/*"])
   /usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py:176: DeprecationWarning: object() takes no parameters
     super(GcsIO, cls).__new__(cls, storage_client))
   WARNING:root:Retry with exponential backoff: waiting for 4.3505648468 seconds before retrying list_prefix because we caught exception: ValueError: GCS path must be in the form gs://<bucket>/<object>.
    Traceback for above exception (most recent call last):
     File "/usr/local/lib/python2.7/dist-packages/apache_beam/utils/retry.py", line 180, in wrapper
       return fun(*args, **kwargs)
     File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 423, in list_prefix
       bucket, prefix = parse_gcs_path(path)
     File "/usr/local/lib/python2.7/dist-packages/apache_beam/io/gcp/gcsio.py", line 147, in parse_gcs_path
       raise ValueError('GCS path must be in the form gs://<bucket>/<object>.')
   ```
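
One possible explanation for the traceback above, offered only as a guess: if the glob is reduced to its literal prefix before listing, a bucket-level pattern leaves just `gs://<bucket>/`, and a parser that insists on a non-empty object name would reject it. The sketch below reproduces that behaviour with the standard library alone. The regex, both helper names (`parse_gcs_path_strict`, `glob_prefix`), and the bucket name `my-bucket` are assumptions made for illustration; none of them is copied from `apache_beam`.

```python
import re

# Hypothetical stand-in for a strict gs:// parser. The regex is an assumption
# for illustration only; it is not taken from apache_beam.
_STRICT_GCS_PATH = re.compile(r'^gs://([^/]+)/(.+)$')


def parse_gcs_path_strict(path):
    """Split a gs:// path into (bucket, object), requiring a non-empty object."""
    match = _STRICT_GCS_PATH.match(path)
    if match is None:
        raise ValueError('GCS path must be in the form gs://<bucket>/<object>.')
    return match.group(1), match.group(2)


def glob_prefix(pattern):
    """Return the literal part of the pattern before the first wildcard."""
    match = re.search(r'[*?\[]', pattern)
    return pattern if match is None else pattern[:match.start()]


# A bucket-level glob leaves only "gs://<bucket>/" as the prefix to list.
prefix = glob_prefix('gs://my-bucket/*')
try:
    parse_gcs_path_strict(prefix)
    bucket_level_ok = True
except ValueError:
    bucket_level_ok = False

# A glob below a directory keeps a non-empty object prefix, so parsing succeeds.
nested = parse_gcs_path_strict(glob_prefix('gs://my-bucket/data/*.csv'))
```

If this is indeed the cause, allowing an empty object name when listing by prefix would let bucket-level globs through, though that would need to be checked against the actual `gcsio.py` code.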
   



Issue Time Tracking
-------------------

    Worklog Id:     (was: 124874)
    Time Spent: 3h 10m  (was: 3h)

> Python SDK: add glob support for HDFS
> -------------------------------------
>
>                 Key: BEAM-4011
>                 URL: https://issues.apache.org/jira/browse/BEAM-4011
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>            Reporter: Udi Meiri
>            Assignee: Udi Meiri
>            Priority: Major
>             Fix For: 2.5.0
>
>          Time Spent: 3h 10m
>  Remaining Estimate: 0h
>



