[ 
https://issues.apache.org/jira/browse/BEAM-5417?focusedWorklogId=150795&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-150795
 ]

ASF GitHub Bot logged work on BEAM-5417:
----------------------------------------

                Author: ASF GitHub Bot
            Created on: 03/Oct/18 16:00
            Start Date: 03/Oct/18 16:00
    Worklog Time Spent: 10m 
      Work Description: chamikaramj closed pull request #6423: [BEAM-5417] 
Parity between GCS and local match
URL: https://github.com/apache/beam/pull/6423
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

As this is a foreign pull request (from a fork), the diff is supplied
below (as it won't show otherwise due to GitHub magic):

diff --git a/LICENSE b/LICENSE
index a2e93e2819c..44c35bb1a01 100644
--- a/LICENSE
+++ b/LICENSE
@@ -402,3 +402,262 @@
     LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION
     OF CONTRACT, TORT OR OTHERWISE,  ARISING FROM, OUT OF OR IN CONNECTION
     WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE
+
+################################################################################
+CPython LICENSE. Source:
+https://github.com/python/cpython/blob/81574b80e92554adf75c13fa42415beb8be383cb/LICENSE
+
+A. HISTORY OF THE SOFTWARE
+==========================
+
+Python was created in the early 1990s by Guido van Rossum at Stichting
+Mathematisch Centrum (CWI, see http://www.cwi.nl) in the Netherlands
+as a successor of a language called ABC.  Guido remains Python's
+principal author, although it includes many contributions from others.
+
+In 1995, Guido continued his work on Python at the Corporation for
+National Research Initiatives (CNRI, see http://www.cnri.reston.va.us)
+in Reston, Virginia where he released several versions of the
+software.
+
+In May 2000, Guido and the Python core development team moved to
+BeOpen.com to form the BeOpen PythonLabs team.  In October of the same
+year, the PythonLabs team moved to Digital Creations, which became
+Zope Corporation.  In 2001, the Python Software Foundation (PSF, see
+https://www.python.org/psf/) was formed, a non-profit organization
+created specifically to own Python-related Intellectual Property.
+Zope Corporation was a sponsoring member of the PSF.
+
+All Python releases are Open Source (see http://www.opensource.org for
+the Open Source Definition).  Historically, most, but not all, Python
+releases have also been GPL-compatible; the table below summarizes
+the various releases.
+
+    Release         Derived     Year        Owner       GPL-
+                    from                                compatible? (1)
+
+    0.9.0 thru 1.2              1991-1995   CWI         yes
+    1.3 thru 1.5.2  1.2         1995-1999   CNRI        yes
+    1.6             1.5.2       2000        CNRI        no
+    2.0             1.6         2000        BeOpen.com  no
+    1.6.1           1.6         2001        CNRI        yes (2)
+    2.1             2.0+1.6.1   2001        PSF         no
+    2.0.1           2.0+1.6.1   2001        PSF         yes
+    2.1.1           2.1+2.0.1   2001        PSF         yes
+    2.1.2           2.1.1       2002        PSF         yes
+    2.1.3           2.1.2       2002        PSF         yes
+    2.2 and above   2.1.1       2001-now    PSF         yes
+
+Footnotes:
+
+(1) GPL-compatible doesn't mean that we're distributing Python under
+    the GPL.  All Python licenses, unlike the GPL, let you distribute
+    a modified version without making your changes open source.  The
+    GPL-compatible licenses make it possible to combine Python with
+    other software that is released under the GPL; the others don't.
+
+(2) According to Richard Stallman, 1.6.1 is not GPL-compatible,
+    because its license has a choice of law clause.  According to
+    CNRI, however, Stallman's lawyer has told CNRI's lawyer that 1.6.1
+    is "not incompatible" with the GPL.
+
+Thanks to the many outside volunteers who have worked under Guido's
+direction to make these releases possible.
+
+
+B. TERMS AND CONDITIONS FOR ACCESSING OR OTHERWISE USING PYTHON
+===============================================================
+
+PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2
+--------------------------------------------
+
+1. This LICENSE AGREEMENT is between the Python Software Foundation
+("PSF"), and the Individual or Organization ("Licensee") accessing and
+otherwise using this software ("Python") in source or binary form and
+its associated documentation.
+
+2. Subject to the terms and conditions of this License Agreement, PSF hereby
+grants Licensee a nonexclusive, royalty-free, world-wide license to reproduce,
+analyze, test, perform and/or display publicly, prepare derivative works,
+distribute, and otherwise use Python alone or in any derivative version,
+provided, however, that PSF's License Agreement and PSF's notice of copyright,
+i.e., "Copyright (c) 2001, 2002, 2003, 2004, 2005, 2006, 2007, 2008, 2009, 
2010,
+2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018 Python Software Foundation; All
+Rights Reserved" are retained in Python alone or in any derivative version
+prepared by Licensee.
+
+3. In the event Licensee prepares a derivative work that is based on
+or incorporates Python or any part thereof, and wants to make
+the derivative work available to others as provided herein, then
+Licensee hereby agrees to include in any such work a brief summary of
+the changes made to Python.
+
+4. PSF is making Python available to Licensee on an "AS IS"
+basis.  PSF MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
+IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, PSF MAKES NO AND
+DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
+FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON WILL NOT
+INFRINGE ANY THIRD PARTY RIGHTS.
+
+5. PSF SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
+FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
+A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON,
+OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
+
+6. This License Agreement will automatically terminate upon a material
+breach of its terms and conditions.
+
+7. Nothing in this License Agreement shall be deemed to create any
+relationship of agency, partnership, or joint venture between PSF and
+Licensee.  This License Agreement does not grant permission to use PSF
+trademarks or trade name in a trademark sense to endorse or promote
+products or services of Licensee, or any third party.
+
+8. By copying, installing or otherwise using Python, Licensee
+agrees to be bound by the terms and conditions of this License
+Agreement.
+
+
+BEOPEN.COM LICENSE AGREEMENT FOR PYTHON 2.0
+-------------------------------------------
+
+BEOPEN PYTHON OPEN SOURCE LICENSE AGREEMENT VERSION 1
+
+1. This LICENSE AGREEMENT is between BeOpen.com ("BeOpen"), having an
+office at 160 Saratoga Avenue, Santa Clara, CA 95051, and the
+Individual or Organization ("Licensee") accessing and otherwise using
+this software in source or binary form and its associated
+documentation ("the Software").
+
+2. Subject to the terms and conditions of this BeOpen Python License
+Agreement, BeOpen hereby grants Licensee a non-exclusive,
+royalty-free, world-wide license to reproduce, analyze, test, perform
+and/or display publicly, prepare derivative works, distribute, and
+otherwise use the Software alone or in any derivative version,
+provided, however, that the BeOpen Python License is retained in the
+Software, alone or in any derivative version prepared by Licensee.
+
+3. BeOpen is making the Software available to Licensee on an "AS IS"
+basis.  BEOPEN MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
+IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, BEOPEN MAKES NO AND
+DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
+FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF THE SOFTWARE WILL NOT
+INFRINGE ANY THIRD PARTY RIGHTS.
+
+4. BEOPEN SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF THE
+SOFTWARE FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS
+AS A RESULT OF USING, MODIFYING OR DISTRIBUTING THE SOFTWARE, OR ANY
+DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
+
+5. This License Agreement will automatically terminate upon a material
+breach of its terms and conditions.
+
+6. This License Agreement shall be governed by and interpreted in all
+respects by the law of the State of California, excluding conflict of
+law provisions.  Nothing in this License Agreement shall be deemed to
+create any relationship of agency, partnership, or joint venture
+between BeOpen and Licensee.  This License Agreement does not grant
+permission to use BeOpen trademarks or trade names in a trademark
+sense to endorse or promote products or services of Licensee, or any
+third party.  As an exception, the "BeOpen Python" logos available at
+http://www.pythonlabs.com/logos.html may be used according to the
+permissions granted on that web page.
+
+7. By copying, installing or otherwise using the software, Licensee
+agrees to be bound by the terms and conditions of this License
+Agreement.
+
+
+CNRI LICENSE AGREEMENT FOR PYTHON 1.6.1
+---------------------------------------
+
+1. This LICENSE AGREEMENT is between the Corporation for National
+Research Initiatives, having an office at 1895 Preston White Drive,
+Reston, VA 20191 ("CNRI"), and the Individual or Organization
+("Licensee") accessing and otherwise using Python 1.6.1 software in
+source or binary form and its associated documentation.
+
+2. Subject to the terms and conditions of this License Agreement, CNRI
+hereby grants Licensee a nonexclusive, royalty-free, world-wide
+license to reproduce, analyze, test, perform and/or display publicly,
+prepare derivative works, distribute, and otherwise use Python 1.6.1
+alone or in any derivative version, provided, however, that CNRI's
+License Agreement and CNRI's notice of copyright, i.e., "Copyright (c)
+1995-2001 Corporation for National Research Initiatives; All Rights
+Reserved" are retained in Python 1.6.1 alone or in any derivative
+version prepared by Licensee.  Alternately, in lieu of CNRI's License
+Agreement, Licensee may substitute the following text (omitting the
+quotes): "Python 1.6.1 is made available subject to the terms and
+conditions in CNRI's License Agreement.  This Agreement together with
+Python 1.6.1 may be located on the Internet using the following
+unique, persistent identifier (known as a handle): 1895.22/1013.  This
+Agreement may also be obtained from a proxy server on the Internet
+using the following URL: http://hdl.handle.net/1895.22/1013";.
+
+3. In the event Licensee prepares a derivative work that is based on
+or incorporates Python 1.6.1 or any part thereof, and wants to make
+the derivative work available to others as provided herein, then
+Licensee hereby agrees to include in any such work a brief summary of
+the changes made to Python 1.6.1.
+
+4. CNRI is making Python 1.6.1 available to Licensee on an "AS IS"
+basis.  CNRI MAKES NO REPRESENTATIONS OR WARRANTIES, EXPRESS OR
+IMPLIED.  BY WAY OF EXAMPLE, BUT NOT LIMITATION, CNRI MAKES NO AND
+DISCLAIMS ANY REPRESENTATION OR WARRANTY OF MERCHANTABILITY OR FITNESS
+FOR ANY PARTICULAR PURPOSE OR THAT THE USE OF PYTHON 1.6.1 WILL NOT
+INFRINGE ANY THIRD PARTY RIGHTS.
+
+5. CNRI SHALL NOT BE LIABLE TO LICENSEE OR ANY OTHER USERS OF PYTHON
+1.6.1 FOR ANY INCIDENTAL, SPECIAL, OR CONSEQUENTIAL DAMAGES OR LOSS AS
+A RESULT OF MODIFYING, DISTRIBUTING, OR OTHERWISE USING PYTHON 1.6.1,
+OR ANY DERIVATIVE THEREOF, EVEN IF ADVISED OF THE POSSIBILITY THEREOF.
+
+6. This License Agreement will automatically terminate upon a material
+breach of its terms and conditions.
+
+7. This License Agreement shall be governed by the federal
+intellectual property law of the United States, including without
+limitation the federal copyright law, and, to the extent such
+U.S. federal law does not apply, by the law of the Commonwealth of
+Virginia, excluding Virginia's conflict of law provisions.
+Notwithstanding the foregoing, with regard to derivative works based
+on Python 1.6.1 that incorporate non-separable material that was
+previously distributed under the GNU General Public License (GPL), the
+law of the Commonwealth of Virginia shall govern this License
+Agreement only as to issues arising under or with respect to
+Paragraphs 4, 5, and 7 of this License Agreement.  Nothing in this
+License Agreement shall be deemed to create any relationship of
+agency, partnership, or joint venture between CNRI and Licensee.  This
+License Agreement does not grant permission to use CNRI trademarks or
+trade name in a trademark sense to endorse or promote products or
+services of Licensee, or any third party.
+
+8. By clicking on the "ACCEPT" button where indicated, or by copying,
+installing or otherwise using Python 1.6.1, Licensee agrees to be
+bound by the terms and conditions of this License Agreement.
+
+        ACCEPT
+
+
+CWI LICENSE AGREEMENT FOR PYTHON 0.9.0 THROUGH 1.2
+--------------------------------------------------
+
+Copyright (c) 1991 - 1995, Stichting Mathematisch Centrum Amsterdam,
+The Netherlands.  All rights reserved.
+
+Permission to use, copy, modify, and distribute this software and its
+documentation for any purpose and without fee is hereby granted,
+provided that the above copyright notice appear in all copies and that
+both that copyright notice and this permission notice appear in
+supporting documentation, and that the name of Stichting Mathematisch
+Centrum or CWI not be used in advertising or publicity pertaining to
+distribution of the software without specific, written prior
+permission.
+
+STICHTING MATHEMATISCH CENTRUM DISCLAIMS ALL WARRANTIES WITH REGARD TO
+THIS SOFTWARE, INCLUDING ALL IMPLIED WARRANTIES OF MERCHANTABILITY AND
+FITNESS, IN NO EVENT SHALL STICHTING MATHEMATISCH CENTRUM BE LIABLE
+FOR ANY SPECIAL, INDIRECT OR CONSEQUENTIAL DAMAGES OR ANY DAMAGES
+WHATSOEVER RESULTING FROM LOSS OF USE, DATA OR PROFITS, WHETHER IN AN
+ACTION OF CONTRACT, NEGLIGENCE OR OTHER TORTIOUS ACTION, ARISING OUT
+OF OR IN CONNECTION WITH THE USE OR PERFORMANCE OF THIS SOFTWARE.
diff --git a/sdks/python/apache_beam/io/filesystem.py 
b/sdks/python/apache_beam/io/filesystem.py
index ac19c517804..4b761b3f3a1 100644
--- a/sdks/python/apache_beam/io/filesystem.py
+++ b/sdks/python/apache_beam/io/filesystem.py
@@ -26,7 +26,6 @@
 
 import abc
 import bz2
-import fnmatch
 import io
 import logging
 import os
@@ -531,24 +530,117 @@ def _list(self, dir_or_prefix):
     """
     raise NotImplementedError
 
+  @staticmethod
+  def _split_scheme(url_or_path):
+    match = re.match(r'(^[a-z]+)://(.*)', url_or_path)
+    if match is not None:
+      return match.groups()
+    return None, url_or_path
+
+  @staticmethod
+  def _combine_scheme(scheme, path):
+    if scheme is None:
+      return path
+    return '{}://{}'.format(scheme, path)
+
   def _url_dirname(self, url_or_path):
     """Like posixpath.dirname, but preserves scheme:// prefix.
 
     Args:
       url_or_path: A string in the form of scheme://some/path OR /some/path.
     """
-    match = re.match(r'([a-z]+://)(.*)', url_or_path)
-    if match is None:
-      return posixpath.dirname(url_or_path)
-    url_prefix, path = match.groups()
-    return url_prefix + posixpath.dirname(path)
+    scheme, path = self._split_scheme(url_or_path)
+    return self._combine_scheme(scheme, posixpath.dirname(path))
+
+  def match_files(self, file_metas, pattern):
+    """Filter :class:`FileMetadata` objects by *pattern*
+
+    Args:
+      file_metas (list of :class:`FileMetadata`):
+        Files to consider when matching
+      pattern (str): File pattern
+
+    See Also:
+      :meth:`translate_pattern`
+
+    Returns:
+      Generator of matching :class:`FileMetadata`
+    """
+    re_pattern = re.compile(self.translate_pattern(pattern))
+    match = re_pattern.match
+    for file_metadata in file_metas:
+      if match(file_metadata.path):
+        yield file_metadata
+
+  @staticmethod
+  def translate_pattern(pattern):
+    """
+    Translate a *pattern* to a regular expression.
+    There is no way to quote meta-characters.
+
+    Pattern syntax:
+      The pattern syntax is based on the fnmatch_ syntax, with the following
+      differences:
+
+      -   ``*`` Is equivalent to ``[^/\\]*`` rather than ``.*``.
+      -   ``**`` Is equivalent to ``.*``.
+
+    See also:
+      :meth:`match` uses this method
+
+    This method is based on `Python 2.7's fnmatch.translate`_.
+    The code in this method is licensed under
+    PYTHON SOFTWARE FOUNDATION LICENSE VERSION 2.
+
+    .. _`fnmatch`: https://docs.python.org/2/library/fnmatch.html
+
+    .. _`Python 2.7's fnmatch.translate`: https://github.com/python/cpython\
+/blob/170ea8ccd4235d28538ab713041502d07ad1cacd/Lib/fnmatch.py#L85-L120
+    """
+    i, n = 0, len(pattern)
+    res = ''
+    while i < n:
+      c = pattern[i]
+      i = i + 1
+      if c == '*':
+        # One char lookahead for "**"
+        if i < n and pattern[i] == "*":
+          res = res + '.*'
+          i = i + 1
+        else:
+          res = res + r'[^/\\]*'
+      elif c == '?':
+        res = res + '.'
+      elif c == '[':
+        j = i
+        if j < n and pattern[j] == '!':
+          j = j + 1
+        if j < n and pattern[j] == ']':
+          j = j + 1
+        while j < n and pattern[j] != ']':
+          j = j + 1
+        if j >= n:
+          res = res + r'\['
+        else:
+          stuff = pattern[i:j].replace('\\', '\\\\')
+          i = j + 1
+          if stuff[0] == '!':
+            stuff = '^' + stuff[1:]
+          elif stuff[0] == '^':
+            stuff = '\\' + stuff
+          res = '%s[%s]' % (res, stuff)
+      else:
+        res = res + re.escape(c)
+
+    logger.debug('translate_pattern: %r -> %r', pattern, res)
+    return res + r'\Z(?ms)'
 
   def match(self, patterns, limits=None):
     """Find all matching paths to the patterns provided.
 
-    Pattern matching is done using fnmatch.fnmatch.
-    For filesystems that have directories, matching is not recursive. Patterns
-    like scheme://path/*/foo will not match anything.
+    See Also:
+      :meth:`translate_pattern`
+
     Patterns ending with '/' will be appended with '*'.
 
     Args:
@@ -583,15 +675,20 @@ def _match(pattern, limit):
           file_metadatas = [FileMetadata(pattern, self.size(pattern))]
       else:
         if self.has_dirs():
-          prefix_or_dir = self._url_dirname(prefix_or_dir)
+          prefix_dirname = self._url_dirname(prefix_or_dir)
+          if not prefix_dirname == prefix_or_dir:
+            logger.debug("Changed prefix_or_dir %r -> %r",
+                         prefix_or_dir, prefix_dirname)
+            prefix_or_dir = prefix_dirname
+
+        logger.debug("Listing files in %r", prefix_or_dir)
         file_metadatas = self._list(prefix_or_dir)
 
       metadata_list = []
-      for file_metadata in file_metadatas:
+      for file_metadata in self.match_files(file_metadatas, pattern):
         if limit is not None and len(metadata_list) >= limit:
           break
-        if fnmatch.fnmatch(file_metadata.path, pattern):
-          metadata_list.append(file_metadata)
+        metadata_list.append(file_metadata)
 
       return MatchResult(pattern, metadata_list)
 
diff --git a/sdks/python/apache_beam/io/filesystem_test.py 
b/sdks/python/apache_beam/io/filesystem_test.py
index 2954626069c..876ba7a38b8 100644
--- a/sdks/python/apache_beam/io/filesystem_test.py
+++ b/sdks/python/apache_beam/io/filesystem_test.py
@@ -23,13 +23,17 @@
 import bz2
 import gzip
 import logging
+import ntpath
 import os
+import posixpath
 import tempfile
 import unittest
 from builtins import range
 from io import BytesIO
 
 from future.utils import iteritems
+from parameterized import param
+from parameterized import parameterized
 
 from apache_beam.io.filesystem import CompressedFile
 from apache_beam.io.filesystem import CompressionTypes
@@ -109,8 +113,61 @@ def _flatten_match(self, match_results):
             for match_result in match_results
             for file_metadata in match_result.metadata_list]
 
-  def test_match_glob(self):
-    bucket_name = 'gcsio-test'
+  @parameterized.expand([
+      ('gs://gcsio-test/**', all),
+      # Does not match root-level files
+      ('gs://gcsio-test/**/*', lambda n, i: n not in ['cat.png']),
+      # Only matches root-level files
+      ('gs://gcsio-test/*', [
+          ('cat.png', 19)
+      ]),
+      ('gs://gcsio-test/cow/**', [
+          ('cow/cat/fish', 2),
+          ('cow/cat/blubber', 3),
+          ('cow/dog/blubber', 4),
+      ]),
+      ('gs://gcsio-test/cow/ca**', [
+          ('cow/cat/fish', 2),
+          ('cow/cat/blubber', 3),
+      ]),
+      ('gs://gcsio-test/apple/[df]ish/ca*', [
+          ('apple/fish/cat', 10),
+          ('apple/fish/cart', 11),
+          ('apple/fish/carl', 12),
+          ('apple/dish/cat', 14),
+          ('apple/dish/carl', 15),
+      ]),
+      ('gs://gcsio-test/apple/?ish/?a?', [
+          ('apple/fish/cat', 10),
+          ('apple/dish/bat', 13),
+          ('apple/dish/cat', 14),
+      ]),
+      ('gs://gcsio-test/apple/fish/car?', [
+          ('apple/fish/cart', 11),
+          ('apple/fish/carl', 12),
+      ]),
+      ('gs://gcsio-test/apple/fish/b*', [
+          ('apple/fish/blubber', 6),
+          ('apple/fish/blowfish', 7),
+          ('apple/fish/bambi', 8),
+          ('apple/fish/balloon', 9),
+      ]),
+      ('gs://gcsio-test/apple/f*/b*', [
+          ('apple/fish/blubber', 6),
+          ('apple/fish/blowfish', 7),
+          ('apple/fish/bambi', 8),
+          ('apple/fish/balloon', 9),
+      ]),
+      ('gs://gcsio-test/apple/dish/[cb]at', [
+          ('apple/dish/bat', 13),
+          ('apple/dish/cat', 14),
+      ]),
+      ('gs://gcsio-test/banana/cyrano.m?', [
+          ('banana/cyrano.md', 17),
+          ('banana/cyrano.mb', 18),
+      ]),
+  ])
+  def test_match_glob(self, file_pattern, expected_object_names):
     objects = [
         ('cow/cat/fish', 2),
         ('cow/cat/blubber', 3),
@@ -126,66 +183,69 @@ def test_match_glob(self):
         ('apple/dish/bat', 13),
         ('apple/dish/cat', 14),
         ('apple/dish/carl', 15),
+        ('banana/cat', 16),
+        ('banana/cyrano.md', 17),
+        ('banana/cyrano.mb', 18),
+        ('cat.png', 19)
     ]
-    for (object_name, size) in objects:
+    bucket_name = 'gcsio-test'
+
+    if callable(expected_object_names):
+      # A hack around the fact that the parameters do not have access to
+      # the "objects" list.
+
+      if expected_object_names is all:
+        # It's a placeholder for "all" objects
+        expected_object_names = objects
+      else:
+        # It's a filter function of type (str, int) -> bool
+        # that returns true for expected objects
+        filter_func = expected_object_names
+        expected_object_names = [
+            (short_path, size) for short_path, size in objects
+            if filter_func(short_path, size)
+        ]
+
+    for object_name, size in objects:
       file_name = 'gs://%s/%s' % (bucket_name, object_name)
       self.fs._insert_random_file(file_name, size)
-    test_cases = [
-        ('gs://*', objects),
-        ('gs://gcsio-test/*', objects),
-        ('gs://gcsio-test/cow/*', [
-            ('cow/cat/fish', 2),
-            ('cow/cat/blubber', 3),
-            ('cow/dog/blubber', 4),
-        ]),
-        ('gs://gcsio-test/cow/ca*', [
-            ('cow/cat/fish', 2),
-            ('cow/cat/blubber', 3),
-        ]),
-        ('gs://gcsio-test/apple/[df]ish/ca*', [
-            ('apple/fish/cat', 10),
-            ('apple/fish/cart', 11),
-            ('apple/fish/carl', 12),
-            ('apple/dish/cat', 14),
-            ('apple/dish/carl', 15),
-        ]),
-        ('gs://gcsio-test/apple/fish/car?', [
-            ('apple/fish/cart', 11),
-            ('apple/fish/carl', 12),
-        ]),
-        ('gs://gcsio-test/apple/fish/b*', [
-            ('apple/fish/blubber', 6),
-            ('apple/fish/blowfish', 7),
-            ('apple/fish/bambi', 8),
-            ('apple/fish/balloon', 9),
-        ]),
-        ('gs://gcsio-test/apple/f*/b*', [
-            ('apple/fish/blubber', 6),
-            ('apple/fish/blowfish', 7),
-            ('apple/fish/bambi', 8),
-            ('apple/fish/balloon', 9),
-        ]),
-        ('gs://gcsio-test/apple/dish/[cb]at', [
-            ('apple/dish/bat', 13),
-            ('apple/dish/cat', 14),
-        ]),
+
+    expected_file_names = [('gs://%s/%s' % (bucket_name, object_name), size)
+                           for object_name, size in expected_object_names]
+    actual_file_names = [
+        (file_metadata.path, file_metadata.size_in_bytes)
+        for file_metadata in self._flatten_match(self.fs.match([file_pattern]))
     ]
-    for file_pattern, expected_object_names in test_cases:
-      expected_file_names = [('gs://%s/%s' % (bucket_name, object_name), size)
-                             for (object_name, size) in expected_object_names]
-      self.assertEqual(
-          set([(file_metadata.path, file_metadata.size_in_bytes)
-               for file_metadata in
-               self._flatten_match(self.fs.match([file_pattern]))]),
-          set(expected_file_names))
+
+    self.assertEqual(set(actual_file_names), set(expected_file_names))
 
     # Check if limits are followed correctly
     limit = 3
-    for file_pattern, expected_object_names in test_cases:
-      expected_num_items = min(len(expected_object_names), limit)
-      self.assertEqual(
-          len(self._flatten_match(self.fs.match([file_pattern], [limit]))),
-          expected_num_items)
+    expected_num_items = min(len(expected_object_names), limit)
+    self.assertEqual(
+        len(self._flatten_match(self.fs.match([file_pattern], [limit]))),
+        expected_num_items)
+
+  @parameterized.expand([
+      param(os_path=posixpath, sep_re='\\/'),
+      param(os_path=ntpath, sep_re='\\\\'),
+  ])
+  def test_translate_pattern(self, os_path, sep_re):
+    star = r'[^/\\]*'
+    double_star = r'.*'
+    join = os_path.join
+
+    sep = os_path.sep
+    pattern__expected = [
+        (join('a', '*'), sep_re.join(['a', star])),
+        (join('b', '*') + sep, sep_re.join(['b', star]) + sep_re),
+        (r'*[abc\]', star + r'[abc\\]'),
+        (join('d', '**', '*'), sep_re.join(['d', double_star, star])),
+    ]
+    for pattern, expected in pattern__expected:
+      expected += r'\Z(?ms)'
+      result = self.fs.translate_pattern(pattern)
+      self.assertEqual(result, expected)
 
 
 class TestFileSystemWithDirs(TestFileSystem):
diff --git a/sdks/python/apache_beam/io/filesystems.py 
b/sdks/python/apache_beam/io/filesystems.py
index 5d9c1fe98b3..d8b3a4a5454 100644
--- a/sdks/python/apache_beam/io/filesystems.py
+++ b/sdks/python/apache_beam/io/filesystems.py
@@ -151,10 +151,26 @@ def mkdirs(path):
   def match(patterns, limits=None):
     """Find all matching paths to the patterns provided.
 
-    Pattern matching is done using fnmatch.fnmatch.
-    For filesystems that have directories, matching is not recursive. Patterns
-    like scheme://path/*/foo will not match anything.
-    Patterns ending with '/' will be appended with '*'.
+    Pattern matching is done using each filesystem's ``match`` method (e.g.
+    :meth:`.filesystem.FileSystem.match`).
+
+    .. note::
+      - Depending on the :class:`.FileSystem` implementation, file listings
+        (the ``.FileSystem._list`` method) may not be recursive.
+      - If the file listing is not recursive, a pattern like
+        ``scheme://path/*/foo`` will not be able to mach any files.
+
+    See Also:
+      :meth:`.filesystem.FileSystem.match`
+
+    Pattern syntax:
+      The pattern syntax is based on the fnmatch_ syntax, with the following
+      differences:
+
+      -   ``*`` Is equivalent to ``[^/\\]*`` rather than ``.*``.
+      -   ``**`` Is equivalent to ``.*``.
+
+    .. _`fnmatch`: https://docs.python.org/2/library/fnmatch.html
 
     Args:
       patterns: list of string for the file path pattern to match against
diff --git a/sdks/python/apache_beam/io/localfilesystem.py 
b/sdks/python/apache_beam/io/localfilesystem.py
index ddd5022b2bb..11457746774 100644
--- a/sdks/python/apache_beam/io/localfilesystem.py
+++ b/sdks/python/apache_beam/io/localfilesystem.py
@@ -111,9 +111,13 @@ def _list(self, dir_or_prefix):
     if not self.exists(dir_or_prefix):
       return
 
+    def list_files(root):
+      for dirpath, _, files in os.walk(root):
+        for filename in files:
+          yield self.join(dirpath, filename)
+
     try:
-      for f in os.listdir(dir_or_prefix):
-        f = self.join(dir_or_prefix, f)
+      for f in list_files(dir_or_prefix):
         try:
           yield FileMetadata(f, os.path.getsize(f))
         except OSError:
diff --git a/sdks/python/apache_beam/io/localfilesystem_test.py 
b/sdks/python/apache_beam/io/localfilesystem_test.py
index bc45e4221dd..cb23e0e265c 100644
--- a/sdks/python/apache_beam/io/localfilesystem_test.py
+++ b/sdks/python/apache_beam/io/localfilesystem_test.py
@@ -28,6 +28,8 @@
 import unittest
 
 import mock
+from parameterized import param
+from parameterized import parameterized
 
 from apache_beam.io import localfilesystem
 from apache_beam.io.filesystem import BeamIOError
@@ -150,17 +152,40 @@ def test_match_file_exception(self):
       self.fs.match([None])
     self.assertEqual(list(error.exception.exception_details.keys()), [None])
 
-  def test_match_glob(self):
-    path1 = os.path.join(self.tmpdir, 'f1')
-    path2 = os.path.join(self.tmpdir, 'f2')
-    open(path1, 'a').close()
-    open(path2, 'a').close()
+  @parameterized.expand([
+      param('*',
+            files=['a', 'b', 'c/x'],
+            expected=['a', 'b']),
+      param('**',
+            files=['a', 'b/x', 'c/x'],
+            expected=['a', 'b/x', 'c/x']),
+      param('*/*',
+            files=['a', 'b/x', 'c/x', 'd/x/y'],
+            expected=['b/x', 'c/x']),
+      param('**/*',
+            files=['a', 'b/x', 'c/x', 'd/x/y'],
+            expected=['b/x', 'c/x', 'd/x/y']),
+  ])
+  def test_match_glob(self, pattern, files, expected):
+    for filename in files:
+      full_path = os.path.join(self.tmpdir, filename)
+      dirname = os.path.dirname(full_path)
+      if not dirname == full_path:
+        # Make sure we don't go outside the tmpdir
+        assert os.path.commonprefix([self.tmpdir, full_path]) == self.tmpdir
+        try:
+          self.fs.mkdirs(dirname)
+        except IOError:
+          # Directory exists
+          pass
+
+      open(full_path, 'a').close()  # create empty file
 
     # Match both the files in the directory
-    path = os.path.join(self.tmpdir, '*')
-    result = self.fs.match([path])[0]
-    files = [f.path for f in result.metadata_list]
-    self.assertItemsEqual(files, [path1, path2])
+    full_pattern = os.path.join(self.tmpdir, pattern)
+    result = self.fs.match([full_pattern])[0]
+    files = [os.path.relpath(f.path, self.tmpdir) for f in 
result.metadata_list]
+    self.assertItemsEqual(files, expected)
 
   def test_match_directory(self):
     result = self.fs.match([self.tmpdir])[0]
diff --git a/sdks/python/setup.py b/sdks/python/setup.py
index 22e1adeacb9..2e955242701 100644
--- a/sdks/python/setup.py
+++ b/sdks/python/setup.py
@@ -131,6 +131,7 @@ def get_version():
 
 REQUIRED_TEST_PACKAGES = [
     'nose>=1.3.7',
+    'parameterized~=0.6.0',
     'numpy>=1.14.3,<2',
     'pyhamcrest>=1.9,<2.0',
     ]


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 150795)
    Time Spent: 6.5h  (was: 6h 20m)

> FileSystems.match behaviour diff between GCS and local file system
> ------------------------------------------------------------------
>
>                 Key: BEAM-5417
>                 URL: https://issues.apache.org/jira/browse/BEAM-5417
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.5.0, 2.6.0
>            Reporter: Joar Wandborg
>            Assignee: Chamikara Jayalath
>            Priority: Major
>          Time Spent: 6.5h
>  Remaining Estimate: 0h
>
> Given the directory structure:
>  
> {noformat}
> .
> ├── filesystem-match-test
> │   ├── a
> │   │   └── file.txt
> │   └── b
> │       └── file.txt
> └── filesystem-match-test.py
> {noformat}
>  
> Where {{filesystem-match-test.py}} contains:
> {code:python}
> from __future__ import print_function
> import os
> import posixpath
> from apache_beam.io.filesystem import MatchResult
> from apache_beam.io.filesystems import FileSystems
> BASES = [
>     os.path.join(os.path.dirname(__file__), "./"),
>     "gs://my-bucket/test/",
> ]
> pattern = "filesystem-match-test/*/file.txt"
> for base_path in BASES:
>     full_pattern = posixpath.join(base_path, pattern)
>     print("full_pattern: {}".format(full_pattern))
>     match_result = FileSystems.match([full_pattern])[0]  # type: MatchResult
>     print("metadata list: {}".format(match_result.metadata_list))
> {code}
> Running {{python filesystem-match-test.py}} does not match any files locally, 
> but does match files on GCS:
> {noformat}
> full_pattern: ./filesystem-match-test/*/file.txt
> metadata list: []
> full_pattern: gs://my-bucket/test/filesystem-match-test/*/file.txt
> metadata list: 
> [FileMetadata(gs://my-bucket/test/filesystem-match-test/a/file.txt, 6), 
> FileMetadata(gs://my-bucket/test/filesystem-match-test/b/file.txt, 6)]
> {noformat}
> The expected result is that a/file.txt and b/file.txt should be matched for 
> both patterns.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

Reply via email to