[ https://issues.apache.org/jira/browse/BEAM-2338?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16021557#comment-16021557 ]
ASF GitHub Bot commented on BEAM-2338:
--------------------------------------
GitHub user sb2nov opened a pull request:
https://github.com/apache/beam/pull/3210
[BEAM-2338] Fix the limit counter in gcsio reads
---
R: @chamikaramj PTAL
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/sb2nov/beam BEAM-2338-fix-limit-counter-in-gcsio
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/beam/pull/3210.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3210
----
commit b5e380f16520fed813c1e0fa78128948381a1760
Author: Sourabh Bajaj <[email protected]>
Date: 2017-05-23T18:02:59Z
[BEAM-2338] Fix the limit counter in gcsio reads
----
> GCS filepattern wildcard broken in Python SDK
> ---------------------------------------------
>
> Key: BEAM-2338
> URL: https://issues.apache.org/jira/browse/BEAM-2338
> Project: Beam
> Issue Type: Bug
> Components: beam-model
> Affects Versions: 2.0.0
> Reporter: Vilhelm von Ehrenheim
> Assignee: Frances Perry
>
> Validation of file patterns containing a wildcard (`*`) in GCS directories
> does not always work.
> Some kinds of patterns generate an error here during validation:
> https://github.com/apache/beam/blob/v2.0.0/sdks/python/apache_beam/io/filebasedsource.py#L168
> I've tried a few different FileSystems.match calls, and the results confuse
> me a bit.
> Full path works:
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF'],
> ...                   limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
> {noformat}
> A glob star on the directory does not:
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342016259LGN00_B1.TIF'],
> ...                   limits=[1])[0].metadata_list
> []
> {noformat}
> Adding a star at the file level, so that we search only for TIF files, works
> (although we match a different file, which is fine):
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/*.TIF'],
> ...                   limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
> {noformat}
> OK, here comes an even stranger case.
> Looking for the same file we just found with the previous pattern, but with
> a star on the directory, we do find it:
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/*/LC80440342013106LGN01_B1.TIF'],
> ...                   limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF, 65862791)]
> {noformat}
> Returning to the first case, we do get a match if the star is placed late
> enough in the pattern to make the directory unique:
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF'],
> ...                   limits=[1])[0].metadata_list
> [FileMetadata(gs://gcp-public-data-landsat/LC08/PRE/044/034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF, 74721736)]
> {noformat}
> but not if the star is placed earlier in the directory name:
> {noformat}
> >>> FileSystems.match(['gs://gcp-public-data-landsat/LC08/PRE/044/034/LC8044034201*/LC80440342016259LGN00_B1.TIF'],
> ...                   limits=[1])[0].metadata_list
> []
> {noformat}
> My guess is that some folders are being dropped from the list of matched
> directories, which is a bit concerning.
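
The pull request title points at the limit counter in gcsio reads. The sketch below shows one way such a counter could produce the results quoted above. It assumes (the issue text does not confirm this) that the limit was applied to every listed object rather than only to objects matching the pattern, and that objects are listed by the literal prefix before the first wildcard. The function names and the two-entry listing are hypothetical, and this is not the Beam implementation.

```python
import fnmatch


def listing_prefix(pattern):
    """Return the literal part of the pattern before the first wildcard."""
    for i, ch in enumerate(pattern):
        if ch in "*?[":
            return pattern[:i]
    return pattern


def match_counting_all(names, pattern, limit=None):
    """Assumed faulty variant: every listed name counts toward the limit."""
    prefix = listing_prefix(pattern)
    matched = []
    counter = 0
    for name in names:
        if not name.startswith(prefix):
            continue  # a prefix-restricted listing would not return this name
        if fnmatch.fnmatchcase(name, pattern):
            matched.append(name)
        counter += 1  # incremented whether or not the name matched
        if limit is not None and counter >= limit:
            break
    return matched


def match_counting_matches(names, pattern, limit=None):
    """Variant where only matching names count toward the limit."""
    prefix = listing_prefix(pattern)
    matched = []
    for name in names:
        if not name.startswith(prefix):
            continue
        if fnmatch.fnmatchcase(name, pattern):
            matched.append(name)
            if limit is not None and len(matched) >= limit:
                break
    return matched


# Hypothetical listing in lexicographic order, shortened from the report's paths.
LISTING = [
    "034/LC80440342013106LGN01/LC80440342013106LGN01_B1.TIF",
    "034/LC80440342016259LGN00/LC80440342016259LGN00_B1.TIF",
]

if __name__ == "__main__":
    for pat in [
        "034/*/LC80440342016259LGN00_B1.TIF",
        "034/*/LC80440342013106LGN01_B1.TIF",
        "034/LC80440342016259LGN*/LC80440342016259LGN00_B1.TIF",
        "034/LC8044034201*/LC80440342016259LGN00_B1.TIF",
    ]:
        print(pat)
        print("  counting all:    ", match_counting_all(LISTING, pat, limit=1))
        print("  counting matches:", match_counting_matches(LISTING, pat, limit=1))
```

With `limit=1`, the first variant returns an empty list whenever the first listed object under the prefix is not a match, which lines up with the empty results for `034/*/...2016259LGN00_B1.TIF` and `LC8044034201*`. It still succeeds when the prefix is narrow enough that the wanted file is listed first. The second variant returns the expected file for all four patterns.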
--
This message was sent by Atlassian JIRA
(v6.3.15#6346)