[
https://issues.apache.org/jira/browse/BEAM-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996790#comment-15996790
]
Devon Meunier commented on BEAM-2150:
-------------------------------------
This is what I went off:
https://github.com/GoogleCloudPlatform/gsutil/blob/cfe99899910375bf3695b7e4a119ee0074110259/gslib/addlhelp/wildcards.py
And then I ran some tests with {{gsutil}} using the {{-D}} flag to see the
kinds of requests that were being made.
When you do normal (non-recursive) globbing, {{gsutil}} passes
{{delimiter=%2F}}, which truncates the results from the API so that you get
{{prefix[^delimiter]*[delimiter]}}. However, this isn't what Beam does: if you
work against a bucket with a lot of files under the prefix, you can see that it
pages through every single object key and then filters down. The end behaviour
is the same because of how we filter by regex, and that made it really easy to
relax the regex.
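To illustrate the "list everything, then filter by regex" behaviour, here is a rough sketch in Python. It is not Beam's code: {{glob_to_regex}} and {{filter_keys}} are hypothetical helpers, and the translation rules ({{*}} stops at {{/}}, {{**}} crosses it) are my reading of the gsutil wildcard help rather than a copy of either implementation.

```python
import re


def glob_to_regex(glob: str) -> str:
    """Translate a glob into a regex string (hypothetical helper, not Beam's)."""
    out = []
    i = 0
    while i < len(glob):
        c = glob[i]
        if c == "*":
            if glob.startswith("**", i):
                out.append(".*")  # recursive wildcard: may cross '/'
                i += 2
                continue
            out.append("[^/]*")  # single-level wildcard: stops at '/'
        elif c == "?":
            out.append("[^/]")  # exactly one non-delimiter character
        else:
            out.append(re.escape(c))  # everything else is literal
        i += 1
    return "".join(out)


def filter_keys(keys, glob):
    """Keep only the keys that match the glob in full."""
    pattern = re.compile(glob_to_regex(glob))
    return [k for k in keys if pattern.fullmatch(k)]
```

With this shape, supporting recursion only means emitting {{.*}} for {{**}}; the listing step does not change, which is why relaxing the regex was the easy part.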
One approach would be to specify a delimiter
[here|https://github.com/apache/beam/blob/f43b61af4d5a3ee77a610d8b11ef80d421c34501/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/util/GcsUtil.java#L371]
so that we actually see some efficiency gains when not using recursive
globbing, but I figured that could happen in a follow-up PR.
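For anyone wondering what the delimiter would buy us, here is a small in-memory model of a prefix listing, again in Python and again only a sketch. {{list_objects}} is a hypothetical function, not the GCS client API; it just mimics how a listing with a delimiter returns the objects directly under the prefix plus the collapsed sub-prefixes, instead of every key.

```python
def list_objects(keys, prefix, delimiter=None):
    """In-memory model of a prefix listing (hypothetical, not the GCS client)."""
    items, prefixes = [], []
    for key in sorted(keys):
        if not key.startswith(prefix):
            continue
        rest = key[len(prefix):]
        if delimiter and delimiter in rest:
            # Collapse everything past the first delimiter into one entry,
            # i.e. prefix[^delimiter]*[delimiter].
            end = rest.index(delimiter) + len(delimiter)
            collapsed = prefix + rest[:end]
            if collapsed not in prefixes:
                prefixes.append(collapsed)
        else:
            items.append(key)
    return items, prefixes
```

Without a delimiter every key under the prefix comes back and has to be filtered afterwards; with {{delimiter="/"}} the nested keys collapse into a single prefix entry, which is where the savings for non-recursive globs would come from.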
> Support for recursive wildcards in GcsPath
> ------------------------------------------
>
> Key: BEAM-2150
> URL: https://issues.apache.org/jira/browse/BEAM-2150
> Project: Beam
> Issue Type: New Feature
> Components: sdk-java-core, sdk-java-gcp
> Reporter: Devon Meunier
> Assignee: Devon Meunier
> Priority: Minor
>
> When working with heavily nested folder structures in Google Cloud Storage,
> it's great to make use of recursive wildcards, which the current API
> explicitly does not support.
> This code hasn't been touched in 2 years so it's likely that simply no one's
> gotten around to it yet.