[ 
https://issues.apache.org/jira/browse/BEAM-2150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15996790#comment-15996790
 ] 

Devon Meunier commented on BEAM-2150:
-------------------------------------

This is what I went off (the gsutil wildcard help text): 
https://github.com/GoogleCloudPlatform/gsutil/blob/cfe99899910375bf3695b7e4a119ee0074110259/gslib/addlhelp/wildcards.py

And then I ran some tests with {{gsutil}} using the {{-D}} flag to see the 
kinds of requests that were being made.

When you're doing normal (non-recursive) globbing, {{gsutil}} passes 
{{delimiter=%2F}}, which truncates the results from the API so that you get 
{{prefix[^delimiter]*[delimiter]}}. However, this isn't actually what Beam is 
doing: if you run it against a bucket with a lot of files under the prefix, you 
can see that it pages through every single object key and then filters them 
down. The end behaviour is the same because of how we filter by regex, but it 
also made it really easy to relax the regex.
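
To illustrate what "relaxing the regex" can look like, here is a minimal sketch of a glob-to-regex translation in which {{*}} stays within one path segment and {{**}} crosses {{/}} boundaries. This is not the actual {{GcsUtil}} code; the class and method names ({{GlobSketch}}, {{globToRegex}}, {{matches}}) are hypothetical, and only {{*}} and {{**}} are handled.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Beam: translates a glob into a regex.
final class GlobSketch {

    // "*" matches any run of characters except '/', "**" matches any run
    // of characters including '/'. Every other character is matched literally.
    static String globToRegex(String glob) {
        StringBuilder regex = new StringBuilder();
        int i = 0;
        while (i < glob.length()) {
            char c = glob.charAt(i);
            if (c == '*') {
                boolean doubleStar = i + 1 < glob.length() && glob.charAt(i + 1) == '*';
                if (doubleStar) {
                    regex.append(".*");     // recursive: may span several "directories"
                    i += 2;
                } else {
                    regex.append("[^/]*");  // non-recursive: stops at the next '/'
                    i += 1;
                }
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
                i += 1;
            }
        }
        return regex.toString();
    }

    // Filters a single object key against the glob, as a client-side filter would.
    static boolean matches(String glob, String key) {
        return Pattern.matches(globToRegex(glob), key);
    }
}
```

With this sketch, {{logs/*.txt}} matches {{logs/a.txt}} but not {{logs/2017/a.txt}}, while {{logs/**.txt}} matches both.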

One approach would be to specify a delimiter 
[here|https://github.com/apache/beam/blob/f43b61af4d5a3ee77a610d8b11ef80d421c34501/sdks/java/extensions/google-cloud-platform-core/src/main/java/org/apache/beam/sdk/util/GcsUtil.java#L371]
 so that we actually see some efficiency gains when not using recursive 
globbing, but I figured that could happen in a followup PR.
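
To show why a delimiter would help, here is a small in-memory simulation of how a listing with a prefix and a delimiter collapses deeper keys into a single "directory" entry. It makes no calls to the real GCS API; {{DelimiterListingSketch}} and {{list}} are hypothetical names, and for brevity the sketch returns objects and collapsed prefixes in one sorted list rather than separately.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.TreeSet;

// Hypothetical in-memory model of a prefix + delimiter listing.
final class DelimiterListingSketch {

    // Returns the keys under 'prefix'. When 'delimiter' is non-null, any key that
    // contains the delimiter after the prefix is truncated just past the first
    // such delimiter, so a whole subtree collapses into one entry.
    static List<String> list(Collection<String> keys, String prefix, String delimiter) {
        TreeSet<String> results = new TreeSet<>();
        for (String key : keys) {
            if (!key.startsWith(prefix)) {
                continue;
            }
            int idx = (delimiter == null) ? -1 : key.indexOf(delimiter, prefix.length());
            if (idx < 0) {
                results.add(key);                                        // direct child
            } else {
                results.add(key.substring(0, idx + delimiter.length())); // collapsed prefix
            }
        }
        return new ArrayList<>(results);
    }
}
```

For a bucket holding {{logs/a.txt}}, {{logs/2017/b.txt}} and {{logs/2017/05/c.txt}}, listing {{logs/}} with delimiter {{/}} yields only {{logs/2017/}} and {{logs/a.txt}}, whereas listing without a delimiter yields all three keys. That difference is where the saving for non-recursive globs would come from.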

> Support for recursive wildcards in GcsPath
> ------------------------------------------
>
>                 Key: BEAM-2150
>                 URL: https://issues.apache.org/jira/browse/BEAM-2150
>             Project: Beam
>          Issue Type: New Feature
>          Components: sdk-java-core, sdk-java-gcp
>            Reporter: Devon Meunier
>            Assignee: Devon Meunier
>            Priority: Minor
>
> When working with heavily nested folder structures in Google Cloud Storage, 
> it's great to make use of recursive wildcards, which the current API 
> explicitly does not support.
> This code hasn't been touched in 2 years so it's likely that simply no one's 
> gotten around to it yet.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
