[GitHub] [druid] gianm commented on pull request #13027: Use standard library to correctly glob and stop at the correct folder structure when filtering cloud objects

GitBox Mon, 12 Sep 2022 17:16:40 -0700


gianm commented on PR #13027:
URL: https://github.com/apache/druid/pull/13027#issuecomment-1244734899


   > @gianm I would say it is quite common when people are massaging data using 
Spark into Iceberg.
   
   @didip would you mind giving an example of how people would use the filter 
glob to read Iceberg data? I'm not familiar with how Iceberg stores data, so 
this would help me understand how the feature is likely to be used.
   
   I continue to be concerned that whole-path globs are confusing, so if we 
ship the feature, we should be really clear about how it works. The docs 
should explain exactly which string is used as the path for the match. For 
example, if your prefix is `s3://mybucket/myprefix`, and there is an object 
`s3://mybucket/myprefix/foo/bar.txt` then is the path `/foo/bar.txt` (the part 
after the prefix) or is it `foo/bar.txt` (the part after the prefix, with 
leading `/` stripped), or is it `myprefix/foo/bar.txt` (the entire S3 object 
key)?
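
   To illustrate why the choice of path string matters, here is a minimal sketch that runs one glob against the three candidate strings above. It assumes the matching is done with `java.nio.file.PathMatcher` glob syntax on a Unix-like default filesystem; the class `GlobCandidateDemo` and helper `globMatches` are hypothetical names used only for illustration, not code from the PR.

```java
import java.nio.file.FileSystems;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class GlobCandidateDemo {
    // Hypothetical helper: true if `glob` matches `candidate` under java.nio glob rules.
    // Assumption: the default filesystem uses `/` as its separator (Unix-like).
    static boolean globMatches(String glob, String candidate) {
        PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:" + glob);
        return matcher.matches(Paths.get(candidate));
    }

    public static void main(String[] args) {
        String glob = "foo/*.txt";
        // The three possible "paths" for s3://mybucket/myprefix/foo/bar.txt.
        String[] candidates = {"/foo/bar.txt", "foo/bar.txt", "myprefix/foo/bar.txt"};
        for (String candidate : candidates) {
            System.out.println(candidate + " -> " + globMatches(glob, candidate));
        }
    }
}
```

   Under these assumptions, the glob `foo/*.txt` should match only the `foo/bar.txt` form, because the pattern is anchored to the whole string and `*` does not cross a `/`. The same user-supplied filter therefore selects different objects depending on which string the implementation matches against.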
   
   We could also offer both `filter` (name glob) and `pathFilter` (whole-path 
glob) options. That way, consistency with the `local` input source is 
preserved (its `filter` applies to filenames only), and users with simple use 
cases that don't involve Iceberg integration get more intuitive name-based 
matching.
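
   A rough sketch of how the two options could differ, again assuming `java.nio.file.PathMatcher` glob semantics on a Unix-like filesystem. The class `FilterModesDemo` and the methods `matchesName` and `matchesPath` are hypothetical and only stand in for the proposed `filter` and `pathFilter` behavior; they are not Druid APIs.

```java
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.nio.file.PathMatcher;
import java.nio.file.Paths;

public class FilterModesDemo {
    private static PathMatcher glob(String pattern) {
        return FileSystems.getDefault().getPathMatcher("glob:" + pattern);
    }

    // Stand-in for a name-based `filter`: only the last path element is matched.
    static boolean matchesName(String nameGlob, String objectPath) {
        Path name = Paths.get(objectPath).getFileName();
        return name != null && glob(nameGlob).matches(name);
    }

    // Stand-in for a whole-path `pathFilter`: the entire path string is matched.
    static boolean matchesPath(String pathGlob, String objectPath) {
        return glob(pathGlob).matches(Paths.get(objectPath));
    }

    public static void main(String[] args) {
        String objectPath = "foo/bar.txt";
        System.out.println("name glob *.txt:      " + matchesName("*.txt", objectPath));
        System.out.println("path glob *.txt:      " + matchesPath("*.txt", objectPath));
        System.out.println("path glob foo/*.txt:  " + matchesPath("foo/*.txt", objectPath));
    }
}
```

   With name-based matching, `*.txt` selects `foo/bar.txt` regardless of its directory. With whole-path matching, the same pattern does not, and the user has to write something like `foo/*.txt` or `**.txt` instead.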


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]