gianm commented on code in PR #13027:
URL: https://github.com/apache/druid/pull/13027#discussion_r992875793
##########
core/src/main/java/org/apache/druid/data/input/impl/CloudObjectInputSource.java:
##########
@@ -108,9 +112,40 @@ public List<CloudObjectLocation> getObjects()
@Nullable
@JsonProperty
@JsonInclude(JsonInclude.Include.NON_NULL)
- public String getFilter()
+ public String getObjectGlob()
{
- return filter;
+ return objectGlob;
+ }
+
+ /**
+ * Strips out blob store's protocol and bucket from {@link #getObjectGlob}.
+ * This is to reduce user errors when crafting a objectGlob.
Review Comment:
I think we shouldn't remove the bucket without verifying that it actually
matches. This leads to surprising outcomes like the glob `s3://foo/*.txt`
matching the file `s3://bar/x.txt`. I think we have a couple of options:
- Don't remove the protocol and bucket: if the user provides the glob
`s3://foo/*.txt` then it matches nothing.
- Be "smart": apply the glob to the entire URI if it starts with a protocol;
else apply it only to the object.
IMO, the first option is best. My experience with "smart" features is they
end up being more confusing than they're worth.
##########
core/src/test/java/org/apache/druid/data/input/impl/CloudObjectInputSourceTest.java:
##########
@@ -200,7 +205,72 @@ public void testWithObjects()
List<CloudObjectLocation> returnedLocations =
splits.map(InputSplit::get).collect(Collectors.toList()).get(0);
- Assert.assertEquals(null, inputSource.getFilter());
+ Assert.assertEquals(null, inputSource.getObjectGlob());
Assert.assertEquals(OBJECTS, returnedLocations);
}
+
+ @Test
+ public void testGlobSubdirectories()
+ {
+ PathMatcher m = FileSystems.getDefault().getPathMatcher("glob:**.parquet");
+ Assert.assertTrue(m.matches(Paths.get("db/date=2022-08-01/001.parquet")));
+ Assert.assertTrue(m.matches(Paths.get("db/date=2022-08-01/002.parquet")));
+
+ PathMatcher m2 =
FileSystems.getDefault().getPathMatcher("glob:db/date=2022-08-01/*.parquet");
+ Assert.assertTrue(m2.matches(Paths.get("db/date=2022-08-01/001.parquet")));
+
Assert.assertFalse(m2.matches(Paths.get("db/date=2022-08-01/_junk/0/001.parquet")));
+ }
+
+ @Test
+ public void testGlobSubdirectories2()
Review Comment:
This test cast isn't needed: we don't need to test this library that we
aren't using.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]