khandelwal-prateek commented on code in PR #4069: URL: https://github.com/apache/gobblin/pull/4069#discussion_r1809763385
########## gobblin-data-management/src/test/java/org/apache/gobblin/data/management/copy/hive/HiveDatasetFinderTest.java: ########## @@ -215,6 +215,32 @@ public void testDatasetConfig() throws Exception { } + @Test + public void testHiveTableFolderFilter() throws Exception { + List<HiveDatasetFinder.DbAndTable> dbAndTables = Lists.newArrayList(); + dbAndTables.add(new HiveDatasetFinder.DbAndTable("db1", "table1")); + HiveMetastoreClientPool pool = getTestPool(dbAndTables); + + Properties properties = new Properties(); + properties.put(HiveDatasetFinder.HIVE_DATASET_PREFIX + "." + WhitelistBlacklist.WHITELIST, ""); + // Try a regex with multiple groups + properties.put(HiveDatasetFinder.TABLE_FOLDER_ALLOWLIST_FILTER, "(/tmp/|a).*"); + + HiveDatasetFinder finder = new TestHiveDatasetFinder(FileSystem.getLocal(new Configuration()), properties, pool); + List<HiveDataset> datasets = Lists.newArrayList(finder.getDatasetsIterator()); + + Assert.assertEquals(datasets.size(), 1); + + properties.put(HiveDatasetFinder.HIVE_DATASET_PREFIX + "." + WhitelistBlacklist.WHITELIST, ""); + // The table located at /tmp/test should be filtered Review Comment: it would be helpful to call out that the dataset table is created at path `/tmp/test` on line 221 since there is another assertion above using a different regex which doesn't filter the table ########## gobblin-data-management/src/test/java/org/apache/gobblin/data/management/copy/hive/HiveDatasetFinderTest.java: ########## @@ -215,6 +215,32 @@ public void testDatasetConfig() throws Exception { } + @Test + public void testHiveTableFolderFilter() throws Exception { Review Comment: can be renamed to `testHiveTableFolderAllowlistFilter` ########## gobblin-data-management/src/main/java/org/apache/gobblin/data/management/copy/hive/HiveDatasetFinder.java: ########## @@ -294,6 +304,12 @@ protected HiveDataset computeNext() { }; } + protected static boolean shouldAllowTableLocation(Optional<String> regex, Table table) { + if (!regex.isPresent()) { + return true; + } + return Pattern.compile(regex.get()).matcher(table.getSd().getLocation()).matches(); Review Comment: Pattern.compile(regex.get()) is called every time the method shouldAllowTableLocation is executed when the regex is present. Compiling a regex every time can be inefficient, we can compile the pattern once and store it, eg.: `private static Pattern compiledAllowlistPattern = regex.map(Pattern::compile).orElse(null);` ########## gobblin-data-management/src/test/java/org/apache/gobblin/data/management/copy/hive/HiveDatasetFinderTest.java: ########## @@ -215,6 +215,32 @@ public void testDatasetConfig() throws Exception { } + @Test + public void testHiveTableFolderFilter() throws Exception { + List<HiveDatasetFinder.DbAndTable> dbAndTables = Lists.newArrayList(); + dbAndTables.add(new HiveDatasetFinder.DbAndTable("db1", "table1")); + HiveMetastoreClientPool pool = getTestPool(dbAndTables); + + Properties properties = new Properties(); + properties.put(HiveDatasetFinder.HIVE_DATASET_PREFIX + "." + WhitelistBlacklist.WHITELIST, ""); + // Try a regex with multiple groups + properties.put(HiveDatasetFinder.TABLE_FOLDER_ALLOWLIST_FILTER, "(/tmp/|a).*"); + + HiveDatasetFinder finder = new TestHiveDatasetFinder(FileSystem.getLocal(new Configuration()), properties, pool); + List<HiveDataset> datasets = Lists.newArrayList(finder.getDatasetsIterator()); + + Assert.assertEquals(datasets.size(), 1); + + properties.put(HiveDatasetFinder.HIVE_DATASET_PREFIX + "." + WhitelistBlacklist.WHITELIST, ""); + // The table located at /tmp/test should be filtered + properties.put(HiveDatasetFinder.TABLE_FOLDER_ALLOWLIST_FILTER, "/a/b"); + + finder = new TestHiveDatasetFinder(FileSystem.getLocal(new Configuration()), properties, pool); + datasets = Lists.newArrayList(finder.getDatasetsIterator()); + + Assert.assertEquals(datasets.size(), 0); Review Comment: can also add a test for the case where the regex is empty or null -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: dev-unsubscr...@gobblin.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org