[ https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781301#comment-17781301 ]
ASF GitHub Bot commented on NUTCH-3017:
---------------------------------------

sebastian-nagel commented on code in PR #793:
URL: https://github.com/apache/nutch/pull/793#discussion_r1377375552


##########
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##########
@@ -181,9 +186,23 @@ public String filter(String url) {
   public void reloadRules() throws IOException {
     String fileRules = conf.get(URLFILTER_FAST_FILE);
-    try (Reader reader = conf.getConfResourceAsReader(fileRules)) {
-      reloadRules(reader);
+
+    InputStream is;
+
+    Path fileRulesPath = new Path(fileRules);
+    if (fileRulesPath.toUri().getScheme() != null) {
+      FileSystem fs = fileRulesPath.getFileSystem(conf);
+      is = fs.open(fileRulesPath);
+    }

Review Comment:
   Since we have Hadoop, we could try all supported compression codecs (gzip, bzip2, zstd, etc.). Something such as (not tested):
   ```java
   CompressionCodecFactory cf = new CompressionCodecFactory(conf);
   CompressionCodec codec = cf.getCodec(fileRulesPath);
   if (codec != null) {
     is = codec.createInputStream(is);
   }
   ```
   See [cf.getCodec(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec-org.apache.hadoop.fs.Path-) and [codec.createInputStream(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodec.html#createInputStream-java.io.InputStream-).

   If the rules file is contained in the job jar, it shouldn't be compressed anyway.



> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> -------------------------------------------------------------------
>
>                 Key: NUTCH-3017
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3017
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, urlfilter
>    Affects Versions: 1.19
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.20
>
>
> This provides an easier way to refresh the resources since no rebuild of the
> jar will be needed. The path can point to either HDFS or S3. Additionally,
> .gz files should be handled automatically



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
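
For illustration only: Hadoop's `CompressionCodecFactory.getCodec(Path)` picks a codec from the file-name suffix and returns `null` when none matches. The sketch below mimics that behaviour for `.gz` alone, using `java.util.zip` from the JDK instead of Hadoop so that it runs without any Hadoop dependency. `RulesStreamSketch` and its methods are hypothetical names and are not part of Nutch or of PR #793.

```java
import java.io.BufferedReader;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper; a JDK-only stand-in for suffix-based codec selection. */
final class RulesStreamSketch {

  /** Wraps the raw stream in a gzip decoder only when the name ends in ".gz". */
  static InputStream maybeDecompress(String fileName, InputStream raw)
      throws IOException {
    if (fileName.endsWith(".gz")) {
      return new GZIPInputStream(raw);
    }
    // No matching suffix: hand the stream back untouched,
    // analogous to the "codec == null" branch in the review comment.
    return raw;
  }

  /** Reads the first line of a (possibly gzipped) rules stream as UTF-8. */
  static String firstLine(String fileName, InputStream raw) throws IOException {
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        maybeDecompress(fileName, raw), StandardCharsets.UTF_8))) {
      return reader.readLine();
    }
  }

  /** Gzips a string in memory so the sketch can be exercised without files. */
  static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return bytes.toByteArray();
  }
}
```

In the plugin itself the Hadoop codec factory would remain the better choice, since it covers every codec configured for the cluster and not just gzip.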