[
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781301#comment-17781301
]
ASF GitHub Bot commented on NUTCH-3017:
---------------------------------------
sebastian-nagel commented on code in PR #793:
URL: https://github.com/apache/nutch/pull/793#discussion_r1377375552
##########
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##########
@@ -181,9 +186,23 @@ public String filter(String url) {
public void reloadRules() throws IOException {
String fileRules = conf.get(URLFILTER_FAST_FILE);
- try (Reader reader = conf.getConfResourceAsReader(fileRules)) {
- reloadRules(reader);
+
+ InputStream is;
+
+ Path fileRulesPath = new Path(fileRules);
+ if (fileRulesPath.toUri().getScheme() != null) {
+ FileSystem fs = fileRulesPath.getFileSystem(conf);
+ is = fs.open(fileRulesPath);
+ }
Review Comment:
Since we have Hadoop, we could try all supported compression codecs (gzip,
bzip2, zstd, etc.). Something like this (not tested):
```java
CompressionCodecFactory cf = new CompressionCodecFactory(conf);
CompressionCodec codec = cf.getCodec(fileRulesPath);
if (codec != null) {
is = codec.createInputStream(is);
}
```
See
[cf.getCodec(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec-org.apache.hadoop.fs.Path-)
and
[codec.createInputStream(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodec.html#createInputStream-java.io.InputStream-).
If the rules file is contained in the job jar, it shouldn't be compressed
anyway.
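For illustration only, here is a minimal stand-alone sketch of the same "pick the decompressor from the file name" idea. It uses only the JDK's `java.util.zip` instead of Hadoop's `CompressionCodecFactory`, so it handles `.gz` and nothing else. The class name `RulesStreamSketch`, its helper methods, and the sample rule text are hypothetical and are not part of the PR:
```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical sketch: JDK-only stand-in for suffix-based codec selection.
public class RulesStreamSketch {

  /** Wraps raw in a GZIPInputStream if fileName ends in ".gz", else returns raw unchanged. */
  static InputStream openMaybeCompressed(String fileName, InputStream raw)
      throws IOException {
    if (fileName.endsWith(".gz")) {
      return new GZIPInputStream(raw);
    }
    return raw;
  }

  /** Gzips the UTF-8 bytes of text (only needed to build sample input). */
  static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
      out.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return buffer.toByteArray();
  }

  /** Reads content as UTF-8 text, decompressing first if the name says so. */
  static String readAll(String fileName, byte[] content) throws IOException {
    try (InputStream in =
        openMaybeCompressed(fileName, new ByteArrayInputStream(content))) {
      return new String(in.readAllBytes(), StandardCharsets.UTF_8);
    }
  }

  public static void main(String[] args) throws IOException {
    String rules = "placeholder rule line\n"; // not a real rules file
    System.out.println(readAll("rules.txt.gz", gzip(rules)).equals(rules));
    System.out.println(
        readAll("rules.txt", rules.getBytes(StandardCharsets.UTF_8)).equals(rules));
  }
}
```
The Hadoop variant above has the advantage that it picks up whatever codecs are configured for the cluster rather than hard-coding a single suffix.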
> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> -------------------------------------------------------------------
>
> Key: NUTCH-3017
> URL: https://issues.apache.org/jira/browse/NUTCH-3017
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, urlfilter
> Affects Versions: 1.19
> Reporter: Julien Nioche
> Priority: Minor
> Fix For: 1.20
>
>
> This provides an easier way to refresh the resources, since no rebuild of the
> jar will be needed. The path can point to either HDFS or S3. Additionally,
> .gz files should be handled automatically.