[jira] [Commented] (NUTCH-3017) Allow fast-urlfilter to load from HDFS/S3 and support gzipped input

ASF GitHub Bot (Jira) Tue, 31 Oct 2023 03:31:08 -0700


    [ 
https://issues.apache.org/jira/browse/NUTCH-3017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17781301#comment-17781301
 ]


ASF GitHub Bot commented on NUTCH-3017:
---------------------------------------

sebastian-nagel commented on code in PR #793:
URL: https://github.com/apache/nutch/pull/793#discussion_r1377375552


##########
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java:
##########
@@ -181,9 +186,23 @@ public String filter(String url) {
 
   public void reloadRules() throws IOException {
     String fileRules = conf.get(URLFILTER_FAST_FILE);
-    try (Reader reader = conf.getConfResourceAsReader(fileRules)) {
-      reloadRules(reader);
+
+    InputStream is;
+
+    Path fileRulesPath = new Path(fileRules);
+    if (fileRulesPath.toUri().getScheme() != null) {
+      FileSystem fs = fileRulesPath.getFileSystem(conf);
+      is = fs.open(fileRulesPath);
+    }

Review Comment:
   Since we have Hadoop, we could try all supported compression codecs (gzip, 
bzip2, zstd, etc.). Something like this (not tested):
   ```java
   CompressionCodecFactory cf = new CompressionCodecFactory(conf);
   CompressionCodec codec = cf.getCodec(fileRulesPath);
   if (codec != null) {
      is = codec.createInputStream(is);
   }
   ```
   See [cf.getCodec(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodecFactory.html#getCodec-org.apache.hadoop.fs.Path-) and [codec.createInputStream(...)](https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/io/compress/CompressionCodec.html#createInputStream-java.io.InputStream-).
   
   If the rules file is contained in the job jar, it shouldn't be compressed 
anyway.
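
   For illustration only, here is a plain-JDK sketch of the same idea: pick a decompressor from the file name suffix and wrap the raw stream. It is not the Hadoop API. It swaps `CompressionCodecFactory` for `java.util.zip.GZIPInputStream`, so it covers gzip only, whereas the codec factory would cover every configured codec. The class `GzipAwareRulesReader` and its methods are hypothetical names, and the rule lines are placeholders rather than real fast-urlfilter syntax.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical helper for illustration; not part of Nutch or Hadoop. */
class GzipAwareRulesReader {

  /** Wraps the stream in a gzip decoder when the file name ends in ".gz". */
  static InputStream maybeDecompress(String fileName, InputStream raw)
      throws IOException {
    if (fileName.toLowerCase(Locale.ROOT).endsWith(".gz")) {
      return new GZIPInputStream(raw);
    }
    return raw; // no known suffix: read the bytes as they are
  }

  /** Reads all lines of a (possibly gzipped) rules file as UTF-8. */
  static List<String> readLines(String fileName, InputStream raw)
      throws IOException {
    List<String> lines = new ArrayList<>();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(
        maybeDecompress(fileName, raw), StandardCharsets.UTF_8))) {
      String line;
      while ((line = reader.readLine()) != null) {
        lines.add(line);
      }
    }
    return lines;
  }

  /** Gzips a string in memory; only used to build demo input. */
  static byte[] gzip(String text) throws IOException {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (GZIPOutputStream gz = new GZIPOutputStream(bytes)) {
      gz.write(text.getBytes(StandardCharsets.UTF_8));
    }
    return bytes.toByteArray();
  }

  public static void main(String[] args) throws IOException {
    String text = "rule-one\nrule-two\n"; // placeholder content
    InputStream compressed = new ByteArrayInputStream(gzip(text));
    // The ".gz" suffix triggers decompression; the lines print as plain text.
    for (String line : readLines("rules.txt.gz", compressed)) {
      System.out.println(line);
    }
  }
}
```

   The caller never needs to know whether the input was compressed. The Hadoop snippet above provides the same property, with the suffix lookup delegated to the codec factory.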





> Allow fast-urlfilter to load from HDFS/S3 and support gzipped input
> -------------------------------------------------------------------
>
>                 Key: NUTCH-3017
>                 URL: https://issues.apache.org/jira/browse/NUTCH-3017
>             Project: Nutch
>          Issue Type: Improvement
>          Components: plugin, urlfilter
>    Affects Versions: 1.19
>            Reporter: Julien Nioche
>            Priority: Minor
>             Fix For: 1.20
>
>
> This provides an easier way to refresh the resources, since no rebuild of the 
> jar will be needed. The path can point to either HDFS or S3. Additionally, 
> .gz files should be handled automatically.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
