[ 
https://issues.apache.org/jira/browse/NUTCH-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17047883#comment-17047883
 ] 

Sebastian Nagel edited comment on NUTCH-2770 at 2/28/20 7:01 PM:
-----------------------------------------------------------------

Hi [~jt55401], unfortunately the patch does not apply because it clashes with 
changes made for NUTCH-2692. I'll open a PR, same logic, slightly different 
code. I'll commit it soon. Thanks!


was (Author: wastl-nagel):
Hi [~jt55401], unfortunately the patch does apply because it clashes with 
changes made for NUTCH-2692. I'll open a PR, same logic, slightly different 
code. I'll commit it soon. Thanks!

> Subcollection logic allows empty string as a whitelist value, thus matching 
> every incoming document.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2770
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2770
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer, plugin
>    Affects Versions: 1.16
>            Reporter: Jason Grey
>            Priority: Minor
>             Fix For: 1.17
>
>         Attachments: NUTCH-2770.patch
>
>
> If the whitelist element in subcollections.xml contains empty lines at the end 
> (i.e., because the XML was formatted nicely), those lines can become an empty 
> string in the string matching logic. That logic uses String.contains, which 
> returns true when given an empty string.
> This then causes that subcollection to be tagged on EVERY incoming document.
> Here is a POC to show the issue in isolation, since I do not yet have a dev 
> environment set up for Nutch.
> {code:java}
> /**
> This is a snippet that does the same logic as Subcollection.java in nutch.
> https://github.com/apache/nutch/blob/fdee94d8e0894384f1fca7c9f16c7593a5bc928c/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
> **/
> import java.util.StringTokenizer;
> public class HelloWorld
> {
>   public static void main(String[] args)
>   {
>     String urlToTest = "https://www.example.com/test/url/here";
>     String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
>     StringTokenizer st = new StringTokenizer(text, "\n\r");
>     while (st.hasMoreElements()) {
>       String line = ((String) st.nextElement()).trim();
>       boolean matched = urlToTest.contains(line);
>       System.out.println("line: [" + line + "] = " + matched);
>     }
>   }
> }
> /**
> output:
> line: [//research.xyz.com/] = false
> line: [/research/] = false
> line: [] = true
> as we can see, for the text in our XML config, it's outputting an extra line 
> which is matching on EVERYTHING...
> **/   
> {code}
> There is a workaround: collapse the whitespace in the XML file. Still, I 
> think we should fix this anyway. I will try to do so and submit a patch soon 
> that filters out empty strings.
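
For illustration, the filtering described above could look roughly like the following. This is only a sketch under assumptions: the class name WhitelistSketch and the methods parsePatterns and matchesAny are hypothetical, and the change actually committed to Subcollection.java for NUTCH-2770 may be structured differently. It reuses the tokenizing from the POC and skips any line that is empty after trimming.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical helper, not the actual Nutch code.
class WhitelistSketch {

  /** Splits the whitelist text into trimmed patterns, dropping empty ones. */
  static List<String> parsePatterns(String text) {
    List<String> patterns = new ArrayList<>();
    StringTokenizer st = new StringTokenizer(text, "\n\r");
    while (st.hasMoreTokens()) {
      String line = st.nextToken().trim();
      // An empty pattern would make String.contains return true for any URL.
      if (!line.isEmpty()) {
        patterns.add(line);
      }
    }
    return patterns;
  }

  /** Returns true if the URL contains at least one of the patterns. */
  static boolean matchesAny(String url, List<String> patterns) {
    for (String pattern : patterns) {
      if (url.contains(pattern)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String urlToTest = "https://www.example.com/test/url/here";
    String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
    List<String> patterns = parsePatterns(text);
    System.out.println("patterns: " + patterns);
    System.out.println("matched: " + matchesAny(urlToTest, patterns));
  }
}
```

With the same input as the POC, the trailing whitespace-only line is dropped, so only the two real patterns remain and the example URL no longer matches.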



