[ https://issues.apache.org/jira/browse/NUTCH-2770?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17046543#comment-17046543 ]
Sebastian Nagel commented on NUTCH-2770:
----------------------------------------

Thanks, [~jt55401]! Let us know if you need help.

> Subcollection logic allows empty string as a whitelist value, thus matching
> every incoming document.
> ----------------------------------------------------------------------------------------------------
>
>                 Key: NUTCH-2770
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2770
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.16
>            Reporter: Jason Grey
>            Priority: Minor
>
> If the whitelist element in subcollections.xml contains empty lines at the
> end (i.e., because the XML was formatted nicely), those lines can become an
> empty string in the string matching logic. That logic uses String.contains,
> which returns true when given an empty string as input.
> This then causes that subcollection to be tagged on EVERY incoming document.
> Here is a POC to show the issue in isolation, since I do not yet have a dev
> environment set up for Nutch.
> {code:java}
> /**
> This is a snippet that does the same logic as Subcollection.java in nutch.
> https://github.com/apache/nutch/blob/fdee94d8e0894384f1fca7c9f16c7593a5bc928c/src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
> **/
> import java.lang.Math;
> import java.util.StringTokenizer;
> public class HelloWorld
> {
>   public static void main(String[] args)
>   {
>     String urlToTest = "https://www.example.com/test/url/here";
>     String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
>     StringTokenizer st = new StringTokenizer(text, "\n\r");
>     while (st.hasMoreElements()) {
>       String line = ((String) st.nextElement()).trim();
>       boolean matched = urlToTest.contains(line);
>       System.out.println("line: [" + line + "] = " + matched);
>     }
>   }
> }
> /**
> output:
> line: [//research.xyz.com/] = false
> line: [/research/] = false
> line: [] = true
> as we can see, for the text in our XML config, it's outputting an extra line
> which is matching on EVERYTHING...
> **/
> {code}
> There is a workaround: collapse the whitespace in the XML file. I think we
> should fix this anyway, though. I will try to do so and submit a patch soon
> that filters out empty strings.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
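
For illustration, the fix proposed in the issue description (dropping empty entries before the String.contains check) could look roughly like the sketch below. It reuses the tokenizing approach from the POC. The class name WhitelistSketch and the helpers parsePatterns and matches are hypothetical names chosen for this example; they are not taken from Subcollection.java and this is not the actual patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical sketch, not the real Nutch code: skip blank whitelist lines
// so that "" never reaches String.contains (which is true for any URL).
public class WhitelistSketch {

  // Split the raw element text on line breaks, trim, and keep non-empty entries only.
  static List<String> parsePatterns(String text) {
    List<String> patterns = new ArrayList<>();
    StringTokenizer st = new StringTokenizer(text, "\n\r");
    while (st.hasMoreTokens()) {
      String line = st.nextToken().trim();
      if (!line.isEmpty()) {
        patterns.add(line);
      }
    }
    return patterns;
  }

  // A URL matches if it contains at least one of the (non-empty) patterns.
  static boolean matches(String url, List<String> patterns) {
    for (String pattern : patterns) {
      if (url.contains(pattern)) {
        return true;
      }
    }
    return false;
  }

  public static void main(String[] args) {
    String text = "\r\n\t//research.xyz.com/\r\n\t/research/\r\n\t";
    List<String> patterns = parsePatterns(text);
    System.out.println("patterns: " + patterns);
    System.out.println("matched: "
        + matches("https://www.example.com/test/url/here", patterns));
  }
}
```

With the same input as the POC, the trailing whitespace-only line is discarded, so the example URL no longer matches the subcollection.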