[
https://issues.apache.org/jira/browse/NUTCH-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648859#comment-16648859
]
ASF GitHub Bot commented on NUTCH-1678:
---------------------------------------
sebastian-nagel closed pull request #390: NUTCH-1678 Remove dependency on org.apache.oro
URL: https://github.com/apache/nutch/pull/390
This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
diff --git a/LICENSE.txt b/LICENSE.txt
index 1b7a967d2..9badcdad6 100644
--- a/LICENSE.txt
+++ b/LICENSE.txt
@@ -1079,62 +1079,6 @@ http://www.python.org. Full license is here:
http://www.python.org/download/releases/2.4.2/license/
-lib/jakarta-oro-2.0.8.jar
-
-/* ====================================================================
- * The Apache Software License, Version 1.1
- *
- * Copyright (c) 2000-2002 The Apache Software Foundation. All rights
- * reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- * notice, this list of conditions and the following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above copyright
- * notice, this list of conditions and the following disclaimer in
- * the documentation and/or other materials provided with the
- * distribution.
- *
- * 3. The end-user documentation included with the redistribution,
- * if any, must include the following acknowledgment:
- * "This product includes software developed by the
- * Apache Software Foundation (http://www.apache.org/)."
- * Alternately, this acknowledgment may appear in the software itself,
- * if and wherever such third-party acknowledgments normally appear.
- *
- * 4. The names "Apache" and "Apache Software Foundation", "Jakarta-Oro"
- * must not be used to endorse or promote products derived from this
- * software without prior written permission. For written
- * permission, please contact [email protected].
- *
- * 5. Products derived from this software may not be called "Apache"
- * or "Jakarta-Oro", nor may "Apache" or "Jakarta-Oro" appear in their
- * name, without prior written permission of the Apache Software Foundation.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
- * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED. IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
- * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
- * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
- * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
- * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- * ====================================================================
- *
- * This software consists of voluntary contributions made by many
- * individuals on behalf of the Apache Software Foundation. For more
- * information on the Apache Software Foundation, please see
- * <http://www.apache.org/>.
- */
-
lib/jetty-ext/commons-el.jar
/*
diff --git a/conf/parse-plugins.xml b/conf/parse-plugins.xml
index 5b20be6e1..6e3069897 100644
--- a/conf/parse-plugins.xml
+++ b/conf/parse-plugins.xml
@@ -51,6 +51,10 @@
<plugin id="parse-zip" />
</mimeType>
+ <mimeType name="application/javascript">
+ <plugin id="parse-js" />
+ </mimeType>
+
<mimeType name="application/x-javascript">
<plugin id="parse-js" />
</mimeType>
diff --git a/conf/regex-normalize.xml.template b/conf/regex-normalize.xml.template
index ec60c1081..ed3f7108a 100644
--- a/conf/regex-normalize.xml.template
+++ b/conf/regex-normalize.xml.template
@@ -17,7 +17,8 @@
-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
This is intended so that users can specify substitutions to be
- done on URLs. The regex engine that is used is Perl5 compatible.
+ done on URLs using the Java regex syntax, see
+ https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The rules are applied to URLs in the order they occur in this file. -->
<!-- WATCH OUT: an xml parser reads this file an ampersands must be
diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 1b8d71494..4b2250a24 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -69,7 +69,6 @@
<dependency org="xerces" name="xercesImpl" rev="2.9.1" />
<dependency org="xerces" name="xmlParserAPIs" rev="2.6.2" />
<dependency org="xalan" name="serializer" rev="2.7.1" />
- <dependency org="oro" name="oro" rev="2.0.8" />
<dependency org="org.jdom" name="jdom" rev="1.1" conf="*->default" />
@@ -137,7 +136,7 @@
<!-- Uncomment this to use MongoDB as Gora backend. -->
<!--
<dependency org="org.apache.gora" name="gora-mongodb" rev="0.8" conf="*->default" />
- -->
+ -->
<!-- Uncomment this to use OrientDB as Gora backend. -->
<!--
<dependency org="org.apache.gora" name="gora-orientdb" rev="0.8" conf="*->default" />
diff --git a/src/java/org/apache/nutch/parse/OutlinkExtractor.java b/src/java/org/apache/nutch/parse/OutlinkExtractor.java
index b4214b506..c0b61d489 100644
--- a/src/java/org/apache/nutch/parse/OutlinkExtractor.java
+++ b/src/java/org/apache/nutch/parse/OutlinkExtractor.java
@@ -21,19 +21,14 @@
import java.net.MalformedURLException;
import java.util.ArrayList;
import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
-import org.apache.hadoop.conf.Configuration;
-import org.apache.oro.text.regex.MatchResult;
-import org.apache.oro.text.regex.Pattern;
-import org.apache.oro.text.regex.PatternCompiler;
-import org.apache.oro.text.regex.PatternMatcher;
-import org.apache.oro.text.regex.PatternMatcherInput;
-import org.apache.oro.text.regex.Perl5Compiler;
-import org.apache.oro.text.regex.Perl5Matcher;
-
/**
* Extractor to extract {@link org.apache.nutch.parse.Outlink}s / URLs from
* plain text using Regular Expressions.
@@ -60,7 +55,8 @@
* </a>
*/
- private static final String URL_PATTERN = "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)";
+ private static final Pattern URL_PATTERN = Pattern.compile(
+ "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");
/**
* Extracts <code>Outlink</code> from given plain text. Applying this method
@@ -72,7 +68,8 @@
*
* @return Array of <code>Outlink</code>s within found in plainText
*/
- public static Outlink[] getOutlinks(final String plainText, Configuration conf) {
+ public static Outlink[] getOutlinks(final String plainText,
+ Configuration conf) {
return OutlinkExtractor.getOutlinks(plainText, "", conf);
}
@@ -89,23 +86,20 @@
*/
public static Outlink[] getOutlinks(final String plainText, String anchor,
Configuration conf) {
- long start = System.currentTimeMillis();
- final List<Outlink> outlinks = new ArrayList<Outlink>();
- try {
- final PatternCompiler cp = new Perl5Compiler();
- final Pattern pattern = cp.compile(URL_PATTERN,
- Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK
- | Perl5Compiler.MULTILINE_MASK);
- final PatternMatcher matcher = new Perl5Matcher();
+ if (plainText == null) {
+ return new Outlink[0];
+ }
- final PatternMatcherInput input = new PatternMatcherInput(plainText);
+ long start = System.currentTimeMillis();
+ final List<Outlink> outlinks = new ArrayList<>();
- MatchResult result;
+ try {
+ Matcher matcher = URL_PATTERN.matcher(plainText);
String url;
- // loop the matches
- while (matcher.contains(input, pattern)) {
+ // Check for stuff!
+ while (matcher.find()) {
// if this is taking too long, stop matching
// (SHOULD really check cpu time used so that heavily loaded systems
// do not unnecessarily hit this limit.)
@@ -115,8 +109,9 @@
}
break;
}
- result = matcher.getMatch();
- url = result.group(0);
+
+ url = matcher.group().trim();
+
try {
outlinks.add(new Outlink(url, anchor));
} catch (MalformedURLException mue) {
diff --git a/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java b/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
index 230373d35..9faccd2e8 100644
--- a/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
+++ b/src/plugin/index-more/src/java/org/apache/nutch/indexer/more/MoreIndexingFilter.java
@@ -5,6 +5,9 @@
import java.util.Collection;
import java.util.Date;
import java.util.HashSet;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+import java.util.regex.PatternSyntaxException;
import org.apache.avro.util.Utf8;
import org.apache.commons.lang.time.DateUtils;
@@ -17,12 +20,6 @@
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.storage.WebPage.Field;
import org.apache.nutch.util.MimeUtil;
-import org.apache.oro.text.regex.MalformedPatternException;
-import org.apache.oro.text.regex.MatchResult;
-import org.apache.oro.text.regex.PatternMatcher;
-import org.apache.oro.text.regex.Perl5Compiler;
-import org.apache.oro.text.regex.Perl5Matcher;
-import org.apache.oro.text.regex.Perl5Pattern;
import org.apache.solr.common.util.DateUtil;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
@@ -224,18 +221,16 @@ private NutchDocument addType(NutchDocument doc, WebPage page, String url) {
// Patterns used to extract filename from possible non-standard
// HTTP header "Content-Disposition". Typically it looks like:
// Content-Disposition: inline; filename="foo.ppt"
- private PatternMatcher matcher = new Perl5Matcher();
-
private Configuration conf;
- static Perl5Pattern patterns[] = { null, null };
+
+ static Pattern patterns[] = { null, null };
+
static {
- Perl5Compiler compiler = new Perl5Compiler();
try {
// order here is important
- patterns[0] = (Perl5Pattern) compiler
- .compile("\\bfilename=['\"](.+)['\"]");
- patterns[1] = (Perl5Pattern) compiler.compile("\\bfilename=(\\S+)\\b");
- } catch (MalformedPatternException e) {
+ patterns[0] = Pattern.compile("\\bfilename=['\"](.+)['\"]");
+ patterns[1] = Pattern.compile("\\bfilename=(\\S+)\\b");
+ } catch (PatternSyntaxException e) {
// just ignore
}
}
@@ -246,12 +241,10 @@ private NutchDocument resetTitle(NutchDocument doc, WebPage page, String url) {
if (contentDisposition == null)
return doc;
- MatchResult result;
for (int i = 0; i < patterns.length; i++) {
- if (matcher.contains(contentDisposition.toString(), patterns[i])) {
- result = matcher.getMatch();
- doc.removeField("title");
- doc.add("title", result.group(1));
+ Matcher matcher = patterns[i].matcher(contentDisposition);
+ if (matcher.find()) {
+ doc.add("title", matcher.group(1));
break;
}
}
diff --git a/src/plugin/parse-js/plugin.xml b/src/plugin/parse-js/plugin.xml
index ae1c608f3..c8abba3b4 100644
--- a/src/plugin/parse-js/plugin.xml
+++ b/src/plugin/parse-js/plugin.xml
@@ -36,7 +36,7 @@
point="org.apache.nutch.parse.Parser">
<implementation id="JSParser"
class="org.apache.nutch.parse.js.JSParseFilter">
- <parameter name="contentType" value="application/x-javascript"/>
+ <parameter name="contentType" value="application/x-javascript|application/javascript"/>
<parameter name="pathSuffix" value="js"/>
</implementation>
</extension>
@@ -45,7 +45,7 @@
point="org.apache.nutch.parse.ParseFilter">
<implementation id="JSParseFilter"
class="org.apache.nutch.parse.js.JSParseFilter">
- <parameter name="contentType" value="application/x-javascript"/>
+ <parameter name="contentType" value="application/x-javascript|application/javascript"/>
<parameter name="pathSuffix" value=""/>
</implementation>
</extension>
diff --git a/src/plugin/parse-js/sample/parse_pure_js_test.js b/src/plugin/parse-js/sample/parse_pure_js_test.js
new file mode 100644
index 000000000..f196313f8
--- /dev/null
+++ b/src/plugin/parse-js/sample/parse_pure_js_test.js
@@ -0,0 +1,24 @@
+// test data for link extraction from "pure" JavaScript
+
+function selectProvider(form) {
+ provider = form.elements['searchProvider'].value;
+ if (provider == "any") {
+ if (Math.random() > 0.5) {
+ provider = "lucid";
+ } else {
+ provider = "sl";
+ }
+ }
+
+ if (provider == "lucid") {
+ form.action = "http://search.lucidimagination.com/p:nutch";
+ } else if (provider == "sl") {
+ form.action = "http://search-lucene.com/nutch";
+ }
+
+ days = 90; // cookie will be valid for 90 days
+ date = new Date();
+ date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
+ expires = "; expires=" + date.toGMTString();
+ document.cookie = "searchProvider=" + provider + expires + "; path=/";
+}
diff --git a/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java b/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
index 5f8e4b18a..bc967dc3c 100644
--- a/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
+++ b/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
@@ -16,11 +16,11 @@
*/
package org.apache.nutch.parse.js;
-import java.lang.invoke.MethodHandles;
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.InputStream;
import java.io.InputStreamReader;
+import java.lang.invoke.MethodHandles;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
@@ -28,29 +28,22 @@
import java.util.Collection;
import java.util.List;
import java.util.Locale;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.parse.HTMLMetaTags;
-import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.parse.Outlink;
import org.apache.nutch.parse.Parse;
-import org.apache.nutch.parse.ParseStatusCodes;
+import org.apache.nutch.parse.ParseFilter;
import org.apache.nutch.parse.ParseStatusUtils;
import org.apache.nutch.parse.Parser;
import org.apache.nutch.storage.ParseStatus;
import org.apache.nutch.storage.WebPage;
import org.apache.nutch.util.Bytes;
import org.apache.nutch.util.NutchConfiguration;
-import org.apache.nutch.util.TableUtil;
-import org.apache.oro.text.regex.MatchResult;
-import org.apache.oro.text.regex.Pattern;
-import org.apache.oro.text.regex.PatternCompiler;
-import org.apache.oro.text.regex.PatternMatcher;
-import org.apache.oro.text.regex.PatternMatcherInput;
-import org.apache.oro.text.regex.Perl5Compiler;
-import org.apache.oro.text.regex.Perl5Matcher;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
import org.w3c.dom.DocumentFragment;
import org.w3c.dom.Element;
import org.w3c.dom.NamedNodeMap;
@@ -85,7 +78,7 @@
* within the {@link HTMLMetaTags}
* @param doc
* The {@link DocumentFragment} object
- * @return parse the actual {@link Parse} object
+ * @return parse the actual {@link Parse} object with additional outlinks from JavaScript
*/
@Override
public Parse filter(String url, WebPage page, Parse parse,
@@ -169,7 +162,7 @@ private void walk(Node n, Parse parse, HTMLMetaTags metaTags, String base,
}
/**
- * Set the {@link Configuration} object
+ * Parse a JavaScript file and extract outlinks
*
* @param url
* URL of the {@link WebPage} which is parsed
@@ -179,12 +172,6 @@ private void walk(Node n, Parse parse, HTMLMetaTags metaTags, String base,
*/
@Override
public Parse getParse(String url, WebPage page) {
- String type = TableUtil.toString(page.getContentType());
- if (type != null && !type.trim().equals("")
- && !type.toLowerCase(Locale.ROOT).startsWith("application/x-javascript"))
- return ParseStatusUtils.getEmptyParse(
- ParseStatusCodes.FAILED_INVALID_FORMAT, "Content not JavaScript: '"
- + type + "'", getConf());
String script = Bytes.toString(page.getContent());
Outlink[] outlinks = getJSLinks(script, "", url);
if (outlinks == null)
@@ -205,9 +192,13 @@ public Parse getParse(String url, WebPage page) {
return parse;
}
- private static final String STRING_PATTERN = "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)";
+ private static final Pattern STRING_PATTERN = Pattern.compile(
+ "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)",
+ Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
// A simple pattern. This allows also invalid URL characters.
- private static final String URI_PATTERN = "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)";
+ private static final Pattern URI_PATTERN = Pattern.compile(
+ "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)",
+ Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
// Alternative pattern, which limits valid url characters.
// private static final String URI_PATTERN =
@@ -230,30 +221,15 @@ public Parse getParse(String url, WebPage page) {
}
try {
- final PatternCompiler cp = new Perl5Compiler();
- final Pattern pattern = cp.compile(STRING_PATTERN,
- Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK
- | Perl5Compiler.MULTILINE_MASK);
- final Pattern pattern1 = cp.compile(URI_PATTERN,
- Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK
- | Perl5Compiler.MULTILINE_MASK);
- final PatternMatcher matcher = new Perl5Matcher();
- final PatternMatcher matcher1 = new Perl5Matcher();
- final PatternMatcherInput input = new PatternMatcherInput(plainText);
+ Matcher matcher = STRING_PATTERN.matcher(plainText);
- MatchResult result;
String url;
- // loop the matches
- while (matcher.contains(input, pattern)) {
- result = matcher.getMatch();
- url = result.group(2);
- PatternMatcherInput input1 = new PatternMatcherInput(url);
- if (!matcher1.matches(input1, pattern1)) {
- if (LOG.isTraceEnabled()) {
- LOG.trace(" - invalid '" + url + "'");
- }
+ while (matcher.find()) {
+ url = matcher.group(2);
+ Matcher matcherUri = URI_PATTERN.matcher(url);
+ if (!matcherUri.matches()) {
continue;
}
if (url.startsWith("www.")) {
@@ -316,6 +292,8 @@ public static void main(String[] args) throws Exception {
String line = null;
while ((line = br.readLine()) != null)
sb.append(line + "\n");
+ br.close();
+
JSParseFilter parseFilter = new JSParseFilter();
parseFilter.setConf(NutchConfiguration.create());
Outlink[] links = parseFilter.getJSLinks(sb.toString(), "", args[1]);
diff --git a/src/plugin/parse-js/src/test/org/apache/nutch/parse/js/TestJSParseFilter.java b/src/plugin/parse-js/src/test/org/apache/nutch/parse/js/TestJSParseFilter.java
index c8f943140..7aa4b788e 100644
--- a/src/plugin/parse-js/src/test/org/apache/nutch/parse/js/TestJSParseFilter.java
+++ b/src/plugin/parse-js/src/test/org/apache/nutch/parse/js/TestJSParseFilter.java
@@ -28,55 +28,64 @@
import org.apache.nutch.util.NutchConfiguration;
import org.junit.Before;
import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
import java.io.DataInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
+import java.lang.invoke.MethodHandles;
import java.nio.ByteBuffer;
+import java.util.Arrays;
+import java.util.Set;
+import java.util.TreeSet;
import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
/**
- * JUnit test case for {@link JSParseFilter} which tests 1. That 5 outlinks are
- * extracted from JavaScript snippets embedded in HTML 2. That X outlinks are
- * extracted from a pure JavaScript file (this is temporarily disabled)
- *
- * @author lewismc
+ * JUnit test case for {@link JSParseFilter} which tests
+ * <ol>
+ * <li>That 2 outlinks are extracted from JavaScript snippets embedded in
+ * HTML</li>
+ * <li>That 2 outlinks are extracted from a pure JavaScript file.</li>
+ * </ol>
*/
-
public class TestJSParseFilter {
+ private static final Logger LOG = LoggerFactory
+ .getLogger(MethodHandles.lookup().lookupClass());
+
private String fileSeparator = System.getProperty("file.separator");
// This system property is defined in ./src/plugin/build-plugin.xml
private String sampleDir = System.getProperty("test.data", ".");
- // Make sure sample files are copied to "test.data" as specified in
- // ./src/plugin/parse-js/build.xml during plugin compilation.
- private String[] sampleFiles = { "parse_pure_js_test.js",
- "parse_embedded_js_test.html" };
-
private Configuration conf;
@Before
public void setUp() {
conf = NutchConfiguration.create();
- conf.set("file.content.limit", "-1");
+ conf.set("plugin.includes", "parse-(html|js)");
}
- public Outlink[] getOutlinks(String[] sampleFiles) throws ProtocolException,
- ParseException, IOException {
- String urlString;
+ public Outlink[] getOutlinks(String sampleFile)
+ throws ProtocolException, ParseException, IOException {
+ String urlString, fileName;
Parse parse;
- urlString = "file:" + sampleDir + fileSeparator + sampleFiles;
- File file = new File(urlString);
+ fileName = sampleDir + fileSeparator + sampleFile;
+ urlString = "file:" + fileName;
+
+ urlString = "file:" + sampleDir + fileSeparator + sampleFile;
+ File file = new File(fileName);
byte[] bytes = new byte[(int) file.length()];
DataInputStream dip = new DataInputStream(new FileInputStream(file));
dip.readFully(bytes);
dip.close();
+ LOG.info("Parsing {}", urlString);
WebPage page = WebPage.newBuilder().build();
page.setBaseUrl(new Utf8(urlString));
page.setContent(ByteBuffer.wrap(bytes));
@@ -85,24 +94,34 @@ public void setUp() {
page.setContentType(new Utf8(mime));
parse = new ParseUtil(conf).parse(urlString, page);
+ LOG.info("Parsed {} with {} outlinks: {}", urlString,
+ parse.getOutlinks().length, Arrays.toString(parse.getOutlinks()));
return parse.getOutlinks();
}
@Test
- public void testOutlinkExtraction() throws ProtocolException, ParseException,
- IOException {
+ public void testJavaScriptOutlinkExtraction()
+ throws ProtocolException, ParseException, IOException {
String[] filenames = new File(sampleDir).list();
for (int i = 0; i < filenames.length; i++) {
- if (filenames[i].endsWith(".js") == true) {
- assertEquals("number of outlinks in .js test file should be 5", 5,
- getOutlinks(sampleFiles));
- // temporarily disabled as a suitable pure JS file could not be be
- // found.
- // } else {
- // assertEquals("number of outlinks in .html file should be X", 5,
- // getOutlinks(sampleFiles));
+ Outlink[] outlinks = getOutlinks(filenames[i]);
+ if (filenames[i].endsWith("parse_pure_js_test.js")) {
+ assertEquals("number of outlinks in .js test file should be X", 2,
+ outlinks.length);
+ assertEquals("http://search.lucidimagination.com/p:nutch", outlinks[0].getToUrl());
+ assertEquals("http://search-lucene.com/nutch", outlinks[1].getToUrl());
+ } else {
+ assertTrue("number of outlinks in .html file should be at least 2", outlinks.length >= 2);
+ Set<String> outlinkSet = new TreeSet<>();
+ for (Outlink o : outlinks) {
+ outlinkSet.add(o.getToUrl());
+ }
+ assertTrue("http://search.lucidimagination.com/p:nutch not in outlinks",
+ outlinkSet.contains("http://search.lucidimagination.com/p:nutch"));
+ assertTrue("http://search-lucene.com/nutch not in outlinks",
+ outlinkSet.contains("http://search-lucene.com/nutch"));
}
}
}
-}
\ No newline at end of file
+}
diff --git a/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml b/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
index 4d6eabcf2..3d1f7186c 100644
--- a/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
+++ b/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
@@ -1,7 +1,24 @@
<?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
This is intended so that users can specify substitutions to be
- done on URLs. The regex engine that is used is Perl5 compatible.
+ done on URLs using the Java regex syntax, see
+ https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The rules are applied to URLs in the order they occur in this file. -->
<!-- WATCH OUT: an xml parser reads this file an ampersands must be
diff --git a/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.xml b/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.xml
index 369896878..fc8e05e2d 100644
--- a/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.xml
+++ b/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.xml
@@ -1,7 +1,24 @@
<?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
<!-- This is the configuration file for the RegexUrlNormalize Class.
This is intended so that users can specify substitutions to be
- done on URLs. The regex engine that is used is Perl5 compatible.
+ done on URLs using the Java regex syntax, see
+ https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
The rules are applied to URLs in the order they occur in this file. -->
<!-- WATCH OUT: an xml parser reads this file an ampersands must be
----------------------------------------------------------------
> Remove dependency on org.apache.oro
> -----------------------------------
>
> Key: NUTCH-1678
> URL: https://issues.apache.org/jira/browse/NUTCH-1678
> Project: Nutch
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.2
> Reporter: James Sullivan
> Priority: Minor
> Labels: newbie, patch
> Fix For: 2.5
>
> Attachments: 2.x.patch
>
>
> org.apache.oro has been archived for three years, and it may be good to remove
> the dependency now that Java has had built-in regexes for quite some time.
> The regexes do not seem to need any Perl5-specific functionality, so unless
> there are specific threading or performance reasons for continuing to use oro,
> it may be time to drop the dependency. The attached patch needs to be checked
> thoroughly, as I am rusty with Java and the unit tests are sparse.
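
The migration described in this issue can be sketched roughly as follows. This is a minimal illustration, not the code from the pull request: the class name `RegexMigrationSketch`, the method `quotedStrings`, and the simplified pattern are assumptions made for the example. It shows the general shape of the change, where an ORO `matcher.contains(input, pattern)` loop becomes a `java.util.regex` `matcher.find()` loop and the pattern is compiled once into a static field. A compiled `Pattern` is immutable and safe to share between threads, while a `Matcher` is not, so one is created per call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexMigrationSketch {

  // Compiled once and shared. The pattern itself is a simplified,
  // hypothetical stand-in: a quote, a run of non-space and non-quote
  // characters, then the same quote again (back-reference \1).
  private static final Pattern QUOTED =
      Pattern.compile("([\"'])([^\\s\"']+?)\\1");

  /** Returns the contents of every quoted, whitespace-free literal in the text. */
  public static List<String> quotedStrings(String text) {
    List<String> found = new ArrayList<>();
    if (text == null) {
      return found; // nothing to scan
    }
    // Matcher is not thread-safe, so create a fresh one per call.
    Matcher matcher = QUOTED.matcher(text);
    // find() advances through the input, like ORO's contains(input, pattern).
    while (matcher.find()) {
      found.add(matcher.group(2)); // group 2 is the text between the quotes
    }
    return found;
  }

  public static void main(String[] args) {
    System.out.println(
        quotedStrings("a = \"http://example.com/x\"; b = 'y.html';"));
  }
}
```

Note that `find()` searches for the next match anywhere in the input, whereas `Matcher.matches()` succeeds only if the whole input matches, so the two are not interchangeable when porting ORO calls.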