[jira] [Commented] (NUTCH-2192) Get rid of oro

ASF GitHub Bot (JIRA) Sat, 13 Oct 2018 02:53:42 -0700


    [ 
https://issues.apache.org/jira/browse/NUTCH-2192?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16648856#comment-16648856
 ]


ASF GitHub Bot commented on NUTCH-2192:
---------------------------------------

sebastian-nagel closed pull request #389: NUTCH-2192 Migrate from Apache ORO to java.util.regex
URL: https://github.com/apache/nutch/pull/389
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:
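
For orientation, the core change in the diff below replaces ORO's
Perl5Compiler/Perl5Matcher loop with a precompiled java.util.regex.Pattern
and a Matcher.find() loop. The following is only a minimal sketch of that
shape, not the committed code: the class name RegexMigrationSketch and the
method extractUrls are hypothetical, the pattern string is copied from
OutlinkExtractor in the diff, and the time-limit check and Outlink
construction of the real method are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of Nutch.
public class RegexMigrationSketch {

  // Compiled once and reused: java.util.regex.Pattern is immutable and
  // safe to share, so no per-call compiler object is needed as with ORO.
  private static final Pattern URL_PATTERN = Pattern.compile(
      "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");

  // Returns every substring of plainText that matches URL_PATTERN.
  public static List<String> extractUrls(String plainText) {
    List<String> urls = new ArrayList<>();
    if (plainText == null) {
      return urls; // same null guard the diff adds to getOutlinks()
    }
    Matcher matcher = URL_PATTERN.matcher(plainText);
    // find() advances through the input, replacing ORO's
    // matcher.contains(input, pattern) / matcher.getMatch() pair.
    while (matcher.find()) {
      urls.add(matcher.group().trim());
    }
    return urls;
  }

  public static void main(String[] args) {
    System.out.println(extractUrls("see http://example.com/a now"));
  }
}
```

Matcher objects are not thread-safe, so the sketch creates one per call
while sharing only the compiled Pattern.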


diff --git a/LICENSE.txt b/LICENSE.txt
index 1b7a967d2..9badcdad6 100644
--- a/LICENSE.txt
+++ b/LICENSE.txt
@@ -1079,62 +1079,6 @@ http://www.python.org. Full license is here:
 
   http://www.python.org/download/releases/2.4.2/license/
 
-lib/jakarta-oro-2.0.8.jar
-
-/* ====================================================================
- * The Apache Software License, Version 1.1
- *
- * Copyright (c) 2000-2002 The Apache Software Foundation.  All rights
- * reserved.
- *
- * Redistribution and use in source and binary forms, with or without
- * modification, are permitted provided that the following conditions
- * are met:
- *
- * 1. Redistributions of source code must retain the above copyright
- *    notice, this list of conditions and the following disclaimer.
- *
- * 2. Redistributions in binary form must reproduce the above copyright
- *    notice, this list of conditions and the following disclaimer in
- *    the documentation and/or other materials provided with the
- *    distribution.
- *
- * 3. The end-user documentation included with the redistribution,
- *    if any, must include the following acknowledgment:
- *       "This product includes software developed by the
- *        Apache Software Foundation (http://www.apache.org/)."
- *    Alternately, this acknowledgment may appear in the software itself,
- *    if and wherever such third-party acknowledgments normally appear.
- *
- * 4. The names "Apache" and "Apache Software Foundation", "Jakarta-Oro" 
- *    must not be used to endorse or promote products derived from this
- *    software without prior written permission. For written
- *    permission, please contact [email protected].
- *
- * 5. Products derived from this software may not be called "Apache" 
- *    or "Jakarta-Oro", nor may "Apache" or "Jakarta-Oro" appear in their 
- *    name, without prior written permission of the Apache Software Foundation.
- *
- * THIS SOFTWARE IS PROVIDED ``AS IS'' AND ANY EXPRESSED OR IMPLIED
- * WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
- * OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
- * DISCLAIMED.  IN NO EVENT SHALL THE APACHE SOFTWARE FOUNDATION OR
- * ITS CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
- * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
- * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
- * USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
- * ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
- * OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
- * OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
- * SUCH DAMAGE.
- * ====================================================================
- *
- * This software consists of voluntary contributions made by many
- * individuals on behalf of the Apache Software Foundation.  For more
- * information on the Apache Software Foundation, please see
- * <http://www.apache.org/>.
- */
-
 lib/jetty-ext/commons-el.jar
 
 /*
diff --git a/conf/parse-plugins.xml b/conf/parse-plugins.xml
index 20c8724a9..2507976ec 100644
--- a/conf/parse-plugins.xml
+++ b/conf/parse-plugins.xml
@@ -43,6 +43,10 @@
                <plugin id="parse-zip" />
        </mimeType>
 
+       <mimeType name="application/javascript">
+               <plugin id="parse-js" />
+       </mimeType>
+
        <mimeType name="application/x-javascript">
                <plugin id="parse-js" />
        </mimeType>
diff --git a/conf/regex-normalize.xml.template b/conf/regex-normalize.xml.template
index ec60c1081..ed3f7108a 100644
--- a/conf/regex-normalize.xml.template
+++ b/conf/regex-normalize.xml.template
@@ -17,7 +17,8 @@
 -->
 <!-- This is the configuration file for the RegexUrlNormalize Class.
      This is intended so that users can specify substitutions to be
-     done on URLs. The regex engine that is used is Perl5 compatible.
+     done on URLs using the Java regex syntax, see
+     https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
      The rules are applied to URLs in the order they occur in this file.  -->
 
 <!-- WATCH OUT: an xml parser reads this file an ampersands must be
diff --git a/ivy/ivy.xml b/ivy/ivy.xml
index 112975ab9..5272de6cb 100644
--- a/ivy/ivy.xml
+++ b/ivy/ivy.xml
@@ -70,7 +70,6 @@
 
                <dependency org="xerces" name="xercesImpl" rev="2.11.0" />
                <dependency org="xerces" name="xmlParserAPIs" rev="2.6.2" />
-               <dependency org="oro" name="oro" rev="2.0.8" />
 
                <dependency org="com.google.guava" name="guava" rev="25.0-jre" />
 
diff --git a/src/java/org/apache/nutch/parse/OutlinkExtractor.java b/src/java/org/apache/nutch/parse/OutlinkExtractor.java
index ce9a614e7..c0b61d489 100644
--- a/src/java/org/apache/nutch/parse/OutlinkExtractor.java
+++ b/src/java/org/apache/nutch/parse/OutlinkExtractor.java
@@ -21,19 +21,14 @@
 import java.net.MalformedURLException;
 import java.util.ArrayList;
 import java.util.List;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import org.apache.hadoop.conf.Configuration;
 
 import org.slf4j.Logger;
 import org.slf4j.LoggerFactory;
 
-import org.apache.hadoop.conf.Configuration;
-import org.apache.oro.text.regex.MatchResult;
-import org.apache.oro.text.regex.Pattern;
-import org.apache.oro.text.regex.PatternCompiler;
-import org.apache.oro.text.regex.PatternMatcher;
-import org.apache.oro.text.regex.PatternMatcherInput;
-import org.apache.oro.text.regex.Perl5Compiler;
-import org.apache.oro.text.regex.Perl5Matcher;
-
 /**
  * Extractor to extract {@link org.apache.nutch.parse.Outlink}s / URLs from
  * plain text using Regular Expressions.
@@ -60,7 +55,8 @@
 
    *      </a>
    */
-  private static final String URL_PATTERN = "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)";
+  private static final Pattern URL_PATTERN = Pattern.compile(
+      "([A-Za-z][A-Za-z0-9+.-]{1,120}:[A-Za-z0-9/](([A-Za-z0-9$_.+!*,;/?:@&~=-])|%[A-Fa-f0-9]{2}){1,333}(#([a-zA-Z0-9][a-zA-Z0-9$_.+!*,;/?:@&~=%-]{0,1000}))?)");
 
   /**
    * Extracts <code>Outlink</code> from given plain text. Applying this method
@@ -72,7 +68,8 @@
    * 
    * @return Array of <code>Outlink</code>s within found in plainText
    */
-  public static Outlink[] getOutlinks(final String plainText, Configuration conf) {
+  public static Outlink[] getOutlinks(final String plainText,
+      Configuration conf) {
     return OutlinkExtractor.getOutlinks(plainText, "", conf);
   }
 
@@ -89,23 +86,20 @@
    */
   public static Outlink[] getOutlinks(final String plainText, String anchor,
       Configuration conf) {
+
+    if (plainText == null) {
+      return new Outlink[0];
+    }
+
     long start = System.currentTimeMillis();
     final List<Outlink> outlinks = new ArrayList<>();
 
     try {
-      final PatternCompiler cp = new Perl5Compiler();
-      final Pattern pattern = cp.compile(URL_PATTERN,
-          Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK
-              | Perl5Compiler.MULTILINE_MASK);
-      final PatternMatcher matcher = new Perl5Matcher();
-
-      final PatternMatcherInput input = new PatternMatcherInput(plainText);
-
-      MatchResult result;
+      Matcher matcher = URL_PATTERN.matcher(plainText);
       String url;
 
-      // loop the matches
-      while (matcher.contains(input, pattern)) {
+      // Check for stuff!
+      while (matcher.find()) {
         // if this is taking too long, stop matching
         // (SHOULD really check cpu time used so that heavily loaded systems
         // do not unnecessarily hit this limit.)
@@ -115,8 +109,9 @@
           }
           break;
         }
-        result = matcher.getMatch();
-        url = result.group(0);
+
+        url = matcher.group().trim();
+
         try {
           outlinks.add(new Outlink(url, anchor));
         } catch (MalformedURLException mue) {
diff --git a/src/plugin/parse-js/build.xml b/src/plugin/parse-js/build.xml
index d9c21463f..549373abd 100644
--- a/src/plugin/parse-js/build.xml
+++ b/src/plugin/parse-js/build.xml
@@ -19,4 +19,18 @@
 
   <import file="../build-plugin.xml"/>
 
+  <!-- Deploy Unit test dependencies -->
+  <target name="deps-test">
+    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
+    <ant target="deploy" inheritall="false" dir="../protocol-file"/>
+  </target>
+
+  <!-- for junit test -->
+  <mkdir dir="${build.test}/data"/>
+  <copy todir="${build.test}/data">
+    <fileset dir="sample">
+      <include name="*.html"/>
+      <include name="*.js"/>
+    </fileset>
+  </copy>
 </project>
diff --git a/src/plugin/parse-js/plugin.xml b/src/plugin/parse-js/plugin.xml
index 9c06c2acd..e55195a46 100644
--- a/src/plugin/parse-js/plugin.xml
+++ b/src/plugin/parse-js/plugin.xml
@@ -36,7 +36,7 @@
               point="org.apache.nutch.parse.Parser">
       <implementation id="JSParser"
          class="org.apache.nutch.parse.js.JSParseFilter">
-        <parameter name="contentType" value="application/x-javascript"/>
+        <parameter name="contentType" value="application/x-javascript|application/javascript"/>
         <parameter name="pathSuffix"  value="js"/>
       </implementation>
    </extension>
@@ -45,7 +45,7 @@
               point="org.apache.nutch.parse.HtmlParseFilter">
       <implementation id="JSParseFilter"
          class="org.apache.nutch.parse.js.JSParseFilter">
-        <parameter name="contentType" value="application/x-javascript"/>
+        <parameter name="contentType" value="application/x-javascript|application/javascript"/>
         <parameter name="pathSuffix"  value=""/>
       </implementation>
    </extension>
diff --git a/src/plugin/parse-js/sample/parse_embedded_js_test.html b/src/plugin/parse-js/sample/parse_embedded_js_test.html
new file mode 100644
index 000000000..351beacc3
--- /dev/null
+++ b/src/plugin/parse-js/sample/parse_embedded_js_test.html
@@ -0,0 +1,316 @@
+<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
+<html style="font-size: 16px;"><head>
+<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
+<meta content="Apache Forrest" name="Generator">
+<meta name="Forrest-version" content="0.9">
+<meta name="Forrest-skin-name" content="nutch">
+<title>About Apache Nutch</title>
+<link type="text/css" href="about_files/basic.css" rel="stylesheet">
+<link media="screen" type="text/css" href="about_files/screen.css" rel="stylesheet">
+<link media="print" type="text/css" href="about_files/print.css" rel="stylesheet">
+<link type="text/css" href="about_files/profile.css" rel="stylesheet">
+<script src="about_files/getBlank.js" language="javascript" type="text/javascript"></script><script src="about_files/getMenu.js" language="javascript" type="text/javascript"></script><style type="text/css">.menuitemgroup{display: none;}</style><script src="about_files/fontsize.js" language="javascript" type="text/javascript"></script>
+<link rel="shortcut icon" href="http://nutch.apache.org/images/favicon.ico">
+</head>
+<body style="font-size: 16px;" onload="init()">
+<script type="text/javascript">ndeSetTextSize();</script>
+<div id="top">
+<!--+
+    |breadtrail
+    +-->
+<div class="breadtrail">
+<a href="http://www.apache.org/">Apache</a> &gt; <a href="http://nutch.apache.org/">Nutch</a> &gt; <a href="http://nutch.apache.org/">Home</a><script src="about_files/breadcrumbs.js" language="JavaScript" type="text/javascript"></script> &gt; 
+</div>
+<!--+
+    |header
+    +-->
+<div class="header">
+<!--+
+    |start group logo
+    +-->
+<div class="grouplogo">
+<a href="http://www.apache.org/"><img class="logoImage" alt="Apache" src="about_files/feather-small.gif" title="Apache Software Foundation "></a>
+</div>
+<!--+
+    |end group logo
+    +-->
+<!--+
+    |start Project Logo
+    +-->
+<div class="projectlogo">
+<a href="http://nutch.apache.org/"><img class="logoImage" alt="Nutch" src="about_files/nutch_logo_tm.gif" title="Open Source Web Search Software"></a>
+</div>
+<!--+
+    |end Project Logo
+    +-->
+<!--+
+    |start Search
+    +-->
+<div class="searchbox">
+<script type="text/javascript">
+                      function selectProvider(form) {
+                        provider = form.elements['searchProvider'].value;
+                        if (provider == "any") {
+                          if (Math.random() > 0.5) {
+                            provider = "lucid";
+                          } else {
+                            provider = "sl";
+                          }
+                        }
+
+                        if (provider == "lucid") {
+                          form.action = "http://search.lucidimagination.com/p:nutch";
+                        } else if (provider == "sl") {
+                          form.action = "http://search-lucene.com/nutch";
+                        }
+
+                        days = 90; // cookie will be valid for 90 days
+                        date = new Date();
+                        date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
+                        expires = "; expires=" + date.toGMTString();
+                        document.cookie = "searchProvider=" + provider + expires + "; path=/";
+                      }
+                    </script>
+<form id="searchform" action="http://search.lucidimagination.com/p:nutch" method="get" class="roundtopsmall">
+<input onfocus="getBlank (this, 'Search the site with Solr');" size="25" name="q" id="query" value="Search the site with Solr" type="text">&nbsp; 
+                    <input onclick="selectProvider(this.form)" name="Search" value="Search" type="submit">
+                      @
+                      <select id="searchProvider" name="searchProvider"><option selected="selected" value="any">select provider</option><option value="lucid">Lucid Find</option><option value="sl">Search-Lucene</option></select><script type="text/javascript">
+                        if (document.cookie.length>0) {
+                          cStart=document.cookie.indexOf("searchProvider=");
+                          if (cStart!=-1) {
+                            cStart=cStart + "searchProvider=".length;
+                            cEnd=document.cookie.indexOf(";", cStart);
+                            if (cEnd==-1) {
+                              cEnd=document.cookie.length;
+                            }
+                            provider = unescape(document.cookie.substring(cStart,cEnd));
+                            document.forms['searchform'].elements['searchProvider'].value = provider;
+                          }
+                        }
+                      </script>
+</form>
+</div>
+<!--+
+    |end search
+    +-->
+<!--+
+    |start Tabs
+    +-->
+<ul id="tabs">
+<li class="current">
+<a class="selected" href="http://nutch.apache.org/index.html">Main</a>
+</li>
+<li>
+<a class="unselected" href="http://nutch.apache.org/wiki.html">Wiki</a>
+</li>
+<li>
+<a class="unselected" href="http://issues.apache.org/jira/browse/NUTCH">Jira</a>
+</li>
+</ul>
+<!--+
+    |end Tabs
+    +-->
+</div>
+</div>
+<div id="main">
+<div id="publishedStrip">
+<!--+
+    |start Subtabs
+    +-->
+<div id="level2tabs"></div>
+<!--+
+    |end Endtabs
+    +-->
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+//  --></script>Last Published: 07/10/2012 15:39:10
+</div>
+<!--+
+    |breadtrail
+    +-->
+<div class="breadtrail">
+
+             &nbsp;
+           </div>
+<!--+
+    |start Menu, mainarea
+    +-->
+<!--+
+    |start Menu
+    +-->
+<div id="menu">
+<div onclick="SwitchMenu('menu_selected_1.1', 'skin/')" id="menu_selected_1.1Title" class="menutitle" style="background-image: url('skin/images/chapter_open.gif');">Project</div>
+<div id="menu_selected_1.1" class="selectedmenuitemgroup" style="display: block;">
+<div class="menuitem">
+<a href="http://nutch.apache.org/index.html">News</a>
+</div>
+<div class="menupage">
+<div class="menupagetitle">About</div>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/credits.html">Credits</a>
+</div>
+<div class="menuitem">
+<a href="http://www.apache.org/foundation/thanks.html">Thanks</a>
+</div>
+<div class="menuitem">
+<a href="http://www.cafepress.com/nutch/">Buy Stuff</a>
+</div>
+<div class="menuitem">
+<a href="http://www.apache.org/foundation/sponsorship.html">Sponsorship</a>
+</div>
+<div class="menuitem">
+<a href="http://www.apache.org/licenses/">License</a>
+</div>
+<div class="menuitem">
+<a href="http://www.apache.org/security/">Security</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.2', 'skin/')" id="menu_1.2Title" class="menutitle">Documentation</div>
+<div id="menu_1.2" class="menuitemgroup">
+<div class="menuitem">
+<a href="http://nutch.apache.org/faq.html">FAQ</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/wiki.html">Wiki</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/tutorial.html">Tutorial</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/bot.html">Robot     </a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/apidocs-2.0/index.html">API Docs (2.0)</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/apidocs-1.5/index.html">API Docs (1.5.1)</a>
+</div>
+<div class="menuitem">
+<a href="https://builds.apache.org/job/Nutch-trunk/javadoc/">API Docs (trunk-nightly)</a>
+</div>
+<div class="menuitem">
+<a href="https://builds.apache.org/job/Nutch-nutchgora/javadoc/">API Docs (2.0-Dev-nightly)</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.3', 'skin/')" id="menu_1.3Title" class="menutitle">Resources</div>
+<div id="menu_1.3" class="menuitemgroup">
+<div class="menuitem">
+<a href="http://www.apache.org/dyn/closer.cgi/nutch/">Download</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/nightly.html">Nightly builds</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/sonar.html">Sonar Analysis</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/mailing_lists.html">Mailing Lists</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/issue_tracking.html">Issue Tracking</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/version_control.html">Version Control</a>
+</div>
+<div class="menuitem">
+<a href="http://nutch.apache.org/old_downloads.html">Older Downloads</a>
+</div>
+</div>
+<div onclick="SwitchMenu('menu_1.4', 'skin/')" id="menu_1.4Title" class="menutitle">Related Projects</div>
+<div id="menu_1.4" class="menuitemgroup">
+<div class="menuitem">
+<a href="http://lucene.apache.org/java/">Lucene</a>
+</div>
+<div class="menuitem">
+<a href="http://hadoop.apache.org/">Hadoop</a>
+</div>
+<div class="menuitem">
+<a href="http://lucene.apache.org/solr/">Solr</a>
+</div>
+<div class="menuitem">
+<a href="http://tika.apache.org/">Tika</a>
+</div>
+<div class="menuitem">
+<a href="http://gora.apache.org/">Gora</a>
+</div>
+</div>
+<div id="credit"></div>
+<div id="roundbottom">
+<img style="display: none" class="corner" alt="" src="about_files/rc-b-l-15-1body-2menu-3menu.png" height="15" width="15"></div>
+<!--+
+  |alternative credits
+  +-->
+<div id="credit2"></div>
+</div>
+<!--+
+    |end Menu
+    +-->
+<!--+
+    |start content
+    +-->
+<div id="content">
+<div title="Portable Document Format" class="pdflink">
+<a class="dida" href="http://nutch.apache.org/about.pdf"><img alt="PDF -icon" src="about_files/pdfdoc.gif" class="skin"><br>
+        PDF</a>
+</div>
+<h1>About Apache Nutch</h1>
+<div id="minitoc-area">
+<ul class="minitoc">
+<li>
+<a href="#Overview">Overview</a>
+</li>
+</ul>
+</div> 
+
+    
+<a name="N1000E"></a><a name="Overview"></a>
+<h2 class="h3">Overview</h2>
+<div class="section">
+<p>Apache Nutch is an open source web-search
+      software project.  Stemming from <a href="http://lucene.apache.org/java/">Apache Lucene</a>, it now builds 
+      on <a href="http://lucene.apache.org/solr/">Apache Solr</a> adding web-specifics, such as a crawler, 
+      a link-graph database and parsing support handled by <a href="http://tika.apache.org/">Apache Tika</a>
+      for HTML and and array other document formats.</p>
+<p>Apache Nutch can run on a single machine, but gains a lot of its
+      strength from running in a <a href="http://hadoop.apache.org/">Hadoop cluster</a>
+</p>
+<p>The system can be enhanced (eg other document formats can be 
+      parsed) using a highly flexible, easily extensible and thoroughly maintained
+       plugin infrastructure.</p>
+<p>For more information about Apache Nutch, please see the <a href="http://wiki.apache.org/nutch/">Nutch wiki.</a>
+</p>
+</div>
+
+  
+</div>
+<!--+
+    |end content
+    +-->
+<div class="clearboth">&nbsp;</div>
+</div>
+<div id="footer">
+<!--+
+    |start bottomstrip
+    +-->
+<div class="lastmodified">
+<script type="text/javascript"><!--
+document.write("Last Published: " + document.lastModified);
+//  --></script>Last Published: 07/10/2012 15:39:10
+</div>
+<div class="copyright">
+        Copyright ©
+         2005-2011 <a href="http://www.apache.org/licenses/">The Apache 
+Software Foundation.  
+Apache Nutch, Nutch, Apache, the Apache feather logo, and the Apache 
+Nutch project logo are trademarks of The Apache Software Foundation.
+  </a>
+</div>
+<!--+
+    |end bottomstrip
+    +-->
+</div>
+
+
+</body></html>
\ No newline at end of file
diff --git a/src/plugin/parse-js/sample/parse_pure_js_test.js b/src/plugin/parse-js/sample/parse_pure_js_test.js
new file mode 100644
index 000000000..f196313f8
--- /dev/null
+++ b/src/plugin/parse-js/sample/parse_pure_js_test.js
@@ -0,0 +1,24 @@
+// test data for link extraction from "pure" JavaScript
+
+function selectProvider(form) {
+    provider = form.elements['searchProvider'].value;
+    if (provider == "any") {
+        if (Math.random() > 0.5) {
+            provider = "lucid";
+        } else {
+            provider = "sl";
+        }
+    }
+
+    if (provider == "lucid") {
+        form.action = "http://search.lucidimagination.com/p:nutch";
+    } else if (provider == "sl") {
+        form.action = "http://search-lucene.com/nutch";
+    }
+
+    days = 90; // cookie will be valid for 90 days
+    date = new Date();
+    date.setTime(date.getTime() + (days * 24 * 60 * 60 * 1000));
+    expires = "; expires=" + date.toGMTString();
+    document.cookie = "searchProvider=" + provider + expires + "; path=/";
+}
diff --git a/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java b/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
index 7f6d10543..d2bb42e01 100644
--- a/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
+++ b/src/plugin/parse-js/src/java/org/apache/nutch/parse/js/JSParseFilter.java
@@ -26,9 +26,8 @@
 import java.util.ArrayList;
 import java.util.Arrays;
 import java.util.List;
-
-import org.slf4j.Logger;
-import org.slf4j.LoggerFactory;
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
 
 import org.apache.nutch.parse.HTMLMetaTags;
 import org.apache.nutch.parse.HtmlParseFilter;
@@ -43,19 +42,15 @@
 import org.apache.nutch.protocol.Content;
 import org.apache.nutch.util.NutchConfiguration;
 import org.apache.hadoop.conf.Configuration;
-import org.apache.oro.text.regex.MatchResult;
-import org.apache.oro.text.regex.Pattern;
-import org.apache.oro.text.regex.PatternCompiler;
-import org.apache.oro.text.regex.PatternMatcher;
-import org.apache.oro.text.regex.PatternMatcherInput;
-import org.apache.oro.text.regex.Perl5Compiler;
-import org.apache.oro.text.regex.Perl5Matcher;
 import org.w3c.dom.DocumentFragment;
 import org.w3c.dom.Element;
 import org.w3c.dom.NamedNodeMap;
 import org.w3c.dom.Node;
 import org.w3c.dom.NodeList;
 
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
 /**
  * This class is a heuristic link extractor for JavaScript files and code
  * snippets. The general idea of a two-pass regex matching comes from Heritrix.
@@ -69,6 +64,20 @@
 
   private Configuration conf;
 
+  /**
+   * Scan the JavaScript fragments of a HTML page looking for possible {@link Outlink}'s
+   * 
+   * @param content
+   *          page content
+   * @param parseResult
+   *          parsed content, result of running the HTML parser
+   * @param metaTags
+   *          within the {@link HTMLMetaTags}
+   * @param doc
+   *          The {@link DocumentFragment} object
+   * @return parse the actual {@link ParseResult} object with additional outlinks from JavaScript
+   */
+  @Override
   public ParseResult filter(Content content, ParseResult parseResult,
       HTMLMetaTags metaTags, DocumentFragment doc) {
 
@@ -154,13 +163,15 @@ private void walk(Node n, Parse parse, HTMLMetaTags metaTags, String base,
     }
   }
 
+  /**
+   * Parse a JavaScript file and extract outlinks
+   * 
+   * @param c
+   *          page content
+   * @return parse the actual {@link Parse} object
+   */
+  @Override
   public ParseResult getParse(Content c) {
-    String type = c.getContentType();
-    if (type != null && !type.trim().equals("")
-        && !type.toLowerCase().startsWith("application/x-javascript"))
-      return new ParseStatus(ParseStatus.FAILED_INVALID_FORMAT,
-          "Content not JavaScript: '" + type + "'").getEmptyParseResult(
-          c.getUrl(), getConf());
     String script = new String(c.getContent());
     Outlink[] outlinks = getJSLinks(script, "", c.getUrl());
     if (outlinks == null)
@@ -181,9 +192,13 @@ public ParseResult getParse(Content c) {
     return ParseResult.createParseResult(c.getUrl(), new ParseImpl(script, pd));
   }
 
-  private static final String STRING_PATTERN = "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)";
+  private static final Pattern STRING_PATTERN = Pattern.compile(
+      "(\\\\*(?:\"|\'))([^\\s\"\']+?)(?:\\1)",
+      Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
   // A simple pattern. This allows also invalid URL characters.
-  private static final String URI_PATTERN = "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)";
+  private static final Pattern URI_PATTERN = Pattern.compile(
+      "(^|\\s*?)/?\\S+?[/\\.]\\S+($|\\s*)",
+      Pattern.CASE_INSENSITIVE | Pattern.MULTILINE);
 
   // Alternative pattern, which limits valid url characters.
   // private static final String URI_PATTERN =
@@ -201,34 +216,20 @@ public ParseResult getParse(Content c) {
       baseURL = new URL(base);
     } catch (Exception e) {
       if (LOG.isErrorEnabled()) {
-        LOG.error("getJSLinks", e);
+        LOG.error("error assigning base URL", e);
       }
     }
 
     try {
-      final PatternCompiler cp = new Perl5Compiler();
-      final Pattern pattern = cp.compile(STRING_PATTERN,
-          Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK
-              | Perl5Compiler.MULTILINE_MASK);
-      final Pattern pattern1 = cp.compile(URI_PATTERN,
-          Perl5Compiler.CASE_INSENSITIVE_MASK | Perl5Compiler.READ_ONLY_MASK
-              | Perl5Compiler.MULTILINE_MASK);
-      final PatternMatcher matcher = new Perl5Matcher();
 
-      final PatternMatcher matcher1 = new Perl5Matcher();
-      final PatternMatcherInput input = new PatternMatcherInput(plainText);
+      Matcher matcher = STRING_PATTERN.matcher(plainText);
 
-      MatchResult result;
       String url;
 
-      // loop the matches
-      while (matcher.contains(input, pattern)) {
-        result = matcher.getMatch();
-        url = result.group(2);
-        PatternMatcherInput input1 = new PatternMatcherInput(url);
-        if (!matcher1.matches(input1, pattern1)) {
-          // if (LOG.isTraceEnabled()) { LOG.trace(" - invalid '" + url + "'");
-          // }
+      while (matcher.find()) {
+        url = matcher.group(2);
+        Matcher matcherUri = URI_PATTERN.matcher(url);
+        if (!matcherUri.matches()) {
           continue;
         }
         if (url.startsWith("www.")) {
@@ -256,7 +257,7 @@ public ParseResult getParse(Content c) {
       // if it is a malformed URL we just throw it away and continue with
       // extraction.
       if (LOG.isErrorEnabled()) {
-        LOG.error("getJSLinks", ex);
+        LOG.error(" - invalid or malformed URL", ex);
       }
     }
 
@@ -264,7 +265,7 @@ public ParseResult getParse(Content c) {
 
     // create array of the Outlinks
     if (outlinks != null && outlinks.size() > 0) {
-      retval = (Outlink[]) outlinks.toArray(new Outlink[0]);
+      retval = outlinks.toArray(new Outlink[0]);
     } else {
       retval = new Outlink[0];
     }
@@ -272,6 +273,14 @@ public ParseResult getParse(Content c) {
     return retval;
   }
 
+  /**
+   * Main method which can be run from command line with the plugin option. The
+   * method takes two arguments e.g. o.a.n.parse.js.JSParseFilter file.js
+   * baseURL
+   * 
+   * @param args
+   * @throws Exception
+   */
   public static void main(String[] args) throws Exception {
     if (args.length < 2) {
       System.err.println(JSParseFilter.class.getName() + " file.js baseURL");
diff --git a/src/plugin/parse-js/src/test/org/apache/nutch/parse/js/TestJSParseFilter.java b/src/plugin/parse-js/src/test/org/apache/nutch/parse/js/TestJSParseFilter.java
new file mode 100644
index 000000000..01bee1321
--- /dev/null
+++ b/src/plugin/parse-js/src/test/org/apache/nutch/parse/js/TestJSParseFilter.java
@@ -0,0 +1,116 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.nutch.parse.js;
+
+import static org.junit.Assert.assertEquals;
+import static org.junit.Assert.assertTrue;
+
+import java.io.File;
+import java.io.IOException;
+import java.lang.invoke.MethodHandles;
+import java.util.Set;
+import java.util.TreeSet;
+
+import org.apache.hadoop.conf.Configuration;
+import org.apache.hadoop.io.Text;
+import org.apache.nutch.crawl.CrawlDatum;
+import org.apache.nutch.parse.Outlink;
+import org.apache.nutch.parse.Parse;
+import org.apache.nutch.parse.ParseException;
+import org.apache.nutch.parse.ParseUtil;
+import org.apache.nutch.protocol.Content;
+import org.apache.nutch.protocol.Protocol;
+import org.apache.nutch.protocol.ProtocolException;
+import org.apache.nutch.protocol.ProtocolFactory;
+import org.apache.nutch.util.NutchConfiguration;
+import org.junit.Before;
+import org.junit.Test;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+/**
+ * JUnit test case for {@link JSParseFilter} which tests
+ * <ol>
+ * <li>That 2 outlinks are extracted from JavaScript snippets embedded in
+ * HTML</li>
+ * <li>That X outlinks are extracted from a pure JavaScript file (this is
+ * temporarily disabled)</li>
+ * </ol>
+ */
+public class TestJSParseFilter {
+
+  private static final Logger LOG = LoggerFactory
+      .getLogger(MethodHandles.lookup().lookupClass());
+
+  private String fileSeparator = System.getProperty("file.separator");
+
+  // This system property is defined in ./src/plugin/build-plugin.xml
+  private String sampleDir = System.getProperty("test.data", ".");
+
+  // Make sure sample files are copied to "test.data" as specified in
+  // ./src/plugin/parse-js/build.xml during plugin compilation.
+
+  private Configuration conf;
+
+  @Before
+  public void setUp() {
+    conf = NutchConfiguration.create();
+    conf.set("file.content.limit", "-1");
+    conf.set("plugin.includes", "protocol-file|parse-(html|js)");
+  }
+
+  public Outlink[] getOutlinks(String sampleFile)
+      throws ProtocolException, ParseException, IOException {
+    String urlString;
+    Parse parse;
+
+    urlString = "file:" + sampleDir + fileSeparator + sampleFile;
+    LOG.info("Parsing {}", urlString);
+    Protocol protocol = new ProtocolFactory(conf).getProtocol(urlString);
+    Content content = protocol
+        .getProtocolOutput(new Text(urlString), new CrawlDatum()).getContent();
+    parse = new ParseUtil(conf).parse(content).get(content.getUrl());
+    LOG.info(parse.getData().toString());
+    return parse.getData().getOutlinks();
+  }
+
+  @Test
+  public void testJavaScriptOutlinkExtraction()
+      throws ProtocolException, ParseException, IOException {
+    String[] filenames = new File(sampleDir).list();
+    for (int i = 0; i < filenames.length; i++) {
+      Outlink[] outlinks = getOutlinks(filenames[i]);
+      if (filenames[i].endsWith("parse_pure_js_test.js")) {
+        assertEquals("number of outlinks in .js test file should be X", 2,
+            outlinks.length);
+        assertEquals("http://search.lucidimagination.com/p:nutch", outlinks[0].getToUrl());
+        assertEquals("http://search-lucene.com/nutch", outlinks[1].getToUrl());
+      } else {
+        assertTrue("number of outlinks in .html file should be at least 2", outlinks.length >= 2);
+        Set<String> outlinkSet = new TreeSet<>();
+        for (Outlink o : outlinks) {
+          outlinkSet.add(o.getToUrl());
+        }
+        assertTrue("http://search.lucidimagination.com/p:nutch not in outlinks",
+            outlinkSet.contains("http://search.lucidimagination.com/p:nutch"));
+        assertTrue("http://search-lucene.com/nutch not in outlinks",
+            outlinkSet.contains("http://search-lucene.com/nutch"));
+      }
+    }
+  }
+
+}
diff --git a/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml b/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
index 4d6eabcf2..3d1f7186c 100644
--- a/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
+++ b/src/plugin/urlnormalizer-regex/sample/regex-normalize-default.xml
@@ -1,7 +1,24 @@
 <?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
 <!-- This is the configuration file for the RegexUrlNormalize Class.
      This is intended so that users can specify substitutions to be
-     done on URLs. The regex engine that is used is Perl5 compatible.
+     done on URLs using the Java regex syntax, see
+     https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
      The rules are applied to URLs in the order they occur in this file.  -->
 
 <!-- WATCH OUT: an xml parser reads this file an ampersands must be
diff --git a/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.xml b/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.xml
index 369896878..fc8e05e2d 100644
--- a/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.xml
+++ b/src/plugin/urlnormalizer-regex/sample/regex-normalize-scope1.xml
@@ -1,7 +1,24 @@
 <?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
 <!-- This is the configuration file for the RegexUrlNormalize Class.
      This is intended so that users can specify substitutions to be
-     done on URLs. The regex engine that is used is Perl5 compatible.
+     done on URLs using the Java regex syntax, see
+     https://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html
      The rules are applied to URLs in the order they occur in this file.  -->
 
 <!-- WATCH OUT: an xml parser reads this file an ampersands must be
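
The sample normalizer files above now refer readers to the Java regex
syntax. As a rough illustration of what that means for a rule, the sketch
below applies one pattern/substitution pair with java.util.regex. The class
name RegexNormalizeSketch and the rule itself (dropping an explicit ":80"
from http URLs) are assumptions made for this example; they are not taken
from the patch or from the Nutch sources.

```java
import java.util.regex.Pattern;

// Illustrative sketch only: the class name and the rule are hypothetical
// and are not part of the Nutch code base or of this pull request.
final class RegexNormalizeSketch {

  // Hypothetical rule: match an http URL whose authority ends in ":80",
  // followed either by a slash or by the end of the string.
  private static final Pattern DEFAULT_PORT =
      Pattern.compile("^(http://[^/:]+):80(/|$)");

  // In a java.util.regex replacement string, $1 and $2 refer to the
  // first and second capture groups of the pattern.
  private static final String SUBSTITUTION = "$1$2";

  // Returns the URL with the rule applied, or the URL unchanged if the
  // pattern does not match.
  static String normalize(String url) {
    return DEFAULT_PORT.matcher(url).replaceAll(SUBSTITUTION);
  }

  public static void main(String[] args) {
    System.out.println(normalize("http://example.com:80/index.html"));
    System.out.println(normalize("http://example.com:8080/index.html"));
  }
}
```

The pattern is compiled once and reused, since Pattern instances are
immutable and safe to share; a URL that does not match the rule is
returned as it was passed in.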


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


> Get rid of oro
> --------------
>
>                 Key: NUTCH-2192
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2192
>             Project: Nutch
>          Issue Type: Task
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 2.4, 1.16
>
>         Attachments: NUTCH-2192.patch
>
>
> Couple of classes still rely on oro, we should get rid of it.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
