Hi Albin,

you mean NUTCH-1870, right?
I'm in the process of reviewing your patch.
I'm just stuck preparing the boilerplate required
to integrate parse-xsl into the build, tests, and javadoc.
I've added the JAXB dependencies to ivy,
but the xjc task fails, presumably because
there is a version mismatch.
See the attached patch. If you can resolve this problem,
that would be great!
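
One thing that might be worth trying, as an untested sketch only: in the
attached build.xml the taskdef and both xjc calls sit at the top level of
the project, so they run while the build file is loaded. Moving them into
a target that runs after dependency resolution could rule out a classpath
problem before blaming the versions. The target name "generate-jaxb" is
made up for illustration, and I am assuming that build-plugin.xml offers a
"resolve-default" target and that the "classpath" reference then contains
the JAXB jars.

```xml
<!-- Hypothetical target; the name "generate-jaxb" is an assumption. -->
<!-- Assumes "resolve-default" exists in ../build-plugin.xml and fills "classpath". -->
<target name="generate-jaxb" depends="resolve-default">

  <!-- Define the xjc task only once the JAXB jars have been resolved. -->
  <taskdef name="xjc" classname="com.sun.tools.xjc.XJCTask">
    <classpath>
      <path refid="classpath"/>
    </classpath>
  </taskdef>

  <!-- Same two schema compilations as in the attached patch. -->
  <xjc schema="conf/documents.xsd" destdir="src/java"
       package="org.apache.nutch.parse.xsl.xml.document"/>
  <xjc schema="conf/rules.xsd" destdir="src/java"
       package="org.apache.nutch.parse.xsl.xml.rule"/>
</target>
```

If the task still fails when invoked this way, a version mismatch between
the JAXB artifacts becomes the more likely explanation.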

Also, we need a configuration template in conf/:
just one rules file and one transformer file,
ideally with some examples (commented out)
that people can start from, so they do not need
to read external material. Your blog [1] is great,
but it's better to have it at hand. Also, conf/
is the first place to look.
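
To illustrate the kind of template meant here, a rough sketch of a
transformer file with a commented-out example. Everything specific in it
is an assumption: the output elements "documents", "document" and "field"
are placeholders for whatever documents.xsd really defines, and the XPath
expression may need adjusting to how the HTML is parsed (element case,
namespaces).

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical transformer template for conf/; element names are placeholders. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <documents>
      <document>
        <!-- Example (commented out): copy the page title into a "title" field.
        <field name="title">
          <xsl:value-of select="//title"/>
        </field>
        -->
      </document>
    </documents>
  </xsl:template>

</xsl:stylesheet>
```

A user would uncomment the example, adapt the select expression, and add
further fields as needed.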

Thanks,
Sebastian

[1] http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/


On 11/01/2014 09:48 PM, Albinscode wrote:
> Hello everybody,
> 
> If some more efforts are to be done on NUTCH-1740, I'll be glad to
> help. I developed this plugin because I was amongst people that didn't
> want to create new plugins just for few metadata extraction matters ;)
> 
> 2014-11-01 19:47 GMT+01:00 Lewis John McGibbney (JIRA) <j...@apache.org>:
>>
>>      [ 
>> https://issues.apache.org/jira/browse/NUTCH-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>>  ]
>>
>> Lewis John McGibbney updated NUTCH-1644:
>> ----------------------------------------
>>     Fix Version/s:     (was: 2.3)
>>                    2.4
>>
>>> Should have a parser that uses xpath
>>> ------------------------------------
>>>
>>>                 Key: NUTCH-1644
>>>                 URL: https://issues.apache.org/jira/browse/NUTCH-1644
>>>             Project: Nutch
>>>          Issue Type: New Feature
>>>          Components: parser
>>>    Affects Versions: 2.2.1
>>>            Reporter: cihad güzel
>>>            Assignee: Lewis John McGibbney
>>>              Labels: parser, xpath
>>>             Fix For: 2.4
>>>
>>>         Attachments: NUTCH-1644.patch
>>>
>>>
>>> May want to parse some url via xpath. May be blog or news web sites. Should 
>>> be a plugin using xpath parse.
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)

diff --git build.xml build.xml
index ec1cee4..c157b2d 100644
--- build.xml
+++ build.xml
@@ -190,6 +190,7 @@
       <packageset dir="${plugins.dir}/parse-metatags/src/java"/>
       <packageset dir="${plugins.dir}/parse-swf/src/java"/>
       <packageset dir="${plugins.dir}/parse-tika/src/java"/>
+      <packageset dir="${plugins.dir}/parse-xsl/src/java"/>
       <packageset dir="${plugins.dir}/parse-zip/src/java"/>
       <packageset dir="${plugins.dir}/protocol-file/src/java"/>
       <packageset dir="${plugins.dir}/protocol-ftp/src/java"/>
@@ -595,6 +596,7 @@
       <packageset dir="${plugins.dir}/parse-metatags/src/java"/>
       <packageset dir="${plugins.dir}/parse-swf/src/java"/>
       <packageset dir="${plugins.dir}/parse-tika/src/java"/>
+      <packageset dir="${plugins.dir}/parse-xsl/src/java"/>
       <packageset dir="${plugins.dir}/parse-zip/src/java"/>
       <packageset dir="${plugins.dir}/protocol-file/src/java"/>
       <packageset dir="${plugins.dir}/protocol-ftp/src/java"/>
@@ -984,6 +986,8 @@
         <source path="${plugins.dir}/parse-swf/src/test/" />
         <source path="${plugins.dir}/parse-tika/src/java/" />
         <source path="${plugins.dir}/parse-tika/src/test/" />
+        <source path="${plugins.dir}/parse-xsl/src/java/" />
+        <source path="${plugins.dir}/parse-xsl/src/test/" />
         <source path="${plugins.dir}/parse-zip/src/java/" />
         <source path="${plugins.dir}/parse-zip/src/test/" />
         <source path="${plugins.dir}/protocol-file/src/java/" />
diff --git default.properties default.properties
index e9415cb..73a53fe 100644
--- default.properties
+++ default.properties
@@ -174,5 +174,5 @@ plugins.misc=\
    org.apache.nutch.collection*:\
    org.apache.nutch.analysis.lang*:\
    org.creativecommons.nutch*:\
-   org.apache.nutch.microformats.reltag*
-   
+   org.apache.nutch.microformats.reltag*:\
+   org.apache.nutch.parse.xsl*
\ No newline at end of file
diff --git src/plugin/build.xml src/plugin/build.xml
index 4ce6bee..eef9097 100755
--- src/plugin/build.xml
+++ src/plugin/build.xml
@@ -54,6 +54,7 @@
      <ant dir="parse-metatags" target="deploy"/>
      <ant dir="parse-swf" target="deploy"/>
      <ant dir="parse-tika" target="deploy"/>
+     <ant dir="parse-xsl" target="deploy"/>
      <ant dir="parse-zip" target="deploy"/>
      <ant dir="scoring-depth" target="deploy"/>
      <ant dir="scoring-opic" target="deploy"/>
@@ -96,6 +97,7 @@
      <ant dir="parse-metatags" target="test"/>
      <ant dir="parse-swf" target="test"/>
      <ant dir="parse-tika" target="test"/>
+     <ant dir="parse-xsl" target="test"/>
      <ant dir="parse-zip" target="test"/>
      <ant dir="subcollection" target="test"/>
      <ant dir="urlfilter-automaton" target="test"/>
@@ -147,6 +149,7 @@
     <ant dir="parse-metatags" target="clean"/>
     <ant dir="parse-swf" target="clean"/>
     <ant dir="parse-tika" target="clean"/>
+    <ant dir="parse-xsl" target="clean"/>
     <ant dir="parse-zip" target="clean"/>
     <ant dir="scoring-depth" target="clean"/>
     <ant dir="scoring-opic" target="clean"/>
diff --git src/plugin/parse-xsl/build.xml src/plugin/parse-xsl/build.xml
new file mode 100644
index 0000000..e4a53a1
--- /dev/null
+++ src/plugin/parse-xsl/build.xml
@@ -0,0 +1,45 @@
+<?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project name="parse-xsl" default="jar-core">
+
+	<import file="../build-plugin.xml"/>
+
+	<!-- Build compilation dependencies -->
+	<target name="deps-jar">
+	  <ant target="jar" inheritall="false" dir="../lib-nekohtml"/>
+	  <ant target="jar" inheritall="false" dir="../parse-html"/>
+	</target>
+
+	<!-- Add compilation dependencies to classpath -->
+	<path id="plugin.deps">
+	  <fileset dir="${nutch.root}/build">
+	    <include name="**/lib-nekohtml/*.jar" />
+	    <include name="**/parse-html/*.jar" />
+	  </fileset>
+	</path>
+
+	<taskdef name="xjc" classname="com.sun.tools.xjc.XJCTask">
+	  <classpath>
+	    <path refid="classpath"/>
+	  </classpath>
+	</taskdef>
+
+	<xjc schema="conf/documents.xsd" destdir="src/java" package="org.apache.nutch.parse.xsl.xml.document"/>
+	<xjc schema="conf/rules.xsd" destdir="src/java" package="org.apache.nutch.parse.xsl.xml.rule"/>
+
+</project>
diff --git src/plugin/parse-xsl/ivy.xml src/plugin/parse-xsl/ivy.xml
new file mode 100644
index 0000000..30bd9af
--- /dev/null
+++ src/plugin/parse-xsl/ivy.xml
@@ -0,0 +1,46 @@
+<?xml version="1.0" ?>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<ivy-module version="1.0">
+  <info organisation="org.apache.nutch" module="${ant.project.name}">
+    <license name="Apache 2.0"/>
+    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
+    <description>
+        Apache Nutch
+    </description>
+  </info>
+
+  <configurations>
+    <include file="../../../ivy/ivy-configurations.xml"/>
+  </configurations>
+
+  <publications>
+    <!--get the artifact from our module name-->
+    <artifact conf="master"/>
+  </publications>
+
+  <dependencies>
+   <dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1"/>
+   <dependency org="com.sun.xml.bind" name="jaxb-xjc" rev="2.2.11"/>
+   <dependency org="com.sun.xml.bind" name="jaxb-impl" rev="2.2.11"/>
+   <dependency org="com.sun.xml.bind" name="jaxb-jxc" rev="2.2.11"/>
+   <dependency org="com.sun.xml.bind" name="jaxb-core" rev="2.2.11"/>
+  </dependencies>
+
+</ivy-module>
