Hi Albin, you mean NUTCH-1870, right? I'm in the process of reviewing your patch. I'm just stuck preparing the boilerplate required to integrate parse-xsl into the build, tests, and javadoc. I've added the JAXB dependencies to Ivy, but the xjc task fails, presumably because of a version mismatch. See the attached patch. If you can resolve this problem, that would be great!
Also, we need a configuration template in conf/: just one rules file and one transformer file, ideally with some examples (commented out), so that people have something to start with and do not need to read external material. Your blog [1] is great, but it's better to have it at hand. Also, conf/ is the first place to look.

Thanks,
Sebastian

[1] http://albinscoding.wordpress.com/2014/09/25/xsl-parser-for-apache-nutch/

On 11/01/2014 09:48 PM, Albinscode wrote:
> Hello everybody,
>
> If some more efforts are to be done on NUTCH-1740, I'll be glad to
> help. I developed this plugin because I was amongst people that didn't
> want to create new plugins just for few metadata extraction matters ;)
>
> 2014-11-01 19:47 GMT+01:00 Lewis John McGibbney (JIRA) <j...@apache.org>:
>>
>> [
>> https://issues.apache.org/jira/browse/NUTCH-1644?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
>> ]
>>
>> Lewis John McGibbney updated NUTCH-1644:
>> ----------------------------------------
>>     Fix Version/s:     (was: 2.3)
>>                    2.4
>>
>>> Should have a parser that uses xpath
>>> ------------------------------------
>>>
>>>                 Key: NUTCH-1644
>>>                 URL: https://issues.apache.org/jira/browse/NUTCH-1644
>>>             Project: Nutch
>>>          Issue Type: New Feature
>>>          Components: parser
>>>    Affects Versions: 2.2.1
>>>            Reporter: cihad güzel
>>>            Assignee: Lewis John McGibbney
>>>              Labels: parser, xpath
>>>             Fix For: 2.4
>>>
>>>         Attachments: NUTCH-1644.patch
>>>
>>>
>>> May want to parse some url via xpath. May be blog or news web sites. Should
>>> be a plugin using xpath parse.
>>
>>
>>
>> --
>> This message was sent by Atlassian JIRA
>> (v6.3.4#6332)
diff --git build.xml build.xml
index ec1cee4..c157b2d 100644
--- build.xml
+++ build.xml
@@ -190,6 +190,7 @@
       <packageset dir="${plugins.dir}/parse-metatags/src/java"/>
       <packageset dir="${plugins.dir}/parse-swf/src/java"/>
       <packageset dir="${plugins.dir}/parse-tika/src/java"/>
+      <packageset dir="${plugins.dir}/parse-xsl/src/java"/>
       <packageset dir="${plugins.dir}/parse-zip/src/java"/>
       <packageset dir="${plugins.dir}/protocol-file/src/java"/>
       <packageset dir="${plugins.dir}/protocol-ftp/src/java"/>
@@ -595,6 +596,7 @@
       <packageset dir="${plugins.dir}/parse-metatags/src/java"/>
       <packageset dir="${plugins.dir}/parse-swf/src/java"/>
       <packageset dir="${plugins.dir}/parse-tika/src/java"/>
+      <packageset dir="${plugins.dir}/parse-xsl/src/java"/>
       <packageset dir="${plugins.dir}/parse-zip/src/java"/>
       <packageset dir="${plugins.dir}/protocol-file/src/java"/>
       <packageset dir="${plugins.dir}/protocol-ftp/src/java"/>
@@ -984,6 +986,8 @@
         <source path="${plugins.dir}/parse-swf/src/test/" />
         <source path="${plugins.dir}/parse-tika/src/java/" />
         <source path="${plugins.dir}/parse-tika/src/test/" />
+        <source path="${plugins.dir}/parse-xsl/src/java/" />
+        <source path="${plugins.dir}/parse-xsl/src/test/" />
         <source path="${plugins.dir}/parse-zip/src/java/" />
         <source path="${plugins.dir}/parse-zip/src/test/" />
         <source path="${plugins.dir}/protocol-file/src/java/" />
diff --git default.properties default.properties
index e9415cb..73a53fe 100644
--- default.properties
+++ default.properties
@@ -174,5 +174,5 @@ plugins.misc=\
    org.apache.nutch.collection*:\
    org.apache.nutch.analysis.lang*:\
    org.creativecommons.nutch*:\
-   org.apache.nutch.microformats.reltag*
-
+   org.apache.nutch.microformats.reltag*:\
+   org.apache.nutch.parse.xsl*
\ No newline at end of file
diff --git src/plugin/build.xml src/plugin/build.xml
index 4ce6bee..eef9097 100755
--- src/plugin/build.xml
+++ src/plugin/build.xml
@@ -54,6 +54,7 @@
      <ant dir="parse-metatags" target="deploy"/>
      <ant dir="parse-swf" target="deploy"/>
      <ant dir="parse-tika" target="deploy"/>
+     <ant dir="parse-xsl" target="deploy"/>
      <ant dir="parse-zip" target="deploy"/>
      <ant dir="scoring-depth" target="deploy"/>
      <ant dir="scoring-opic" target="deploy"/>
@@ -96,6 +97,7 @@
      <ant dir="parse-metatags" target="test"/>
      <ant dir="parse-swf" target="test"/>
      <ant dir="parse-tika" target="test"/>
+     <ant dir="parse-xsl" target="test"/>
      <ant dir="parse-zip" target="test"/>
      <ant dir="subcollection" target="test"/>
      <ant dir="urlfilter-automaton" target="test"/>
@@ -147,6 +149,7 @@
     <ant dir="parse-metatags" target="clean"/>
     <ant dir="parse-swf" target="clean"/>
     <ant dir="parse-tika" target="clean"/>
+    <ant dir="parse-xsl" target="clean"/>
     <ant dir="parse-zip" target="clean"/>
     <ant dir="scoring-depth" target="clean"/>
     <ant dir="scoring-opic" target="clean"/>
diff --git src/plugin/parse-xsl/build.xml src/plugin/parse-xsl/build.xml
new file mode 100644
index 0000000..e4a53a1
--- /dev/null
+++ src/plugin/parse-xsl/build.xml
@@ -0,0 +1,45 @@
+<?xml version="1.0"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements.  See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License.  You may obtain a copy of the License at
+
+     http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
+<project name="parse-xsl" default="jar-core">
+
+  <import file="../build-plugin.xml"/>
+
+  <!-- Build compilation dependencies -->
+  <target name="deps-jar">
+    <ant target="jar" inheritall="false" dir="../lib-nekohtml"/>
+    <ant target="jar" inheritall="false" dir="../parse-html"/>
+  </target>
+
+  <!-- Add compilation dependencies to classpath -->
+  <path id="plugin.deps">
+    <fileset dir="${nutch.root}/build">
+      <include name="**/lib-nekohtml/*.jar" />
+      <include name="**/parse-html/*.jar" />
+    </fileset>
+  </path>
+
+  <taskdef name="xjc" classname="com.sun.tools.xjc.XJCTask">
+    <classpath>
+      <path refid="classpath"/>
+    </classpath>
+  </taskdef>
+
+  <xjc schema="conf/documents.xsd" destdir="src/java" package="org.apache.nutch.parse.xsl.xml.document"/>
+  <xjc schema="conf/rules.xsd" destdir="src/java" package="org.apache.nutch.parse.xsl.xml.rule"/>
+
+</project>
diff --git src/plugin/parse-xsl/ivy.xml src/plugin/parse-xsl/ivy.xml
new file mode 100644
index 0000000..30bd9af
--- /dev/null
+++ src/plugin/parse-xsl/ivy.xml
@@ -0,0 +1,46 @@
+<?xml version="1.0" ?>
+
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+       http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+
+<ivy-module version="1.0">
+  <info organisation="org.apache.nutch" module="${ant.project.name}">
+    <license name="Apache 2.0"/>
+    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
+    <description>
+        Apache Nutch
+    </description>
+  </info>
+
+  <configurations>
+    <include file="../../../ivy/ivy-configurations.xml"/>
+  </configurations>
+
+  <publications>
+    <!--get the artifact from our module name-->
+    <artifact conf="master"/>
+  </publications>
+
+  <dependencies>
+    <dependency org="org.ccil.cowan.tagsoup" name="tagsoup" rev="1.2.1"/>
+    <dependency org="com.sun.xml.bind" name="jaxb-xjc" rev="2.2.11"/>
+    <dependency org="com.sun.xml.bind" name="jaxb-impl" rev="2.2.11"/>
+    <dependency org="com.sun.xml.bind" name="jaxb-jxc" rev="2.2.11"/>
+    <dependency org="com.sun.xml.bind" name="jaxb-core" rev="2.2.11"/>
+  </dependencies>
+
+</ivy-module>
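
As a starting point for the conf/ template requested in the message, a transformer file with one active and one commented-out example might look roughly like the sketch below. This is only an illustration: the output elements (`documents`, `document`, `field`) and the field names are assumptions standing in for whatever conf/documents.xsd actually defines, and the XPath expressions assume lower-case HTML element names, which depends on how the HTML parser is configured.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical transformer template for conf/.
     Output element and field names are assumptions for illustration;
     adjust them to match conf/documents.xsd. -->
<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

  <xsl:output method="xml" indent="yes"/>

  <xsl:template match="/">
    <documents>
      <document>
        <!-- Active example: extract the page title into a field. -->
        <field name="title">
          <xsl:value-of select="normalize-space(//title)"/>
        </field>

        <!-- Commented-out example: extract the meta description.
        <field name="description">
          <xsl:value-of select="//meta[@name='description']/@content"/>
        </field>
        -->
      </document>
    </documents>
  </xsl:template>

</xsl:stylesheet>
```

Keeping the additional examples commented out lets the file ship in conf/ without changing what gets extracted by default, while still showing users the pattern to copy.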