Massimo, http://nutch.apache.org/mailing_lists.html
=> [email protected] Thanks On 26 February 2015 at 19:11, Massimo Miccoli <[email protected]> wrote: > > > Massimo > > > Il giorno 26/feb/2015, alle ore 19:31, [email protected] ha scritto: > > > > Author: lewismc > > Date: Thu Feb 26 18:31:39 2015 > > New Revision: 1662530 > > > > URL: http://svn.apache.org/r1662530 > > Log: > > NUTCH-1933 nutch-selenium plugin > > > > Added: > > nutch/trunk/src/plugin/lib-selenium/ > > nutch/trunk/src/plugin/lib-selenium/build.xml > > nutch/trunk/src/plugin/lib-selenium/ivy.xml > > nutch/trunk/src/plugin/lib-selenium/plugin.xml > > nutch/trunk/src/plugin/lib-selenium/src/ > > nutch/trunk/src/plugin/lib-selenium/src/java/ > > nutch/trunk/src/plugin/lib-selenium/src/java/org/ > > nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/ > > nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/ > > > nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/ > > > nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/ > > > nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java > > nutch/trunk/src/plugin/protocol-selenium/ > > nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml > > nutch/trunk/src/plugin/protocol-selenium/build.xml > > nutch/trunk/src/plugin/protocol-selenium/ivy.xml > > nutch/trunk/src/plugin/protocol-selenium/plugin.xml > > nutch/trunk/src/plugin/protocol-selenium/src/ > > nutch/trunk/src/plugin/protocol-selenium/src/java/ > > nutch/trunk/src/plugin/protocol-selenium/src/java/org/ > > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/ > > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/ > > > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/ > > > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/ > > > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java > > > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java > > > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html > > nutch/trunk/src/plugin/protocol-selenium/src/target/ > > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/ > > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/ > > > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/ > > > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/ > > > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/ > > > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/ > > > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html > > Modified: > > nutch/trunk/CHANGES.txt > > nutch/trunk/build.xml > > nutch/trunk/ivy/ivy.xml > > nutch/trunk/src/plugin/build.xml > > > > Modified: nutch/trunk/CHANGES.txt > > URL: > http://svn.apache.org/viewvc/nutch/trunk/CHANGES.txt?rev=1662530&r1=1662529&r2=1662530&view=diff > > > ============================================================================== > > --- nutch/trunk/CHANGES.txt (original) > > +++ nutch/trunk/CHANGES.txt Thu Feb 26 18:31:39 2015 > > @@ -2,6 +2,8 @@ Nutch Change Log > > > > Nutch Current Development 1.10-SNAPSHOT > > > > +* NUTCH-1933 nutch-selenium plugin (Mo Omer, Mohammad Al-Moshin, > lewismc) > > + > > * NUTCH-827 HTTP POST Authentication (Jasper van Veghel, yuanyun.cn, > snagel, lewismc) > > > > * NUTCH-1724 LinkDBReader to support regex output filtering (markus) > > > > Modified: nutch/trunk/build.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/build.xml?rev=1662530&r1=1662529&r2=1662530&view=diff > > > ============================================================================== > > --- nutch/trunk/build.xml (original) > > +++ nutch/trunk/build.xml Thu Feb 26 18:31:39 2015 > > @@ -184,6 +184,7 @@ > > <packageset dir="${plugins.dir}/indexer-solr/src/java"/> > > <packageset dir="${plugins.dir}/language-identifier/src/java"/> > > <packageset dir="${plugins.dir}/lib-http/src/java"/> > > + <packageset dir="${plugins.dir}/lib-selenium/src/java"/> > > <packageset dir="${plugins.dir}/lib-regex-filter/src/java"/> > > <packageset dir="${plugins.dir}/microformats-reltag/src/java"/> > > <packageset dir="${plugins.dir}/parse-ext/src/java"/> > > @@ -197,6 +198,7 @@ > > <packageset dir="${plugins.dir}/protocol-ftp/src/java"/> > > <packageset dir="${plugins.dir}/protocol-http/src/java"/> > > <packageset dir="${plugins.dir}/protocol-httpclient/src/java"/> > > + <packageset dir="${plugins.dir}/protocol-selenium/src/java"/> > > <packageset dir="${plugins.dir}/scoring-depth/src/java"/> > > <packageset dir="${plugins.dir}/scoring-link/src/java"/> > > <packageset dir="${plugins.dir}/scoring-opic/src/java"/> > > @@ -591,6 +593,7 @@ > > <packageset dir="${plugins.dir}/indexer-solr/src/java"/> > > <packageset dir="${plugins.dir}/language-identifier/src/java"/> > > <packageset dir="${plugins.dir}/lib-http/src/java"/> > > + <packageset dir="${plugins.dir}/lib-selenium/src/java"/> > > <packageset dir="${plugins.dir}/lib-regex-filter/src/java"/> > > <packageset dir="${plugins.dir}/microformats-reltag/src/java"/> > > <packageset dir="${plugins.dir}/parse-ext/src/java"/> > > @@ -604,6 +607,7 @@ > > <packageset dir="${plugins.dir}/protocol-ftp/src/java"/> > > <packageset dir="${plugins.dir}/protocol-http/src/java"/> > > <packageset dir="${plugins.dir}/protocol-httpclient/src/java"/> > > + <packageset dir="${plugins.dir}/protocol-selenium/src/java"/> > > <packageset dir="${plugins.dir}/scoring-depth/src/java"/> > > <packageset dir="${plugins.dir}/scoring-link/src/java"/> > > <packageset dir="${plugins.dir}/scoring-opic/src/java"/> > > @@ -985,6 +989,8 @@ > > <source path="${plugins.dir}/language-identifier/src/test/" /> > > <source path="${plugins.dir}/lib-http/src/java/" /> > > <source path="${plugins.dir}/lib-http/src/test/" /> > > + <source path="${plugins.dir}/lib-selenium/src/java/" /> > > + <source path="${plugins.dir}/lib-selenium/src/test/" /> > > <source path="${plugins.dir}/lib-regex-filter/src/java/" /> > > <source path="${plugins.dir}/lib-regex-filter/src/test/" /> > > <source path="${plugins.dir}/microformats-reltag/src/java/" /> > > @@ -1008,6 +1014,8 @@ > > <source path="${plugins.dir}/protocol-httpclient/src/test/" /> > > <source path="${plugins.dir}/protocol-http/src/java/" /> > > <source path="${plugins.dir}/protocol-http/src/test/" /> > > + <source path="${plugins.dir}/protocol-selenium/src/java"/> > > + <source path="${plugins.dir}/protocol-selenium/src/test"/> > > <source path="${plugins.dir}/scoring-depth/src/java/" /> > > <source path="${plugins.dir}/scoring-link/src/java/" /> > > <source path="${plugins.dir}/scoring-opic/src/java/" /> > > > > Modified: nutch/trunk/ivy/ivy.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/ivy/ivy.xml?rev=1662530&r1=1662529&r2=1662530&view=diff > > > ============================================================================== > > --- nutch/trunk/ivy/ivy.xml (original) > > +++ nutch/trunk/ivy/ivy.xml Thu Feb 26 18:31:39 2015 > > @@ -23,24 +23,24 @@ > > database etc. > > </description> > > </info> > > - > > + > > <configurations> > > <include file="${basedir}/ivy/ivy-configurations.xml" /> > > </configurations> > > - > > + > > <publications> > > <!--get the artifact from our module name --> > > <artifact conf="master" /> > > </publications> > > - > > + > > <dependencies> > > <dependency org="org.slf4j" name="slf4j-api" rev="1.6.1" > > conf="*->master" /> > > <dependency org="org.slf4j" name="slf4j-log4j12" rev="1.6.1" > > conf="*->master" /> > > - > > + > > <dependency org="log4j" name="log4j" rev="1.2.15" > conf="*->master" /> > > - > > + > > <dependency org="commons-lang" name="commons-lang" rev="2.6" > > conf="*->default" /> > > <dependency org="commons-collections" name="commons-collections" > > @@ -49,7 +49,7 @@ > > rev="3.1" conf="*->master" /> > > <dependency org="commons-codec" name="commons-codec" rev="1.3" > > conf="*->default" /> > > - > > + > > <dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.0" > > conf="*->default"> > > <exclude org="hsqldb" name="hsqldb" /> > > > > Modified: nutch/trunk/src/plugin/build.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/build.xml?rev=1662530&r1=1662529&r2=1662530&view=diff > > > ============================================================================== > > --- nutch/trunk/src/plugin/build.xml (original) > > +++ nutch/trunk/src/plugin/build.xml Thu Feb 26 18:31:39 2015 > > @@ -50,6 +50,8 @@ > > <ant dir="protocol-ftp" target="deploy"/> > > <ant dir="protocol-http" target="deploy"/> > > <ant dir="protocol-httpclient" target="deploy"/> > > + <ant dir="lib-selenium" target="deploy"/> > > + <ant dir="protocol-selenium" target="deploy" /> > > <ant dir="parse-ext" target="deploy"/> > > <ant dir="parse-js" target="deploy"/> > > <ant dir="parse-html" target="deploy"/> > > @@ -149,6 +151,8 @@ > > <ant dir="protocol-ftp" target="clean"/> > > <ant dir="protocol-http" target="clean"/> > > <ant dir="protocol-httpclient" target="clean"/> > > + <ant dir="lib-selenium" target="clean"/> > > + <ant dir="protocol-selenium" target="clean" /> > > <ant dir="parse-ext" target="clean"/> > > <ant dir="parse-js" target="clean"/> > > <ant dir="parse-html" target="clean"/> > > > > Added: nutch/trunk/src/plugin/lib-selenium/build.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/lib-selenium/build.xml?rev=1662530&view=auto > > > ============================================================================== > > --- nutch/trunk/src/plugin/lib-selenium/build.xml (added) > > +++ nutch/trunk/src/plugin/lib-selenium/build.xml Thu Feb 26 18:31:39 > 2015 > > @@ -0,0 +1,28 @@ > > +<?xml version="1.0"?> > > +<!-- > > + Licensed to the Apache Software Foundation (ASF) under one or more > > + contributor license agreements. See the NOTICE file distributed with > > + this work for additional information regarding copyright ownership. > > + The ASF licenses this file to You under the Apache License, Version 2.0 > > + (the "License"); you may not use this file except in compliance with > > + the License. You may obtain a copy of the License at > > + > > + http://www.apache.org/licenses/LICENSE-2.0 > > + > > + Unless required by applicable law or agreed to in writing, software > > + distributed under the License is distributed on an "AS IS" BASIS, > > + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + See the License for the specific language governing permissions and > > + limitations under the License. > > +--> > > +<project name="lib-selenium" default="jar-core"> > > + > > + <import file="../build-plugin.xml"/> > > + > > + <!-- Add compilation dependencies to classpath --> > > + <path id="plugin.deps"> > > + <fileset dir="${nutch.root}/build"> > > + <include name="**/lib-http/*.jar" /> > > + </fileset> > > + </path> > > +</project> > > > > Added: nutch/trunk/src/plugin/lib-selenium/ivy.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/lib-selenium/ivy.xml?rev=1662530&view=auto > > > ============================================================================== > > --- nutch/trunk/src/plugin/lib-selenium/ivy.xml (added) > > +++ nutch/trunk/src/plugin/lib-selenium/ivy.xml Thu Feb 26 18:31:39 2015 > > @@ -0,0 +1,48 @@ > > +<?xml version="1.0" ?> > > + > > +<!-- > > + Licensed to the Apache Software Foundation (ASF) under one or more > > + contributor license agreements. See the NOTICE file distributed with > > + this work for additional information regarding copyright ownership. > > + The ASF licenses this file to You under the Apache License, Version > 2.0 > > + (the "License"); you may not use this file except in compliance with > > + the License. You may obtain a copy of the License at > > + > > + http://www.apache.org/licenses/LICENSE-2.0 > > + > > + Unless required by applicable law or agreed to in writing, software > > + distributed under the License is distributed on an "AS IS" BASIS, > > + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + See the License for the specific language governing permissions and > > + limitations under the License. > > +--> > > + > > +<ivy-module version="1.0"> > > + <info organisation="org.apache.nutch" module="${ant.project.name}"> > > + <license name="Apache 2.0"/> > > + <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/> > > + <description> > > + Apache Nutch > > + </description> > > + </info> > > + > > + <configurations> > > + <include file="../../..//ivy/ivy-configurations.xml"/> > > + </configurations> > > + > > + <publications> > > + <!--get the artifact from our module name--> > > + <artifact conf="master"/> > > + </publications> > > + > > + <dependencies> > > + <!-- begin selenium dependencies --> > > + <dependency org="org.seleniumhq.selenium" name="selenium-java" > rev="2.44.0" /> > > + > > + <dependency org="com.opera" name="operadriver" rev="1.5"> > > + <exclude org="org.seleniumhq.selenium" > name="selenium-remote-driver" /> > > + </dependency> > > + <!-- end selenium dependencies --> > > + </dependencies> > > + > > +</ivy-module> > > > > Added: nutch/trunk/src/plugin/lib-selenium/plugin.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/lib-selenium/plugin.xml?rev=1662530&view=auto > > > ============================================================================== > > --- nutch/trunk/src/plugin/lib-selenium/plugin.xml (added) > > +++ nutch/trunk/src/plugin/lib-selenium/plugin.xml Thu Feb 26 18:31:39 > 2015 > > @@ -0,0 +1,42 @@ > > +<?xml version="1.0" encoding="UTF-8"?> > > +<!-- > > + Licensed to the Apache Software Foundation (ASF) under one or more > > + contributor license agreements. See the NOTICE file distributed with > > + this work for additional information regarding copyright ownership. > > + The ASF licenses this file to You under the Apache License, Version 2.0 > > + (the "License"); you may not use this file except in compliance with > > + the License. You may obtain a copy of the License at > > + > > + http://www.apache.org/licenses/LICENSE-2.0 > > + > > + Unless required by applicable law or agreed to in writing, software > > + distributed under the License is distributed on an "AS IS" BASIS, > > + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + See the License for the specific language governing permissions and > > + limitations under the License. > > +--> > > +<!-- > > + ! A common framework for http protocol implementations > > + !--> > > +<plugin > > + id="lib-selenium" > > + name="HTTP Framework" > > + version="1.0" > > + provider-name="org.apache.nutch"> > > + > > + <runtime> > > + <library name="lib-selenium.jar"> > > + <export name="*"/> > > + </library> > > + </runtime> > > + > > + <requires> > > + <library name="selenium-java-2.4.0.jar"> > > + <export name="*"/> > > + </library> > > + <library name="operadriver-1.5.jar"> > > + <export name="*"/> > > + <exclude name="selenium-remote-driver" /> > > + </library> > > + </requires> > > +</plugin> > > > > Added: > nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java?rev=1662530&view=auto > > > ============================================================================== > > --- > nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java > (added) > > +++ > nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java > Thu Feb 26 18:31:39 2015 > > @@ -0,0 +1,78 @@ > > +/** > > + * Licensed to the Apache Software Foundation (ASF) under one or more > > + * contributor license agreements. See the NOTICE file distributed with > > + * this work for additional information regarding copyright ownership. > > + * The ASF licenses this file to You under the Apache License, Version > 2.0 > > + * (the "License"); you may not use this file except in compliance with > > + * the License. You may obtain a copy of the License at > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. > > + */ > > +package org.apache.nutch.protocol.selenium; > > + > > +import org.apache.hadoop.conf.Configuration; > > +import org.slf4j.Logger; > > +import org.slf4j.LoggerFactory; > > +import org.openqa.selenium.By; > > +import org.openqa.selenium.WebDriver; > > +import org.openqa.selenium.firefox.FirefoxDriver; > > +import org.openqa.selenium.firefox.FirefoxProfile; > > +import org.openqa.selenium.support.ui.WebDriverWait; > > + > > +import java.lang.String; > > + > > +public class HttpWebClient { > > + > > + private static final Logger LOG = > LoggerFactory.getLogger("org.apache.nutch.protocol"); > > + > > + public static ThreadLocal<WebDriver> threadWebDriver = new > ThreadLocal<WebDriver>() { > > + > > + @Override > > + protected WebDriver initialValue() > > + { > > + FirefoxProfile profile = new FirefoxProfile(); > > + profile.setPreference("permissions.default.stylesheet", 2); > > + profile.setPreference("permissions.default.image", 2); > > + profile.setPreference("dom.ipc.plugins.enabled.libflashplayer.so", > "false"); > > + WebDriver driver = new FirefoxDriver(profile); > > + return driver; > > + }; > > + }; > > + > > + public static String getHtmlPage(String url, Configuration conf) { > > + WebDriver driver = null; > > + > > + try { > > + driver = new FirefoxDriver(); > > + //} WebDriver driver = threadWebDriver.get(); > > + // if (driver == null) { > > + // driver = new FirefoxDriver(); > > + // } > > + > > + driver.get(url); > > + > > + // Wait for the page to load, timeout after 3 seconds > > + new WebDriverWait(driver, 3); > > + > > + String innerHtml = > driver.findElement(By.tagName("body")).getAttribute("innerHTML"); > > + > > + return innerHtml; > > + > > + // I'm sure this catch statement is a code smell ; borrowing it > from lib-htmlunit > > + } catch (Exception e) { > > + throw new RuntimeException(e); > > + } finally { > > + if (driver != null) try { driver.quit(); } catch (Exception e) { > throw new RuntimeException(e); } > > + } > > + }; > > + > > + public static String getHtmlPage(String url) { > > + return getHtmlPage(url, null); > > + } > > +} > > \ No newline at end of file > > > > Added: nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml?rev=1662530&view=auto > > > ============================================================================== > > --- nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml (added) > > +++ nutch/trunk/src/plugin/protocol-selenium/build-ivy.xml Thu Feb 26 > 18:31:39 2015 > > @@ -0,0 +1,54 @@ > > +<?xml version="1.0"?> > > +<!-- > > + Licensed to the Apache Software Foundation (ASF) under one or more > > + contributor license agreements. See the NOTICE file distributed with > > + this work for additional information regarding copyright ownership. > > + The ASF licenses this file to You under the Apache License, Version 2.0 > > + (the "License"); you may not use this file except in compliance with > > + the License. You may obtain a copy of the License at > > + > > + http://www.apache.org/licenses/LICENSE-2.0 > > + > > + Unless required by applicable law or agreed to in writing, software > > + distributed under the License is distributed on an "AS IS" BASIS, > > + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + See the License for the specific language governing permissions and > > + limitations under the License. > > +--> > > +<project name="protocol-selenium" default="deps-jar" > xmlns:ivy="antlib:org.apache.ivy.ant"> > > + > > + <property name="ivy.install.version" value="2.1.0" /> > > + <condition property="ivy.home" value="${env.IVY_HOME}"> > > + <isset property="env.IVY_HOME" /> > > + </condition> > > + <property name="ivy.home" value="${user.home}/.ant" /> > > + <property name="ivy.checksums" value="" /> > > + <property name="ivy.jar.dir" value="${ivy.home}/lib" /> > > + <property name="ivy.jar.file" value="${ivy.jar.dir}/ivy.jar" /> > > + > > + <target name="download-ivy" unless="offline"> > > + > > + <mkdir dir="${ivy.jar.dir}"/> > > + <!-- download Ivy from web site so that it can be used even > without any special installation --> > > + <get src=" > http://repo2.maven.org/maven2/org/apache/ivy/ivy/${ivy.install.version}/ivy-${ivy.install.version}.jar > " > > + dest="${ivy.jar.file}" usetimestamp="true"/> > > + </target> > > + > > + <target name="init-ivy" depends="download-ivy"> > > + <!-- try to load ivy here from ivy home, in case the user has not > already dropped > > + it into ant's lib dir (note that the latter copy will > always take precedence). > > + We will not fail as long as local lib dir exists (it may > be empty) and > > + ivy is in at least one of ant's lib dir or the local lib > dir. --> > > + <path id="ivy.lib.path"> > > + <fileset dir="${ivy.jar.dir}" includes="*.jar"/> > > + > > + </path> > > + <taskdef resource="org/apache/ivy/ant/antlib.xml" > > + uri="antlib:org.apache.ivy.ant" > classpathref="ivy.lib.path"/> > > + </target> > > + > > + <target name="deps-jar" depends="init-ivy"> > > + <ivy:retrieve pattern="lib/[artifact]-[revision].[ext]"/> > > + </target> > > + > > +</project> > > > > Added: nutch/trunk/src/plugin/protocol-selenium/build.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-selenium/build.xml?rev=1662530&view=auto > > > ============================================================================== > > --- nutch/trunk/src/plugin/protocol-selenium/build.xml (added) > > +++ nutch/trunk/src/plugin/protocol-selenium/build.xml Thu Feb 26 > 18:31:39 2015 > > @@ -0,0 +1,36 @@ > > +<?xml version="1.0"?> > > +<!-- > > + Licensed to the Apache Software Foundation (ASF) under one or more > > + contributor license agreements. See the NOTICE file distributed with > > + this work for additional information regarding copyright ownership. > > + The ASF licenses this file to You under the Apache License, Version 2.0 > > + (the "License"); you may not use this file except in compliance with > > + the License. You may obtain a copy of the License at > > + > > + http://www.apache.org/licenses/LICENSE-2.0 > > + > > + Unless required by applicable law or agreed to in writing, software > > + distributed under the License is distributed on an "AS IS" BASIS, > > + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + See the License for the specific language governing permissions and > > + limitations under the License. > > +--> > > +<project name="protocol-selenium" default="jar-core"> > > + > > + <import file="../build-plugin.xml"/> > > + > > + <!-- Build compilation dependencies --> > > + <target name="deps-jar"> > > + <ant target="jar" inheritall="false" dir="../lib-http"/> > > + <ant target="jar" inheritall="false" dir="../lib-selenium"/> > > + </target> > > + > > + <!-- Add compilation dependencies to classpath --> > > + <path id="plugin.deps"> > > + <fileset dir="${nutch.root}/build"> > > + <include name="**/lib-http/*.jar" /> > > + <include name="**/lib-selenium/*.jar" /> > > + </fileset> > > + </path> > > + > > +</project> > > > > Added: nutch/trunk/src/plugin/protocol-selenium/ivy.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-selenium/ivy.xml?rev=1662530&view=auto > > > ============================================================================== > > --- nutch/trunk/src/plugin/protocol-selenium/ivy.xml (added) > > +++ nutch/trunk/src/plugin/protocol-selenium/ivy.xml Thu Feb 26 18:31:39 > 2015 > > @@ -0,0 +1,48 @@ > > +<?xml version="1.0" ?> > > + > > +<!-- > > + Licensed to the Apache Software Foundation (ASF) under one or more > > + contributor license agreements. See the NOTICE file distributed with > > + this work for additional information regarding copyright ownership. > > + The ASF licenses this file to You under the Apache License, Version > 2.0 > > + (the "License"); you may not use this file except in compliance with > > + the License. You may obtain a copy of the License at > > + > > + http://www.apache.org/licenses/LICENSE-2.0 > > + > > + Unless required by applicable law or agreed to in writing, software > > + distributed under the License is distributed on an "AS IS" BASIS, > > + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + See the License for the specific language governing permissions and > > + limitations under the License. > > +--> > > + > > +<ivy-module version="1.0"> > > + <info organisation="org.apache.nutch" module="${ant.project.name}"> > > + <license name="Apache 2.0"/> > > + <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/> > > + <description> > > + Apache Nutch > > + </description> > > + </info> > > + > > + <configurations> > > + <include file="../../..//ivy/ivy-configurations.xml"/> > > + </configurations> > > + > > + <publications> > > + <!--get the artifact from our module name--> > > + <artifact conf="default"/> > > + </publications> > > + > > + <dependencies> > > + <!-- begin selenium dependencies --> > > + <dependency org="org.seleniumhq.selenium" name="selenium-java" > rev="2.44.0" /> > > + > > + <dependency org="com.opera" name="operadriver" rev="1.5"> > > + <exclude org="org.seleniumhq.selenium" > name="selenium-remote-driver" /> > > + </dependency> > > + <!-- end selenium dependencies --> > > + </dependencies> > > + > > +</ivy-module> > > > > Added: nutch/trunk/src/plugin/protocol-selenium/plugin.xml > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-selenium/plugin.xml?rev=1662530&view=auto > > > ============================================================================== > > --- nutch/trunk/src/plugin/protocol-selenium/plugin.xml (added) > > +++ nutch/trunk/src/plugin/protocol-selenium/plugin.xml Thu Feb 26 > 18:31:39 2015 > > @@ -0,0 +1,90 @@ > > +<?xml version="1.0" encoding="UTF-8"?> > > +<!-- > > + Licensed to the Apache Software Foundation (ASF) under one or more > > + contributor license agreements. See the NOTICE file distributed with > > + this work for additional information regarding copyright ownership. > > + The ASF licenses this file to You under the Apache License, Version 2.0 > > + (the "License"); you may not use this file except in compliance with > > + the License. You may obtain a copy of the License at > > + > > + http://www.apache.org/licenses/LICENSE-2.0 > > + > > + Unless required by applicable law or agreed to in writing, software > > + distributed under the License is distributed on an "AS IS" BASIS, > > + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + See the License for the specific language governing permissions and > > + limitations under the License. > > +--> > > +<plugin > > + id="protocol-selenium" > > + name="Http Protocol Plug-in" > > + version="1.0.0" > > + provider-name="nutch.org"> > > + > > + <runtime> > > + <library name="protocol-selenium.jar"> > > + <export name="*"/> > > + </library> > > + <library name="cglib-nodep-2.1_3.jar"/> > > + <library name="commons-codec-1.9.jar"/> > > + <library name="commons-collections-3.2.1.jar"/> > > + <library name="commons-exec-1.1.jar"/> > > + <library name="commons-io-2.4.jar"/> > > + <library name="commons-jxpath-1.3.jar"/> > > + <library name="commons-lang3-3.3.2.jar"/> > > + <library name="commons-logging-1.1.3.jar"/> > > + <library name="cssparser-0.9.14.jar"/> > > + <library name="gson-2.3.jar"/> > > + <library name="guava-18.0.jar"/> > > + <library name="htmlunit-2.15.jar"/> > > + <library name="htmlunit-core-js-2.15.jar"/> > > + <library name="httpclient-4.3.4.jar"/> > > + <library name="httpcore-4.3.2.jar"/> > > + <library name="httpmime-4.3.3.jar"/> > > + <library name="ini4j-0.5.2.jar"/> > > + <library name="jetty-http-8.1.15.v20140411.jar"/> > > + <library name="jetty-io-8.1.15.v20140411.jar"/> > > + <library name="jetty-util-8.1.15.v20140411.jar"/> > > + <library name="jetty-websocket-8.1.15.v20140411.jar"/> > > + <library name="jna-3.4.0.jar"/> > > + <library name="nekohtml-1.9.21.jar"/> > > + <library name="netty-3.5.2.Final.jar"/> > > + <library name="operadriver-1.5.jar"/> > > + <library name="operalaunchers-1.1.jar"/> > > + <library name="platform-3.4.0.jar"/> > > + <library name="protobuf-java-2.4.1.jar"/> > > + <library name="sac-1.3.jar"/> > > + <library name="selenium-api-2.44.0.jar"/> > > + <library name="selenium-chrome-driver-2.44.0.jar"/> > > + <library name="selenium-firefox-driver-2.44.0.jar"/> > > + <library name="selenium-htmlunit-driver-2.44.0.jar"/> > > + <library name="selenium-ie-driver-2.44.0.jar"/> > > + <library name="selenium-java-2.44.0.jar"/> > > + <library name="selenium-remote-driver-2.44.0.jar"/> > > + <library name="selenium-safari-driver-2.44.0.jar"/> > > + <library name="selenium-support-2.44.0.jar"/> > > + <library name="serializer-2.7.1.jar"/> > > + <library name="webbit-0.4.14.jar"/> > > + <library name="xalan-2.7.1.jar"/> > > + <library name="xercesImpl-2.11.0.jar"/> > > + <library name="xml-apis-1.4.01.jar"/> > > + </runtime> > > + > > + <requires> > > + <import plugin="nutch-extensionpoints"/> > > + <import plugin="lib-http"/> > > + <import plugin="lib-selenium"/> > > + </requires> > > + > > + <extension id="org.apache.nutch.protocol.selenium" > > + name="HttpProtocol" > > + point="org.apache.nutch.protocol.Protocol"> > > + > > + <implementation id="org.apache.nutch.protocol.selenium.Http" > > + class="org.apache.nutch.protocol.selenium.Http"> > > + <parameter name="protocolName" value="http"/> > > + </implementation> > > + > > + </extension> > > + > > +</plugin> > > > > Added: > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java?rev=1662530&view=auto > > > ============================================================================== > > --- > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java > (added) > > +++ > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/Http.java > Thu Feb 26 18:31:39 2015 > > @@ -0,0 +1,59 @@ > > +/** > > + * Licensed to the Apache Software Foundation (ASF) under one or more > > + * contributor license agreements. See the NOTICE file distributed with > > + * this work for additional information regarding copyright ownership. > > + * The ASF licenses this file to You under the Apache License, Version > 2.0 > > + * (the "License"); you may not use this file except in compliance with > > + * the License. You may obtain a copy of the License at > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. > > + */ > > +package org.apache.nutch.protocol.selenium; > > + > > +// JDK imports > > +import java.io.IOException; > > +import java.net.URL; > > +import org.apache.hadoop.conf.Configuration; > > +import org.apache.nutch.crawl.CrawlDatum; > > +import org.apache.nutch.net.protocols.Response; > > +import org.apache.nutch.protocol.http.api.HttpBase; > > +import org.apache.nutch.protocol.ProtocolException; > > +import org.apache.nutch.util.NutchConfiguration; > > + > > +import org.apache.nutch.protocol.selenium.HttpResponse; > > + > > +import org.slf4j.Logger; > > +import org.slf4j.LoggerFactory; > > + > > +public class Http extends HttpBase { > > + > > + public static final Logger LOG = LoggerFactory.getLogger(Http.class); > > + > > + public Http() { > > + super(LOG); > > + } > > + > > + @Override > > + public void setConf(Configuration conf) { > > + super.setConf(conf); > > + } > > + > > + public static void main(String[] args) throws Exception { > > + Http http = new Http(); > > + http.setConf(NutchConfiguration.create()); > > + main(http, args); > > + } > > + > > + @Override > > + protected Response getResponse(URL url, CrawlDatum datum, boolean > redirect) > > + throws ProtocolException, IOException { > > + return new HttpResponse(this, url, datum); > > + } > > + > > +} > > > > Added: > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java?rev=1662530&view=auto > > > ============================================================================== > > --- > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java > (added) > > +++ > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java > Thu Feb 26 18:31:39 2015 > > @@ -0,0 +1,360 @@ > > +/** > > + * Licensed to the Apache Software Foundation (ASF) under one or more > > + * contributor license agreements. See the NOTICE file distributed with > > + * this work for additional information regarding copyright ownership. > > + * The ASF licenses this file to You under the Apache License, Version > 2.0 > > + * (the "License"); you may not use this file except in compliance with > > + * the License. You may obtain a copy of the License at > > + * > > + * http://www.apache.org/licenses/LICENSE-2.0 > > + * > > + * Unless required by applicable law or agreed to in writing, software > > + * distributed under the License is distributed on an "AS IS" BASIS, > > + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or > implied. > > + * See the License for the specific language governing permissions and > > + * limitations under the License. > > + */ > > +package org.apache.nutch.protocol.selenium; > > + > > +// JDK imports > > +import java.io.BufferedInputStream; > > +import java.io.EOFException; > > +import java.io.IOException; > > +import java.io.OutputStream; > > +import java.io.ByteArrayOutputStream; > > +import java.io.PushbackInputStream; > > +import java.net.InetSocketAddress; > > +import java.net.Socket; > > +import java.net.URL; > > + > > +import org.apache.hadoop.conf.Configuration; > > +import org.apache.nutch.crawl.CrawlDatum; > > +import org.apache.nutch.metadata.Metadata; > > +import org.apache.nutch.metadata.SpellCheckedMetadata; > > +import org.apache.nutch.net.protocols.HttpDateFormat; > > +import org.apache.nutch.net.protocols.Response; > > +import org.apache.nutch.protocol.ProtocolException; > > +import org.apache.nutch.protocol.http.api.HttpException; > > +import org.apache.nutch.protocol.http.api.HttpBase; > > + > > +/* Most of this code was borrowed from protocol-htmlunit; which in turn > borrowed it from protocol-httpclient */ > > + > > +public class HttpResponse implements Response { > > + > > + private Http http; > > + private URL url; > > + private String orig; > > + private String base; > > + private byte[] content; > > + private int code; > > + private Metadata headers = new SpellCheckedMetadata(); > > + > > + /** The nutch configuration */ > > + private Configuration conf = null; > > + > > + public HttpResponse(Http http, URL url, CrawlDatum datum) throws > ProtocolException, IOException { > > + > > + this.conf = http.getConf(); > > + this.http = http; > > + this.url = url; > > + this.orig = url.toString(); > > + this.base = url.toString(); > > + > > + if (!"http".equals(url.getProtocol())) > > + throw new HttpException("Not an HTTP url:" + url); > > + > > + if (Http.LOG.isTraceEnabled()) { > > + Http.LOG.trace("fetching " + url); > > + } > > + > > + String path = "".equals(url.getFile()) ? "/" : url.getFile(); > > + > > + // some servers will redirect a request with a host line like > > + // "Host: <hostname>:80" to "http://<hpstname>/<orig_path>"- they > > + // don't want the :80... > > + > > + String host = url.getHost(); > > + int port; > > + String portString; > > + if (url.getPort() == -1) { > > + port = 80; > > + portString = ""; > > + } else { > > + port = url.getPort(); > > + portString = ":" + port; > > + } > > + Socket socket = null; > > + > > + try { > > + socket = new Socket(); // create the socket > > + socket.setSoTimeout(http.getTimeout()); > > + > > + // connect > > + String sockHost = http.useProxy() ? http.getProxyHost() : host; > > + int sockPort = http.useProxy() ? http.getProxyPort() : port; > > + InetSocketAddress sockAddr = new InetSocketAddress(sockHost, > sockPort); > > + socket.connect(sockAddr, http.getTimeout()); > > + > > + // make request > > + OutputStream req = socket.getOutputStream(); > > + > > + StringBuffer reqStr = new StringBuffer("GET "); > > + if (http.useProxy()) { > > + reqStr.append(url.getProtocol() + "://" + host + portString + > path); > > + } else { > > + reqStr.append(path); > > + } > > + > > + reqStr.append(" HTTP/1.0\r\n"); > > + > > + reqStr.append("Host: "); > > + reqStr.append(host); > > + reqStr.append(portString); > > + reqStr.append("\r\n"); > > + > > + reqStr.append("Accept-Encoding: x-gzip, gzip, deflate\r\n"); > > + > > + String userAgent = http.getUserAgent(); > > + if ((userAgent == null) || (userAgent.length() == 0)) { > > + if (Http.LOG.isErrorEnabled()) { > > + Http.LOG.error("User-agent is not set!"); > > + } > > + } else { > > + reqStr.append("User-Agent: "); > > + reqStr.append(userAgent); > > + reqStr.append("\r\n"); > > + } > > + > > + reqStr.append("Accept-Language: "); > > + reqStr.append(this.http.getAcceptLanguage()); > > + reqStr.append("\r\n"); > > + > > + reqStr.append("Accept: "); > > + reqStr.append(this.http.getAccept()); > > + reqStr.append("\r\n"); > > + > > + if (datum.getModifiedTime() > 0) { > > + reqStr.append("If-Modified-Since: " + > HttpDateFormat.toString(datum.getModifiedTime())); > > + reqStr.append("\r\n"); > > + } > > + reqStr.append("\r\n"); > > + > > + byte[] reqBytes = reqStr.toString().getBytes(); > > + > > + req.write(reqBytes); > > + req.flush(); > > + > > + PushbackInputStream in = // process response > > + new PushbackInputStream(new > BufferedInputStream(socket.getInputStream(), Http.BUFFER_SIZE), > > + Http.BUFFER_SIZE); > > + > > + StringBuffer line = new StringBuffer(); > > + > > + boolean haveSeenNonContinueStatus = false; > > + while (!haveSeenNonContinueStatus) { > > + // parse status code line > > + this.code = parseStatusLine(in, line); > > + // parse headers > > + parseHeaders(in, line); > > + haveSeenNonContinueStatus = code != 100; // 100 is "Continue" > > + } > > + > > + // Get Content type header > > + String contentType = getHeader(Response.CONTENT_TYPE); > > + > > + // handle with Selenium only if content type in HTML or XHTML > > + if (contentType != null) { > > + if (contentType.contains("text/html") || > contentType.contains("application/xhtml")) { > > + readPlainContent(url); > > + } else { > > + try { > > + int contentLength = Integer.MAX_VALUE; > > + String contentLengthString = > headers.get(Response.CONTENT_LENGTH); > > + if (contentLengthString != null) { > > + try { > > + contentLength = > Integer.parseInt(contentLengthString.trim()); > > + } catch (NumberFormatException ex) { > > + throw new HttpException("bad content length: " + > contentLengthString); > > + } > > + } > > + > > + if (http.getMaxContent() >= 0 && contentLength > > http.getMaxContent()) { > > + contentLength = http.getMaxContent(); > > + } > > + > > + byte[] buffer = new byte[HttpBase.BUFFER_SIZE]; > > + int bufferFilled = 0; > > + int totalRead = 0; > > + ByteArrayOutputStream out = new ByteArrayOutputStream(); > > + while ((bufferFilled = in.read(buffer, 0, buffer.length)) > != -1 > > + && totalRead + bufferFilled <= contentLength) { > > + totalRead += bufferFilled; > > + out.write(buffer, 0, bufferFilled); > > + } > > + > > + content = out.toByteArray(); > > + > > + } catch (Exception e) { > > + if (code == 200) > > + throw new IOException(e.toString()); > > + // for codes other than 200 OK, we are fine with empty > content > > + } finally { > > + if (in != null) { > > + in.close(); > > + } > > + } > > + } > > + } > > + > > + } finally { > > + if (socket != null) > > + socket.close(); > > + } > > + } > > + > > + /* ------------------------- * > > + * <implementation:Response> * > > + * ------------------------- */ > > + > > + public URL getUrl() { > > + return url; > > + } > > + > > + public int getCode() { > > + return code; > > + } > > + > > + public String getHeader(String name) { > > + return headers.get(name); > > + } > > + > > + public Metadata getHeaders() { > > + return headers; > > + } > > + > > + public byte[] getContent() { > > + return content; > > + } > > + > > + /* ------------------------- * > > + * <implementation:Response> * > > + * ------------------------- */ > > + > > + private void readPlainContent(URL url) throws IOException { > > + String page = HttpWebClient.getHtmlPage(url.toString(), conf); > > + > > + content = page.getBytes("UTF-8"); > > + } > > + > > + private int parseStatusLine(PushbackInputStream in, StringBuffer > line) throws IOException, HttpException { > > + readLine(in, line, false); > > + > > + int codeStart = line.indexOf(" "); > > + int codeEnd = line.indexOf(" ", codeStart + 1); > > + > > + // handle lines with no plaintext result code, ie: > > + // "HTTP/1.1 200" vs "HTTP/1.1 200 OK" > > + if (codeEnd == -1) > > + codeEnd = line.length(); > > + > > + int code; > > + try { > > + code = Integer.parseInt(line.substring(codeStart + 1, codeEnd)); > > + } catch (NumberFormatException e) { > > + throw new HttpException("bad status line '" + line + "': " + > e.getMessage(), e); > > + } > > + > > + return code; > > + } > > + > > + private void processHeaderLine(StringBuffer line) throws IOException, > HttpException { > > + > > + int colonIndex = line.indexOf(":"); // key is up to colon > > + if (colonIndex == -1) { > > + int i; > > + for (i = 0; i < line.length(); i++) > > + if (!Character.isWhitespace(line.charAt(i))) > > + break; > > + if (i == line.length()) > > + return; > > + throw new HttpException("No colon in header:" + line); > > + } > > + String key = line.substring(0, colonIndex); > > + > > + int valueStart = colonIndex + 1; // skip whitespace > > + while (valueStart < line.length()) { > > + int c = line.charAt(valueStart); > > + if (c != ' ' && c != '\t') > > + break; > > + valueStart++; > > + } > > + String value = line.substring(valueStart); > > + headers.set(key, value); > > + } > > + > > + // Adds headers to our headers Metadata > > + private void parseHeaders(PushbackInputStream in, StringBuffer line) > throws IOException, HttpException { > > + > > + while (readLine(in, line, true) != 0) { > > + > > + // handle HTTP responses with missing blank line after headers > > + int pos; > > + if (((pos = line.indexOf("<!DOCTYPE")) != -1) || ((pos = > line.indexOf("<HTML")) != -1) > > + || ((pos = line.indexOf("<html")) != -1)) { > > + > > + in.unread(line.substring(pos).getBytes("UTF-8")); > > + line.setLength(pos); > > + > > + try { > > + //TODO: (CM) We don't know the header names here > > + //since we're just handling them generically. It would > > + //be nice to provide some sort of mapping function here > > + //for the returned header names to the standard metadata > > + //names in the ParseData class > > + processHeaderLine(line); > > + } catch (Exception e) { > > + // fixme: > > + Http.LOG.warn("Error: ", e); > > + } > > + return; > > + } > > + > > + processHeaderLine(line); > > + } > > + } > > + > > + private static int readLine(PushbackInputStream in, StringBuffer > line, boolean allowContinuedLine) > > + throws IOException { > > + line.setLength(0); > > + for (int c = in.read(); c != -1; c = in.read()) { > > + switch (c) { > > + case '\r': > > + if (peek(in) == '\n') { > > + in.read(); > > + } > > + case '\n': > > + if (line.length() > 0) { > > + // at EOL -- check for continued line if the current > > + // (possibly continued) line wasn't blank > > + if (allowContinuedLine) > > + switch (peek(in)) { > > + case ' ': > > + case '\t': // line is continued > > + in.read(); > > + continue; > > + } > > + } > > + return line.length(); // else complete > > + default: > > + line.append((char) c); > > + } > > + } > > + throw new EOFException(); > > + } > > + > > + private static int peek(PushbackInputStream in) throws IOException { > > + int value = in.read(); > > + in.unread(value); > > + return value; > > + } > > +} > > > > Added: > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html?rev=1662530&view=auto > > > ============================================================================== > > --- > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html > (added) > > +++ > nutch/trunk/src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/package.html > Thu Feb 26 18:31:39 2015 > > @@ -0,0 +1,5 @@ > > +<html> > > +<body> > > +<p>Protocol plugin which supports retrieving documents via > selenium.</p><p></p> > > +</body> > > +</html> > > > > Added: > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html > > URL: > http://svn.apache.org/viewvc/nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html?rev=1662530&view=auto > > > ============================================================================== > > --- > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html > (added) > > +++ > nutch/trunk/src/plugin/protocol-selenium/src/target/classes/org/apache/nutch/protocol/htmlunit/package.html > Thu Feb 26 18:31:39 2015 > > @@ -0,0 +1,5 @@ > > +<html> > > +<body> > > +<p>Protocol plugin which supports retrieving documents via the > htmlunit.</p><p></p> > > +</body> > > +</html> > > > > > -- Open Source Solutions for Text Engineering http://digitalpebble.blogspot.com/ http://www.digitalpebble.com http://twitter.com/digitalpebble

