Author: lewismc
Date: Wed Sep 23 00:59:52 2015
New Revision: 1704754
URL: http://svn.apache.org/viewvc?rev=1704754&view=rev
Log:
NUTCH-2105 Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1
Modified:
nutch/branches/2.x/CHANGES.txt
nutch/branches/2.x/docker/cassandra/README.md
nutch/branches/2.x/docker/cassandra/bin/build.sh
nutch/branches/2.x/docker/cassandra/bin/ipof.sh
nutch/branches/2.x/docker/cassandra/bin/nodes.sh
nutch/branches/2.x/docker/cassandra/bin/restart.sh
nutch/branches/2.x/docker/cassandra/bin/start.sh
nutch/branches/2.x/docker/cassandra/bin/stop.sh
nutch/branches/2.x/docker/cassandra/cassandra/Dockerfile
nutch/branches/2.x/docker/cassandra/cassandra/bootstrap.sh
nutch/branches/2.x/docker/cassandra/nutch/Dockerfile
nutch/branches/2.x/docker/cassandra/nutch/bootstrap.sh
nutch/branches/2.x/docker/cassandra/nutch/config/nutch-site.xml
nutch/branches/2.x/docker/cassandra/nutch/testUrls/seed.txt
Modified: nutch/branches/2.x/CHANGES.txt
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/CHANGES.txt?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/CHANGES.txt (original)
+++ nutch/branches/2.x/CHANGES.txt Wed Sep 23 00:59:52 2015
@@ -2,6 +2,8 @@ Nutch Change Log
Current Development 2.4-SNAPSHOT
+* NUTCH-2105 Update Nutch Cassandra Dockerfile to work with Gora Nutch 2.3.1 (lewismc)
+
* NUTCH-1946 Upgrade to Gora 0.6.1 (lewismc, hsaputra, Jeroen Vlek)
* NUTCH-2094 Stopping and Restarting a crawl has issues in the Web UI (Prerna Satija via mattmann)
Modified: nutch/branches/2.x/docker/cassandra/README.md
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/README.md?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/README.md (original)
+++ nutch/branches/2.x/docker/cassandra/README.md Wed Sep 23 00:59:52 2015
@@ -1,13 +1,11 @@
-#Apache Nutch 2.x with Cassandra on Docker
+Apache Nutch 2.x with Cassandra on Docker
=======================
-This project is 3 Docker containers running Apache Nutch 2.x configured with Cassandra storage.
-
-Due to the lack of integration information between Nutch 2.x / Cassandra, Mohamed Meabed (@Meabed) developed these docker containers with configuration and integration between them.
+This project contains 3 Docker containers running Apache Nutch 2.x configured with [Apache Cassandra](http://cassandra.apache.org) storage.
This is project is fully operational but its still experimental, any feedback, suggestions should be directed to [email protected] and contribution(s) will be highly appreciated!
-##Usage notes:
+#Usage
1. Build the images and start the containers " NOTE: for Mac OS running boot2docker, Please read the Notes section Below ".
Modified: nutch/branches/2.x/docker/cassandra/bin/build.sh
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/bin/build.sh?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/bin/build.sh (original)
+++ nutch/branches/2.x/docker/cassandra/bin/build.sh Wed Sep 23 00:59:52 2015
@@ -1,8 +1,23 @@
#!/bin/sh
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
B_DIR="`pwd`/"
docker pull meabed/debian-jdk
#
-docker build -t "meabed/nutch:2.3" $B_DIR/nutch/
-docker build -t "meabed/cassandra" $B_DIR/cassandra/
+docker build -t "apache/nutch:2.x" $B_DIR/nutch/
+docker build -t "apache/cassandra" $B_DIR/cassandra/
Modified: nutch/branches/2.x/docker/cassandra/bin/ipof.sh
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/bin/ipof.sh?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/bin/ipof.sh (original)
+++ nutch/branches/2.x/docker/cassandra/bin/ipof.sh Wed Sep 23 00:59:52 2015
@@ -1,4 +1,19 @@
#!/bin/sh
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
CONTAINER=$1
docker inspect --format '{{ .NetworkSettings.IPAddress }}' $CONTAINER
Modified: nutch/branches/2.x/docker/cassandra/bin/nodes.sh
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/bin/nodes.sh?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/bin/nodes.sh (original)
+++ nutch/branches/2.x/docker/cassandra/bin/nodes.sh Wed Sep 23 00:59:52 2015
@@ -1,4 +1,19 @@
#!/bin/sh
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
function isRunning {
id=$(docker ps -a | grep $1 | awk '{print $1}')
Modified: nutch/branches/2.x/docker/cassandra/bin/restart.sh
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/bin/restart.sh?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/bin/restart.sh (original)
+++ nutch/branches/2.x/docker/cassandra/bin/restart.sh Wed Sep 23 00:59:52 2015
@@ -1,4 +1,19 @@
#!/bin/sh
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
B_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
Modified: nutch/branches/2.x/docker/cassandra/bin/start.sh
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/bin/start.sh?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/bin/start.sh (original)
+++ nutch/branches/2.x/docker/cassandra/bin/start.sh Wed Sep 23 00:59:52 2015
@@ -1,4 +1,19 @@
#!/bin/sh
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
B_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
DOCKER_DATA_FOLDER=$B_DIR/docker-data
@@ -8,11 +23,12 @@ chmod -R 777 $DOCKER_DATA_FOLDER
source "$B_DIR/nodes.sh"
source "$B_DIR/stop.sh"
-cassandraId=$(docker run -d -P -v $DOCKER_DATA_FOLDER:/data:rw --name $cassandraNodeName meabed/cassandra)
+cassandraId=$(docker run -d -P -v $DOCKER_DATA_FOLDER:/data:rw --name $cassandraNodeName apache/cassandra)
cassandraIP=$("$B_DIR"/ipof.sh $cassandraId)
# -p 9200:9200
# http://dockerhost:9200/_plugin/kopf/
# http://dockerhost:9200/_plugin/HQ/
-docker run -d -p 8899:8899 -P -e CASSANDRA_NODE_NAME=$cassandraNodeName -it --link $cassandraNodeName:$cassandraNodeName -v $DOCKER_DATA_FOLDER:/data:rw --name $nutchNodeName meabed/nutch:2.3
+docker run -d -p 8899:8899 -P -e CASSANDRA_NODE_NAME=$cassandraNodeName -it --link $cassandraNodeName:$cassandraNodeName -v $DOCKER_DATA_FOLDER:/data:rw --name $nutchNodeName apache/nutch:2.x
+# apache/nutch2cassandra
Modified: nutch/branches/2.x/docker/cassandra/bin/stop.sh
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/bin/stop.sh?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/bin/stop.sh (original)
+++ nutch/branches/2.x/docker/cassandra/bin/stop.sh Wed Sep 23 00:59:52 2015
@@ -1,4 +1,19 @@
#!/bin/sh
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
B_DIR=$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )
source "$B_DIR/nodes.sh"
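
Reviewer note: the scripts under `bin/` declare `#!/bin/sh` but rely on `BASH_SOURCE`, `source`, and (in nodes.sh) the `function` keyword, which are bash features and may fail where `/bin/sh` is a strict POSIX shell. The sketch below shows one way to locate the script directory with POSIX features only. The helper name `script_dir` is hypothetical and not part of this commit, and `$0` is only a stand-in for `BASH_SOURCE`: it does not follow files pulled in with `source`.

```shell
#!/bin/sh
# Hypothetical helper (not part of the commit): print the absolute
# directory that contains the file whose path is passed as $1.
script_dir() {
    # CDPATH is cleared so that cd cannot jump to an unrelated directory;
    # the function is meant to be called inside $(...) so the cd stays local.
    CDPATH= cd -- "$(dirname -- "$1")" && pwd
}

# "$0" is the path used to invoke this script.
B_DIR=$(script_dir "$0")
echo "scripts live in: $B_DIR"
```

With `B_DIR` resolved this way, the sibling scripts could be loaded with `. "$B_DIR/nodes.sh"` instead of `source`, assuming those files are themselves POSIX-clean.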
Modified: nutch/branches/2.x/docker/cassandra/cassandra/Dockerfile
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/cassandra/Dockerfile?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/cassandra/Dockerfile (original)
+++ nutch/branches/2.x/docker/cassandra/cassandra/Dockerfile Wed Sep 23 00:59:52 2015
@@ -1,7 +1,20 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
#
-# Cassandra
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
# meabed/debian-jdk
-# docker build -t meabed/cassandra:latest .
+# docker build -t apache/cassandra:latest .
#
# sudo sysctl -w vm.max_map_count=2621444
# sudo su
@@ -13,14 +26,14 @@
# ulimit -c unlimited
FROM meabed/debian-jdk
-MAINTAINER Mohamed Meabed "[email protected]"
+MAINTAINER Nutch Developers "[email protected]"
USER root
ENV DEBIAN_FRONTEND noninteractive
# ADD DataStax sources
-RUN echo "deb http://debian.datastax.com/community stable main" | tee -a /etc/apt/sources.list.d/cassandra.sources.list
+RUN echo "deb http://debian.datastax.com/community 2.1 main" | tee -a /etc/apt/sources.list.d/cassandra.sources.list
RUN curl -L http://debian.datastax.com/debian/repo_key | apt-key add -
RUN apt-get update
Modified: nutch/branches/2.x/docker/cassandra/cassandra/bootstrap.sh
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/cassandra/bootstrap.sh?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/cassandra/bootstrap.sh (original)
+++ nutch/branches/2.x/docker/cassandra/cassandra/bootstrap.sh Wed Sep 23 00:59:52 2015
@@ -1,4 +1,19 @@
#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
export PATH=$PATH:/usr/local/sbin/
export PATH=$PATH:/usr/sbin/
Modified: nutch/branches/2.x/docker/cassandra/nutch/Dockerfile
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/nutch/Dockerfile?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/nutch/Dockerfile (original)
+++ nutch/branches/2.x/docker/cassandra/nutch/Dockerfile Wed Sep 23 00:59:52 2015
@@ -1,30 +1,41 @@
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
#
-# Nutch
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
# meabed/debian-jdk
-# docker build -t meabed/nutch:latest .
+# docker build -t apache/nutch:2.x .
#
FROM meabed/debian-jdk
-MAINTAINER Mohamed Meabed "[email protected]"
+MAINTAINER Nutch Developers "[email protected]"
USER root
ENV DEBIAN_FRONTEND noninteractive
-ENV NUTCH_VERSION 2.3
-
#ant
-RUN apt-get install -y ant
+RUN apt-get update && apt-get install -y ant subversion --fix-missing
#Download nutch
-RUN mkdir -p /opt/downloads && cd /opt/downloads && curl -SsfLO "http://archive.apache.org/dist/nutch/$NUTCH_VERSION/apache-nutch-$NUTCH_VERSION-src.tar.gz"
-RUN cd /opt && tar xvfz /opt/downloads/apache-nutch-$NUTCH_VERSION-src.tar.gz
-#WORKDIR /opt/apache-nutch-$NUTCH_VERSION
-ENV NUTCH_ROOT /opt/apache-nutch-$NUTCH_VERSION
+RUN mkdir -p /opt/downloads && cd /opt/downloads && svn co http://svn.apache.org/repos/asf/nutch/branches/2.x apache-nutch-2.x
+RUN cd /opt
+RUN ln -s /opt/downloads/apache-nutch-2.x /opt/apache-nutch-2.x
+ENV NUTCH_ROOT /opt/apache-nutch-2.x
ENV HOME /root
#Nutch-default
-# RUN sed -i '/^ <name>http.agent.name<\/name>$/{$!{N;s/^
<name>http.agent.name<\/name>\n <value><\/value>$/
<name>http.agent.name<\/name>\n <value>iData Bot<\/value>/;ty;P;D;:y}}'
$NUTCH_ROOT/conf/nutch-default.xml
+# RUN sed -i '/^ <name>http.agent.name<\/name>$/{$!{N;s/^
<name>http.agent.name<\/name>\n <value><\/value>$/
<name>http.agent.name<\/name>\n <value>Nutch 2.X Cassandra
Docker<\/value>/;ty;P;D;:y}}' $NUTCH_ROOT/conf/nutch-default.xml
RUN vim -c 'g/name="gora-cassandra"/+1d' -c 'x' $NUTCH_ROOT/ivy/ivy.xml
RUN vim -c 'g/name="gora-cassandra"/-1d' -c 'x' $NUTCH_ROOT/ivy/ivy.xml
@@ -39,14 +50,12 @@ RUN rm $NUTCH_ROOT/lib/native/*
#Modification and compilation again
-ADD plugin/nutch2-index-html/src/plugin/ $NUTCH_ROOT/src/plugin/
-RUN sed -i '/dir="index-more" target="deploy".*/ s/.*/&\n <ant
dir="index-html" target="deploy"\/>/' $NUTCH_ROOT/src/plugin/build.xml
-RUN sed -i '/dir="index-more" target="clean".*/ s/.*/&\n <ant
dir="index-html" target="clean"\/>/' $NUTCH_ROOT/src/plugin/build.xml
-
+#ADD plugin/nutch2-index-html/src/plugin/ $NUTCH_ROOT/src/plugin/
+#RUN sed -i '/dir="index-more" target="deploy".*/ s/.*/&\n <ant
dir="index-html" target="deploy"\/>/' #$NUTCH_ROOT/src/plugin/build.xml
+#RUN sed -i '/dir="index-more" target="clean".*/ s/.*/&\n <ant
dir="index-html" target="clean"\/>/' #$NUTCH_ROOT/src/plugin/build.xml
+#RUN cd $NUTCH_ROOT && ant runtime
-RUN cd $NUTCH_ROOT && ant runtime
-
-RUN ln -s /opt/apache-nutch-$NUTCH_VERSION/runtime/local /opt/nutch
+RUN ln -s /opt/apache-nutch-2.x/runtime/local /opt/nutch
ENV NUTCH_HOME /opt/nutch
@@ -57,7 +66,7 @@ CMD mkdir -p $NUTCH_HOME/testUrls
ADD testUrls $NUTCH_HOME/testUrls
# Adding rawcontent that hold html of the page field in index to elasticsearch
-RUN sed -i '/field name="date" type.*/ s/.*/&\n\n <field
name="rawcontent" type="text" sstored="true" indexed="true"
multiValued="false"\/>\n/' $NUTCH_HOME/conf/schema.xml
+#RUN sed -i '/field name="date" type.*/ s/.*/&\n\n <field
name="rawcontent" type="text" sstored="true" indexed="true"
multiValued="false"\/>\n/' $NUTCH_HOME/conf/schema.xml
# remove nutche-site.xml default file to replace it by our configuration
RUN rm $NUTCH_HOME/conf/nutch-site.xml
@@ -66,10 +75,6 @@ ADD config/nutch-site.xml $NUTCH_HOME/co
# Port that nutchserver will use
ENV NUTCHSERVER_PORT 8899
-#RUN cd $NUTCH_HOME && ls -al
-
-#RUN mkdir -p /opt/nutch/urls && cd /opt/crawl
-
ADD bootstrap.sh /etc/bootstrap.sh
RUN chown root:root /etc/bootstrap.sh
RUN chmod 700 /etc/bootstrap.sh
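
Reviewer note: each Dockerfile `RUN` instruction executes in its own shell, so the added `RUN cd /opt` line does not change the directory seen by later instructions. The following `RUN ln -s` still works because it uses absolute paths. A minimal sketch of that behaviour, simulating two build steps with separate `sh -c` invocations (it uses `/` rather than `/opt` only so that it runs on any machine):

```shell
#!/bin/sh
# Simulate Dockerfile RUN steps: every step is a fresh shell process,
# so a directory change made in one step is gone in the next.
start_dir=$(pwd)

sh -c 'cd /'                            # like: RUN cd /opt   (no lasting effect)
after_separate=$(sh -c 'pwd')           # like the next RUN step

after_chained=$(sh -c 'cd / && pwd')    # like: RUN cd /opt && <command>

echo "separate steps: $after_separate"
echo "chained step:   $after_chained"
```

In a Dockerfile, the persistent equivalents are chaining the commands in a single `RUN` with `&&`, or setting the directory with `WORKDIR`.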
Modified: nutch/branches/2.x/docker/cassandra/nutch/bootstrap.sh
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/nutch/bootstrap.sh?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/nutch/bootstrap.sh (original)
+++ nutch/branches/2.x/docker/cassandra/nutch/bootstrap.sh Wed Sep 23 00:59:52 2015
@@ -1,4 +1,19 @@
#!/bin/bash
+#
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
export PATH=$PATH:/usr/local/sbin/
export PATH=$PATH:/usr/sbin/
Modified: nutch/branches/2.x/docker/cassandra/nutch/config/nutch-site.xml
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/nutch/config/nutch-site.xml?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/nutch/config/nutch-site.xml (original)
+++ nutch/branches/2.x/docker/cassandra/nutch/config/nutch-site.xml Wed Sep 23 00:59:52 2015
@@ -1,5 +1,21 @@
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
+<!--
+ Licensed to the Apache Software Foundation (ASF) under one or more
+ contributor license agreements. See the NOTICE file distributed with
+ this work for additional information regarding copyright ownership.
+ The ASF licenses this file to You under the Apache License, Version 2.0
+ (the "License"); you may not use this file except in compliance with
+ the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
+-->
<configuration>
@@ -20,16 +36,8 @@
<value>0.0.1</value>
</property>
<property>
- <name>http.agent.url</name>
- <value>http://www.google.com</value>
- </property>
- <property>
- <name>http.agent.email</name>
- <value>[email protected]</value>
- </property>
- <property>
<name>http.content.limit</name>
- <value>1000000</value>
+ <value>-1</value>
</property>
<property>
<name>storage.data.store.class</name>
@@ -37,35 +45,6 @@
<description>Default class for storing data</description>
</property>
<property>
- <name>fetcher.server.delay</name>
- <value>2.0</value>
- <description>The number of seconds the fetcher will delay between
- successive requests to the same server.
- </description>
- </property>
- <property>
- <name>indexer.max.title.length</name>
- <value>300</value>
- <description>The maximum number of characters of a title that are indexed. A value of -1 disables this check.
- Used by index-basic.
- </description>
- </property>
- <property>
- <name>db.ignore.external.links</name>
- <value>true</value>
- <description>If true, outlinks leading from a page to external hosts
- will be ignored. This is an effective way to limit the crawl to include
- only initially injected hosts, without creating complex URLFilters.
- </description>
- </property>
- <property>
- <name>fetcher.parse</name>
- <value>true</value>
- <description>If true, fetcher will parse content. NOTE: previous releases would
- default to true. Since 2.0 this is set to false as a safer default.
- </description>
- </property>
- <property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor|more|html)|urlnormalizer-(pass|regex|basic)|scoring-opic|protocol-httpclient|language-identifier|indexer-solr</value>
<description>Regular expression naming plugin directory names to
Modified: nutch/branches/2.x/docker/cassandra/nutch/testUrls/seed.txt
URL:
http://svn.apache.org/viewvc/nutch/branches/2.x/docker/cassandra/nutch/testUrls/seed.txt?rev=1704754&r1=1704753&r2=1704754&view=diff
==============================================================================
--- nutch/branches/2.x/docker/cassandra/nutch/testUrls/seed.txt (original)
+++ nutch/branches/2.x/docker/cassandra/nutch/testUrls/seed.txt Wed Sep 23 00:59:52 2015
@@ -1 +1,16 @@
-http://www.google.com
+# Licensed to the Apache Software Foundation (ASF) under one or more
+# contributor license agreements. See the NOTICE file distributed with
+# this work for additional information regarding copyright ownership.
+# The ASF licenses this file to You under the Apache License, Version 2.0
+# (the "License"); you may not use this file except in compliance with
+# the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+
+http://nutch.apache.org