No problem, I'm glad to be able to pass along something (doing my part in this
little corner of the project! :)
Ah, I see the FooFactory site is back up, just as I was about to post
everything! Since it's back, I'll just post the other files for now. This is a
sample
nutch-site.xml (just fill in your info). Then a regex-urlfilter.txt. From
what I understand, this setup would grab anything under
http://lucene.apache.org NOT under http://lucene.apache.org/solr and anything
under http://tomcat.apache.org NOT under
http://tomcat.apache.org/connectors-doc . I couldn't think of a good way to
represent the urls directory, but the simulated "listing" in urls-listing.txt
is about what I meant. Hopefully it'll make sense. I'll warn you that there's
very likely a better way to do this (anyone have ideas?). I just know that
this particular setup worked for me (famous last words ;)
If you do have the time to look into it, NUTCH-442 would be a good place to
start. I'm going to try to spend some time with that code in the near future
to see if I can get the wiki page "updated" to something better.
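Oh, and since the script link was one of the things you couldn't reach: I
can't swear this is what the FooFactory script did, but the cycle I run by
hand looks roughly like the sketch below. Note the solrindex step only exists
once the NUTCH-442 patch is applied, and I *think* it picks up
indexer.solr.url from nutch-site.xml, so double-check it against the patch.

#!/bin/sh
# Rough sketch of one whole crawl/index run -- not the FooFactory script.
bin/nutch inject crawl/crawldb urls
for i in 1 2 3; do                  # three fetch rounds; pick your own depth
  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/* | tail -1`  # the segment generate just made
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done
bin/nutch invertlinks crawl/linkdb crawl/segments/*
# solrindex comes from the NUTCH-442 patch; as far as I can tell it reads
# indexer.solr.url from nutch-site.xml (set below)
bin/nutch solrindex crawl/crawldb crawl/linkdb crawl/segments/*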
----- Original Message -----
From: "Gene Campbell" <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, May 29, 2008 6:12:28 PM GMT -06:00 US/Canada Central
Subject: Re: Nutch, Solr, Lucene - resources
Thanks for the helpful reply, very kind!
>
>> I don't know what's up with FooFactory at the moment, but I put together
>> the Solr&Nutch page. I may be able to send/post something. Was there
>> something in particular you were looking for?
I found the google cached site, but there are some links to other
resources that I need to see. I tried to search google for them
(site:www.foofactory.fi) - no luck.
These are the links to the information I'd like to see (copied from the
FooFactory site page):
1. Set up conf/regex-urlfilter.txt
2. Set up conf/nutch-site.xml
3. Generate a list of seed urls into folder urls
4. Grab this simple script that will help you along in your crawling task.
and
A patch against Nutch trunk is provided for those who wish to be brave.
The "patch" is a link. (Incidently, I don't know if it's required, or
optional, but I figured I'd be brave.)
urls-listing.txt:

urls/
url1
url2
Where the file url1 contains just this line:
http://lucene.apache.org/
url2 contains just this line:
http://tomcat.apache.org/
and so on...
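(If it helps, creating that directory is literally just:

mkdir urls
echo 'http://lucene.apache.org/' > urls/url1
echo 'http://tomcat.apache.org/' > urls/url2

The inject step reads every file under urls/, so the url1/url2 names are
arbitrary.)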
regex-urlfilter.txt:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter, tweaked here for a site-specific crawl
# (the stock version is set up for whole-internet crawling).
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# the first matching pattern wins, so exclude the Solr and
# connectors-doc subtrees before accepting the rest of each site
-^http://lucene.apache.org/solr/*
-^http://tomcat.apache.org/connectors-doc/*
+^http://lucene.apache.org/*
+^http://tomcat.apache.org/*
# reject anything else (we only want the two sites above)
-.
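If I remember right, you can sanity-check the filter by piping URLs through
the regex plugin's main class, something like:

echo 'http://lucene.apache.org/solr/' | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

It should echo each URL back prefixed with '+' (accepted) or '-' (rejected).
(That's from memory, so double-check the class name.)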
nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>SomeShortNameForTheCrawler</value>
</property>
<property>
<name>http.agent.description</name>
<value>Longer name for the crawler</value>
</property>
<property>
<name>http.agent.url</name>
<value>http://homeurl.for.crawler</value>
</property>
<property>
<name>http.agent.email</name>
<value>myadmin at mycompany dot com</value>
</property>
<property>
<name>indexer.solr.url</name>
<value>http://127.0.0.1:8983/solr/</value>
</property>
</configuration>