No problem, I'm glad to be able to pass along something (doing my part in this
little corner of the project! :)
Ah, I see the FooFactory site is back up, just as I was about to post
everything! Since it's back, I'll just post the other files for now. This is a
sample
nutch-site.xml (just fill in your info). Then a regex-urlfilter.txt. From
what I understand, this setup would grab anything under
http://lucene.apache.org NOT under http://lucene.apache.org/solr and anything
under http://tomcat.apache.org NOT under
http://tomcat.apache.org/connectors-doc . I couldn't think of a good way to
represent the urls directory, but the simulated "listing" in urls-listing.txt
is about what I meant. Hopefully it'll make sense. I'll warn you that there's
very likely a better way to do this (anyone have ideas?). I just know that
this particular setup worked for me (famous last words ;)
If you do have the time to look into it, NUTCH-442 would be a good place to
start. I'm going to try to spend some time with that code in the near future
to see if I can get the wiki page "updated" to something better.
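Oh, and since the script link was one of the things you couldn't reach: I
can't swear this is what the FooFactory script did, but the cycle I run by
hand looks roughly like the sketch below. Note the solrindex step only exists
once the NUTCH-442 patch is applied, and I *think* it picks up
indexer.solr.url from nutch-site.xml, so double-check it against the patch.

#!/bin/sh
# Rough sketch of one whole crawl/index run -- not the FooFactory script.
bin/nutch inject crawl/crawldb urls
for i in 1 2 3; do                  # three fetch rounds; pick your own depth
  bin/nutch generate crawl/crawldb crawl/segments
  segment=`ls -d crawl/segments/* | tail -1`  # the segment generate just made
  bin/nutch fetch $segment
  bin/nutch updatedb crawl/crawldb $segment
done
bin/nutch invertlinks crawl/linkdb crawl/segments/*
# solrindex comes from the NUTCH-442 patch; as far as I can tell it reads
# indexer.solr.url from nutch-site.xml (set below)
bin/nutch solrindex crawl/crawldb crawl/linkdb crawl/segments/*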
----- Original Message -----
From: "Gene Campbell" <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, May 29, 2008 6:12:28 PM GMT -06:00 US/Canada Central
Subject: Re: Nutch, Solr, Lucene - resources
Thanks for the helpful reply, very kind!
>
>> I don't know what's up with FooFactory at the moment, but I put together
>> the Solr&Nutch page. I may be able to send/post something. Was there
>> something in particular you were looking for?
I found the google cached site, but there are some links to other
resources that I need to see. I tried to search google for them
(site:www.foofactory.fi) - no luck.
These are the links to the information I'd like to see (copied from the
FooFactory site page):
1. Set up conf/regex-urlfilter.txt
2. Set up conf/nutch-site.xml
3. Generate a list of seed urls into folder urls
4. Grab this simple script that will help you along in your crawling task.
and
A patch against Nutch trunk is provided for those who wish to be brave.
The "patch" is a link. (Incidently, I don't know if it's required, or
optional, but I figured I'd be brave.)
urls-listing.txt:

urls/
url1
url2
Where the file url1 contains just this line:
http://lucene.apache.org/
url2 contains just this line:
http://tomcat.apache.org/
and so on...
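(If it helps, creating that directory is literally just:

mkdir urls
echo 'http://lucene.apache.org/' > urls/url1
echo 'http://tomcat.apache.org/' > urls/url2

The inject step reads every file under urls/, so the url1/url2 names are
arbitrary.)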
regex-urlfilter.txt:

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# The default url filter, tweaked here for a site-specific crawl
# (the stock version is set up for whole-internet crawling).
# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# skip image and other suffixes we can't yet parse
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
# the first matching pattern wins, so exclude the Solr and
# connectors-doc subtrees before accepting the rest of each site
-^http://lucene.apache.org/solr/*
-^http://tomcat.apache.org/connectors-doc/*
+^http://lucene.apache.org/*
+^http://tomcat.apache.org/*
# reject anything else (we only want the two sites above)
-.
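If I remember right, you can sanity-check the filter by piping URLs through
the regex plugin's main class, something like:

echo 'http://lucene.apache.org/solr/' | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter

It should echo each URL back prefixed with '+' (accepted) or '-' (rejected).
(That's from memory, so double-check the class name.)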
nutch-site.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<configuration>
<property>
<name>http.agent.name</name>
<value>SomeShortNameForTheCrawler</value>
</property>
<property>
<name>http.agent.description</name>
<value>Longer name for the crawler</value>
</property>
<property>
<name>http.agent.url</name>
<value>http://homeurl.for.crawler</value>
</property>
<property>
<name>http.agent.email</name>
<value>myadmin at mycompany dot com</value>
</property>
<property>
<name>indexer.solr.url</name>
<value>http://127.0.0.1:8983/solr/</value>
</property>
</configuration>