Thomas,
This looks like exactly what I need. I can create a fetchlist of the URLs
I need to crawl and then fetch them. I essentially don't have to worry
about the older entries unless they are modified.
Two questions:
First of all, will this re-fetch old documents already in the database? In
my case, if a forum topic is updated it is put into the flat URL list.
Would it be re-fetched with this tool?
Secondly, can anyone point me in the direction of how to properly set this
up? As I mentioned in another post, I'm lost when it comes to Java. I want
to be able to compile this and use it, but the last thing I want to do is
screw anything up.
Matt
----- Original Message -----
From: "TDLN" <[EMAIL PROTECTED]>
To: <[email protected]>; "Honda-Search Administrator"
<[EMAIL PROTECTED]>
Sent: Sunday, June 25, 2006 3:02 AM
Subject: Re: Will pay for someone to help
Matt,
AFAIK Nutch does not support fetching arbitrary fetch lists out of the
box.
There is a tool in JIRA that supports this, though:
http://issues.apache.org/jira/browse/NUTCH-68.
- Thomas
On 6/25/06, Honda-Search Administrator <[EMAIL PROTECTED]> wrote:
I'm having a difficult time configuring Nutch to behave the way I want it
to behave.
In a nutshell here is my situation:
I crawl a number of forums that relate to Hondas every night for posts.
The purpose of my website is to be a search engine for all of the forums
at once.
I have a base set of URLs in the webDB right now. Every day I write a file
of URLs (that I place in urls/inject.txt) that I want Nutch to inject into
the database to crawl. I do NOT want to recrawl other URLs. I only want to
crawl/recrawl the URLs in my list.
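
The daily list-writing step described above might be sketched roughly as
follows. This is only an illustration in Python and is not part of Nutch:
the function name write_inject_list and the example URLs are hypothetical,
and the only detail taken from the message is the urls/inject.txt path. It
assumes the list should hold one URL per line with duplicates removed.

```python
from pathlib import Path
from typing import Iterable


def write_inject_list(urls: Iterable[str], path: str = "urls/inject.txt") -> int:
    """Write a de-duplicated, one-URL-per-line file and return the URL count.

    Hypothetical helper: it only prepares the flat URL file; it does not
    call Nutch or touch the webDB.
    """
    seen = set()
    ordered = []
    for raw in urls:
        url = raw.strip()
        # Skip blank entries and anything that is not an http(s) URL.
        if not url or not url.startswith(("http://", "https://")):
            continue
        # Keep the first occurrence of each URL, preserving input order.
        if url not in seen:
            seen.add(url)
            ordered.append(url)

    target = Path(path)
    # Create the urls/ directory if it does not exist yet.
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text("".join(u + "\n" for u in ordered), encoding="utf-8")
    return len(ordered)


if __name__ == "__main__":
    # Made-up forum topic URLs standing in for the day's updated threads.
    updated_today = [
        "http://forum.example.com/topic/101",
        "http://forum.example.com/topic/101",
        "http://forum.example.com/topic/205",
    ]
    count = write_inject_list(updated_today)
    print(f"wrote {count} URLs")
```

A script like this could be run from cron before the nightly crawl, so the
file only ever contains the topics that changed that day.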
Can you help me configure Nutch (or help with the correct scripts, cron
jobs, etc.) to do this? I've tried without success.
I am running Nutch 0.7.2 and am totally confused about what to do next. It
seems to me to be a simple fix, but I can't figure it out.
As I mentioned, I will pay if someone can set me up. I've run the crawl a
number of times now and I just keep on screwing things up.
Matt