I do something similar on a daily basis with Nutch 0.7.  I look in the DATE_DIR
folder for new files to index and pass them to Nutch via the fetch_new.txt
file. Here is the daily indexing script I use (since the files are local, I
replace the filesystem root with my web server's base URL).

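# Collect the new files and turn their local paths into URLs.
find /${DATE_DIR} -name '*.txt' > out.txt
sed -e 's@/@http://<myserver.com>/@' < out.txt > fetch_new.txt
# Inject the URLs into the db and generate a new segment.
nutch inject db -urlfile ./fetch_new.txt
nutch generate db segments
# Pick the most recent segment (the last one in sorted order).
s=$(ls -d segments/2* | tail -n 1)
echo "Segment is $s"
nutch fetch "$s"
echo "Done Fetching"
# Update the db from the fetch, run link analysis, then index and dedup.
nutch updatedb db "$s"
nutch analyze db 2
nutch index "$s"
nutch dedup segments tmpfile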
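
The path-to-URL step above can also be written as a small filter, which makes it easier to try on its own. This is only a sketch: paths_to_urls and BASE_URL are names made up for illustration (they are not part of Nutch), and http://myserver.example stands in for the real server. Unlike the sed line above, it anchors the match with ^ so that only a leading slash is rewritten.

```shell
#!/bin/sh
# Sketch only: paths_to_urls and BASE_URL are hypothetical names, not Nutch features.
# BASE_URL must not contain '@' or '&', since both are special in the sed expression.
BASE_URL="${BASE_URL:-http://myserver.example}"

# Read absolute file paths on stdin, write one URL per line on stdout.
paths_to_urls() {
    # Replace only the leading slash; slashes deeper in the path are kept.
    sed -e "s@^/@${BASE_URL}/@"
}
```

It would be used in place of the two find/sed lines, for example: find "/${DATE_DIR}" -name '*.txt' | paths_to_urls > fetch_new.txt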

I have the refetch time set high.

<property>
 <name>db.default.fetch.interval</name>
 <value>120</value>
 <description>The default number of days between re-fetches of a page.
 </description>
</property>

Every month or so I do a segment merge:

nutch mergesegs -dir segments  -i -ds

And lately I've been deleting the /contents folder in the segments
directory. I don't need the cached version of the files because I have
them on the local filesystem, and this helps save disk space (in 0.8-dev
it's a property option).
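
For what it's worth, the cleanup can be scripted along these lines. This is a rough sketch, not a Nutch command: drop_segment_content is a hypothetical helper, and the folder name is passed in as an argument because it is an assumption that should be checked against what the segments actually contain. It relies on -mindepth and -maxdepth, which GNU and BSD find support.

```shell
#!/bin/sh
# Sketch only: drop_segment_content is a hypothetical helper, not part of Nutch.
# $1 = segments directory
# $2 = name of the per-segment folder to delete (an assumption; check your segments)
drop_segment_content() {
    segments_dir=$1
    folder_name=$2
    # Match only folders directly inside each segment (depth 2 below $segments_dir).
    find "$segments_dir" -mindepth 2 -maxdepth 2 -type d -name "$folder_name" \
        -exec rm -rf {} +
}
```

A call would look like: drop_segment_content segments contents. Since this deletes data, it is worth trying on a copy of the segments directory first.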

Roberto

On 6/25/06, Honda-Search Administrator <[EMAIL PROTECTED]> wrote:

Thomas,

It appears to me that this is exactly what I need.  I can create a fetchlist
of the URLs I need to crawl and can then fetch them.  I can essentially not
worry about the older entries unless they are modified.

Two questions:

First of all, will this re-fetch old documents already in the database?  In
my case, if a forum topic is updated it would be put into the flat URL
list.  Would it be refetched with this tool?

Secondly, can anyone point me in the direction of how to properly set this
up?  As I mentioned in another post, I'm lost when it comes to Java.  I want
to be able to compile this and use it, but the last thing I want to do is
screw anything up.

Matt

----- Original Message -----
From: "TDLN" <[EMAIL PROTECTED]>
To: <[email protected]>; "Honda-Search Administrator"
<[EMAIL PROTECTED]>
Sent: Sunday, June 25, 2006 3:02 AM
Subject: Re: Will pay for someone to help


> Matt,
>
> AFAIK Nutch does not support fetching arbitrary fetch lists out of the box.
>
> There is a tool in JIRA that supports this though:
> http://issues.apache.org/jira/browse/NUTCH-68.
>
> - Thomas
>
>
> On 6/25/06, Honda-Search Administrator <[EMAIL PROTECTED]> wrote:
>> I'm having a difficult time configuring nutch to behave the way I want
>> it to behave.
>>
>> In a nutshell here is my situation:
>>
>> I crawl a number of forums that relate to Hondas every night for posts.
>> The purpose of my website is to be a search engine for all of the forums
>> at once.
>>
>> I have a base set of URLs in the webDB right now.  Every day I write a
>> file of URLs (that I place in urls/inject.txt) that I want nutch to
>> inject into the database to crawl.  I do NOT want to recrawl other URLs.
>> I only want to crawl/recrawl the URLs in my list.
>>
>> Can you help me configure nutch (or help with the correct scripts,
>> crons, etc.) to do this?  I've tried without success.
>>
>> I am running nutch 0.7.2 and am totally confused with what to do next.
>> It seems to me to be a simple fix, but I can't figure it out.
>>
>> As I mentioned I will pay if someone can set me up.  I've run the crawl
>> a number of times now and I just keep on screwing things up.
>>
>> Matt
>>
>>
>
>

