Re: Incremental crawl using Nutch

Dennis Kubes Fri, 23 Feb 2007 06:50:27 -0800

sandeep pujar wrote:

By incremental I meant after a full crawl then next
crawls should fetch only the changed pages.

The problem with fetching changed pages is that you need to know which pages have changed. Once you do, you can load only the changed pages through an inject, generate, fetch cycle and then merge the crawldb and segments with the previously fetched results. The python script performs this type of process, but for new unfetched links rather than for changed pages. You may be able to modify it to fetch only changed pages.
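
As a rough illustration of the "know which pages have changed" step, here is a minimal sketch in Python. It is not part of the automation script on the wiki. It assumes you keep your own record of a content digest per URL from the previous crawl and can obtain the current page bodies some other way. The names digest, changed_urls, write_seed_list and the urls.txt file name are all hypothetical. The resulting directory would then be what you hand to the inject step.

```python
import hashlib
import os


def digest(content):
    """Return an MD5 hex digest of the raw page bytes."""
    return hashlib.md5(content).hexdigest()


def changed_urls(previous, current):
    """Return the URLs whose content differs from the last crawl.

    previous: dict of url -> digest recorded after the last crawl
              (an assumed bookkeeping structure, not a Nutch format)
    current:  dict of url -> page body as bytes, as seen now

    URLs that were never crawled before are skipped here, since the
    goal is re-fetching changed pages rather than new links.
    """
    return sorted(
        url
        for url, body in current.items()
        if url in previous and previous[url] != digest(body)
    )


def write_seed_list(urls, seed_dir):
    """Write one URL per line into seed_dir/urls.txt and return the path."""
    os.makedirs(seed_dir, exist_ok=True)
    path = os.path.join(seed_dir, "urls.txt")
    with open(path, "w", encoding="utf-8") as handle:
        for url in urls:
            handle.write(url + "\n")
    return path
```

Comparing digests is only one way to detect change. HTTP Last-Modified headers or a fixed re-fetch schedule would be alternatives, and which one fits depends on the sites being crawled.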


Dennis Kubes


I was not clear on how I could use the python
automation script for that.

Is there something I am missing here ?


--- Dennis Kubes <[EMAIL PROTECTED]> wrote:

You can use the python automation script found at:

http://wiki.apache.org/nutch/Automating_Fetches_with_Python

I almost have a new version ready.  Will post it in
the next couple of days to the wiki.


Dennis Kubes

sandeep pujar wrote:

Greetings,

Are there ways we can initiate an incremental crawl/index
using Nutch?

I tried to look up wikis and other sources and did not
find much information.

Any ideas or pointers?

Thanks,
Sandeep

