Hi,

Nutch trunk has been very stable, but over the last few days it has had some problems. You can try a build from last week; it's much better than 0.9.

It looks like they should release RC1 in 2-3 weeks.


John Martyniak wrote:
Thank you, that sounds like a close match to what I am looking for.

It looks like it is part of 1.0, but I am only using 0.9 at this time.

I think I read that an RC1 is coming out soon, so I might use that. In the meantime I will download a nightly build and play with it.

My site isn't in production yet, but we are building a large index (right now it is sitting at around 3.5 million URLs). Is Nutch trunk stable enough to use for the fetching and indexing?

-John


On Mar 3, 2009, at 11:11 AM, Bartosz Gadzimski wrote:

Oh, I forgot: I haven't tested that one, so I can't tell you how well it works.

I know that many people run the generate, fetch, etc. loop very often to keep sites fresh.

John Martyniak wrote:
Justin,

Thanks for the info, this is very helpful.

This value would apply to all pages, though. I was thinking that if you have sites like youtube.com, cnn.com, etc. in your index, you would probably want them to be re-fetched more frequently. So I was wondering if there is a filter or other plugin in Nutch that will let you specify that value per site.

One option I was thinking of was creating a list of the URLs and then importing it when creating the segment to be fetched. But that would involve a lot of outside processing.

-John

On Mar 3, 2009, at 10:51 AM, Justin Yao wrote:

Hi John,

You can find the parameters below in conf/nutch-default.xml. To change a value, put your own setting in conf/nutch-site.xml.


<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>(DEPRECATED) The default number of days between re-fetches of a page.
</description>
</property>

<property>
<name>db.fetch.interval.default</name>
<value>2592000</value>
<description>The default number of seconds between re-fetches of a page (30 days).
</description>
</property>

<property>
<name>db.fetch.interval.max</name>
<value>7776000</value>
<description>The maximum number of seconds between re-fetches of a page
(90 days). After this period every page in the db will be re-tried, no
matter what is its status.
</description>
</property>
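
For example, to re-fetch pages daily instead of every 30 days, an override in conf/nutch-site.xml might look roughly like the sketch below. The 86400-second value (1 day) is only an illustrative assumption, not a recommendation, and it applies to every page in the db rather than to individual sites.

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative value only: 86400 seconds = 1 day between re-fetches. -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
    <description>The default number of seconds between re-fetches of a page (1 day).
    </description>
  </property>
</configuration>
```

If conf/nutch-site.xml already exists, only the property element needs to be added inside its existing configuration element. Values set there take precedence over those in conf/nutch-default.xml.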


Justin

John Martyniak wrote:
How does Nutch determine when content needs to be re-fetched? The way I understand it, it uses the "next fetch" date, which is 7 days in the future. Is there any way to change that, or to make it fetch more often? Or to somehow base it on how many times a piece of content is requested? I would like to keep the content as fresh as possible, and the information changes more frequently than every 7 days.
Thanks in advance,
-John





