Thanks for the update on that.
The last thing that I need to do is kill the index:) So hopefully
Nutch 1.0 will be out soon and stable.
-John
On Mar 3, 2009, at 1:51 PM, Bartosz Gadzimski wrote:
Hi,
Nutch trunk was very stable, but over the last few days it has had
some problems. You can try a build from last week; it's much better
than 0.9.
It looks like they should release rc1 in 2-3 weeks.
John Martyniak wrote:
Thank you, that sounds like a close match to what I am looking for.
It looks like it is part of 1.0, and I am only using 0.9 at this time.
I think I read that an RC1 is coming out soon, so I might use
that, but first I will download a nightly build and play with it.
My site isn't in production yet, but we are building a large index
(right now it is sitting at around 3.5 million URLs). Is Nutch trunk
stable enough to use for the fetching and indexing?
-John
On Mar 3, 2009, at 11:11 AM, Bartosz Gadzimski wrote:
Oh, I forgot. I didn't test that one, so I can't tell you how it works.
I know that many people run generate, fetch, etc. loops
very often to keep sites fresh.
John Martyniak wrote:
Justin,
Thanks for the info, this is very helpful.
This value would apply to all pages though. I was thinking that
if you have things like youtube.com, cnn.com, etc in your index
you would probably want them to be re-fetched more frequently.
So I was wondering if there is some filter or other plugin in
Nutch that will let you specify that value.
One option I was thinking of was creating a list of the URLs and
then importing it when creating the segment to be fetched. But
that would involve a lot of outside processing.
-John
On Mar 3, 2009, at 10:51 AM, Justin Yao wrote:
Hi John,
You can find the parameters below in conf/nutch-default.xml. You can
change the values and put your own in conf/nutch-site.xml:
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>(DEPRECATED) The default number of days between
re-fetches of a page.
</description>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>2592000</value>
<description>The default number of seconds between re-fetches of
a page (30 days).
</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<value>7776000</value>
<description>The maximum number of seconds between re-fetches of
a page (90 days). After this period every page in the db will be
re-tried, no matter what is its status.
</description>
</property>
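As a rough sketch of such an override, assuming you want pages re-fetched about once a day, conf/nutch-site.xml might look like the following. The 86400 value (1 day in seconds) is only an illustration, and this assumes a build that reads the seconds-based property rather than the deprecated days-based one:

```xml
<?xml version="1.0"?>
<configuration>
  <!-- Illustrative override: 86400 seconds = 1 day (assumed value, not a recommendation) -->
  <property>
    <name>db.fetch.interval.default</name>
    <value>86400</value>
    <description>The default number of seconds between re-fetches of
    a page (1 day).
    </description>
  </property>
</configuration>
```

Like the defaults above, this value would apply to every page in the db, not to individual sites.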
Justin
John Martyniak wrote:
How does Nutch determine when content needs to be re-fetched?
The way that I understand it, it is the "next fetch" date,
which is 7 days in the future.
Is there any way to change that? Or to shorten the fetching
interval? Or somehow base it on how many times a piece of
content is requested?
I would like to keep the content as fresh as possible, and the
information changes more frequently than every 7 days.
Thanks in advance,
-John