Hi,
Nutch trunk has been very stable, but for the last few days it has had some
problems. You can try a build from last week; it's much better than 0.9.
It looks like they should release RC1 in 2-3 weeks.
John Martyniak wrote:
Thank you, that sounds like a close match to what I am looking for.
It looks like it is part of 1.0, I am only using 0.9 at this time.
But I think I read that an RC1 is coming out soon, so I might use that.
In the meantime I will download a nightly build and play with it.
My site isn't in production yet, but we are building a large index (right
now it is sitting at around 3.5 million URLs). Is Nutch trunk stable
enough to use for the fetching and indexing?
-John
On Mar 3, 2009, at 11:11 AM, Bartosz Gadzimski wrote:
Oh, I forgot. I didn't test that one, so I can't tell you how it works.
I know that many people run generate, fetch, etc. loops very
often to keep sites fresh.
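
For reference, one option in trunk that adjusts the re-fetch interval
per page is the adaptive fetch schedule. The fragment below is only a
minimal, untested sketch for conf/nutch-site.xml. It assumes the property
names as I recall them from conf/nutch-default.xml in trunk, and the
interval values are illustrative, so please check them against your own
build before relying on it.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: hypothetical sketch, not tested -->
<configuration>
  <property>
    <name>db.fetch.schedule.class</name>
    <!-- Assumption: this class is present in your trunk build -->
    <value>org.apache.nutch.crawl.AdaptiveFetchSchedule</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.min_interval</name>
    <!-- Illustrative lower bound: 1 hour = 3600 seconds -->
    <value>3600</value>
  </property>
  <property>
    <name>db.fetch.schedule.adaptive.max_interval</name>
    <!-- Illustrative upper bound: 30 days = 2592000 seconds -->
    <value>2592000</value>
  </property>
</configuration>
```

The idea is that pages found changed on re-fetch get a shorter interval
and unchanged pages a longer one, within the two bounds.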
John Martyniak wrote:
Justin,
Thanks for the info, this is very helpful.
This value would apply to all pages, though. I was thinking that if
you have things like youtube.com, cnn.com, etc. in your index, you
would probably want them to be re-fetched more frequently. So I was
wondering if there is some filter or other plugin in Nutch
that will let you specify that value.
One option I was thinking of was creating a list of the URLs and
then importing it when creating the segment to be fetched. But that
would involve a lot of outside processing.
-John
On Mar 3, 2009, at 10:51 AM, Justin Yao wrote:
Hi John,
You can find the parameters below in conf/nutch-default.xml. You can
override a value by putting your own in conf/nutch-site.xml.
<property>
<name>db.default.fetch.interval</name>
<value>30</value>
<description>(DEPRECATED) The default number of days between
re-fetches of a page.
</description>
</property>
<property>
<name>db.fetch.interval.default</name>
<value>2592000</value>
<description>The default number of seconds between re-fetches of a
page (30 days).
</description>
</property>
<property>
<name>db.fetch.interval.max</name>
<value>7776000</value>
<description>The maximum number of seconds between re-fetches of a
page
(90 days). After this period every page in the db will be re-tried, no
matter what is its status.
</description>
</property>
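
As a rough illustration of such an override, conf/nutch-site.xml could
look like the sketch below. The target intervals (1 day and 7 days) are
assumptions chosen for the example, not recommended values; the numbers
are just days converted to seconds.

```xml
<?xml version="1.0"?>
<!-- conf/nutch-site.xml: illustrative override, values are assumptions -->
<configuration>
  <property>
    <name>db.fetch.interval.default</name>
    <!-- 1 day = 24 * 60 * 60 = 86400 seconds -->
    <value>86400</value>
  </property>
  <property>
    <name>db.fetch.interval.max</name>
    <!-- 7 days = 7 * 86400 = 604800 seconds -->
    <value>604800</value>
  </property>
</configuration>
```

Values set in nutch-site.xml take precedence over those in
nutch-default.xml, so the defaults file can stay untouched.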
Justin
John Martyniak wrote:
How does Nutch determine when content needs to be re-fetched? The
way I understand it, it uses a "next fetch" date, which is 7
days in the future.
Is there any way to change that, or to adjust the fetching
interval, or somehow base it on how many times a piece of content
is requested?
I would like to keep the content as fresh as possible, and the
information changes more frequently than every 7 days.
Thanks in advance,
-John