Blocking a page in a URL filter means the page will not be fetched at
all, so its links would never be followed; that doesn't solve your problem.
You can remove the page from the index manually, e.g. by using the
PruneIndexTool.
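For reference, a rough sketch of invoking the tool from the command line. The class path, the index directory name, and the -queries option are assumptions based on the 0.7-era tools and may differ in your version, so check the tool's usage output first:

```
# Hypothetical invocation; verify the class name and flags against your Nutch version.
# queries.txt would contain one Lucene query per line identifying the documents
# to prune, e.g.:  url:http://intranet.example.com/startpage.html
bin/nutch org.apache.nutch.tools.PruneIndexTool index -queries queries.txt
```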
However, I have something here that can also solve the problem, but I
need some more time to prepare a patch.
Stefan
On 21.01.2006 at 16:54, Franz Werfel wrote:
Yes, that is an option we are certainly considering, but we would
rather just set up a start page and forget about it.
Cheers, Fr
On 1/20/06, Neal Whitley <[EMAIL PROTECTED]> wrote:
Franz,
Someone else will need to confirm this...
FYI: why not simply inject the URLs directly into Nutch?
./nutch inject db/ -urlfile seeds.txt
At 03:49 PM 1/20/2006, you wrote:
Thank you, but if I do that, will the page still be read for URLs?
Cheers, Frank
On 1/20/06, Neal Whitley <[EMAIL PROTECTED]> wrote:
Franz,
I *think* you could use the regex URL filter (regex-urlfilter.txt) to
keep this page out of the index.
Something like: -^http://([a-z0-9]*\.)*tripod.com/
I am new to Nutch so I make no guarantee... :-)
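As a side note, you can sanity-check a candidate pattern outside Nutch before editing regex-urlfilter.txt. The leading '-' in the filter file means "exclude", so only the regex itself is tested below; the sample URL is just a stand-in for the page you want to match:

```shell
# Test the exclusion regex (without its leading '-') against a sample URL
# using grep -E (extended regular expressions, same general syntax).
pattern='^http://([a-z0-9]*\.)*tripod.com/'
echo 'http://members.tripod.com/~frank/links.html' | grep -qE "$pattern" \
  && echo "excluded by filter" \
  || echo "not matched"
```

Keep in mind, though, that as Stefan points out a URL filter keeps the page from being fetched entirely, not just from being indexed.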
Neal
At 05:23 AM 1/20/2006, you wrote:
Hello,
We are trying to implement Nutch on an intranet and have set up a
special page which has links to all the other pages of the site, since
many of them are not linked together.
We will start with this special page and crawl from there to all the
other pages, but we would like not to index the first page itself (so
that it doesn't show up in search results), just use it for its links.
Is this easily possible?
Thank you.
---------------------------------------------------------------
company: http://www.media-style.com
forum: http://www.text-mining.org
blog: http://www.find23.net