I have a similar issue and have begun working on a tool that would prune an
index using a file of regexes. When I get it working I will be happy to make
it publicly available.
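
To make the idea concrete, here is a minimal sketch of the matching step only, assuming one regex per line in the rules file and `#` for comments. The names `load_patterns` and `urls_to_prune` and the example URLs are hypothetical. It does not touch the index itself, and it is not the tool mentioned above or PruneIndexTool.

```python
import re


def load_patterns(lines):
    """Compile one regex per line, skipping blank lines and '#' comments."""
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(re.compile(line))
    return patterns


def urls_to_prune(urls, patterns):
    """Return the URLs that match at least one pattern, in input order."""
    return [u for u in urls if any(p.search(u) for p in patterns)]


# Hypothetical rules file content and indexed URLs, for illustration only.
rules = [
    "# start page that is only used for its links",
    r"^http://intranet\.example\.com/all-links\.html$",
]
indexed = [
    "http://intranet.example.com/all-links.html",
    "http://intranet.example.com/docs/a.html",
]

print(urls_to_prune(indexed, load_patterns(rules)))
```

The URLs this returns would then be handed to whatever step actually deletes documents from the index.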

-Bryan

On 1/23/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> Blocking a page in a URL filter also prevents it from being fetched, so
> that doesn't solve your problem.
> You can remove the page manually from the index e.g. by using
> PruneIndexTool.
> However, I have something here that can also solve the problem, but I
> need some more time to prepare a patch.
>
> Stefan
>
> On 21.01.2006 at 16:54, Franz Werfel wrote:
>
> > Yes, that is an option we are certainly considering, but we would
> > rather have a start page and forget about it.
> > Cheers, Fr
> >
> > On 1/20/06, Neal Whitley <[EMAIL PROTECTED]> wrote:
> >> Franz,
> >>
> >> Someone else will need to confirm this...
> >>
> >> FYI...why not simply inject the urls directly into Nutch?
> >>
> >> ./nutch inject db/ -urlfile seeds.txt
> >>
> >>
> >> At 03:49 PM 1/20/2006, you wrote:
> >>
> >>> Thank you, but if I do that will the page be read for urls?
> >>> Cheers, Frank
> >>>
> >>> On 1/20/06, Neal Whitley <[EMAIL PROTECTED]> wrote:
> >>>> Franz,
> >>>>
> >>>> I 'think' you could use the regex url filter to not index this page
> >>>> (regex-urlfilter.txt).
> >>>>
> >>>> Something like:  -^http://([a-z0-9]*\.)*tripod.com/
> >>>>
> >>>> I am new to Nutch so I make no guarantee... :-)
> >>>>
> >>>> Neal
> >>>>
> >>>>
> >>>>
> >>>> At 05:23 AM 1/20/2006, you wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> We are trying to implement Nutch on an intranet and have set up a
> >>>>> special page that links to all the other pages of the site, since
> >>>>> many of them are not linked to one another.
> >>>>> We will start the crawl from this special page and follow its links
> >>>>> to all the other pages, but we would like to keep the start page
> >>>>> itself out of the index (so that it doesn't show up in search
> >>>>> results) and use it only for its links.
> >>>>> Is there an easy way to do this?
> >>>>>
> >>>>> Thank you.
> >>>>
> >>>>
> >>
> >>
> >
>
> ---------------------------------------------------------------
> company:        http://www.media-style.com
> forum:        http://www.text-mining.org
> blog:            http://www.find23.net
>
>
>
>
