I have a similar issue and have begun working on a tool that would prune an index using a file of regexes. When I get it working I will be happy to make it publicly available.
-Bryan

On 1/23/06, Stefan Groschupf <[EMAIL PROTECTED]> wrote:
>
> Blocking a page in a url filter will also not fetch a page, so that
> doesn't solve your problem.
> You can remove the page manually from the index e.g. by using
> PruneIndexTool.
> However I have something here that also can solve the problem but I
> need some more time to prepare a patch.
>
> Stefan
>
> On 21.01.2006 at 16:54, Franz Werfel wrote:
>
> > Yes, that is an option we are certainly considering, but we would
> > rather have a start page and forget about it.
> > Cheers, Fr
> >
> > On 1/20/06, Neal Whitley <[EMAIL PROTECTED]> wrote:
> >> Franz,
> >>
> >> Someone else will need to confirm this...
> >>
> >> FYI...why not simply inject the urls directly into Nutch?
> >>
> >> ./nutch inject db/ -urlfile seeds.txt
> >>
> >>
> >> At 03:49 PM 1/20/2006, you wrote:
> >>
> >>> Thank you, but if I do that will the page be read for urls?
> >>> Cheers, Frank
> >>>
> >>> On 1/20/06, Neal Whitley <[EMAIL PROTECTED]> wrote:
> >>>> Franz,
> >>>>
> >>>> I 'think' you could use the regex url filter to not index this page
> >>>> (regex-urlfilter.txt).
> >>>>
> >>>> Something like: -^http://([a-z0-9]*\.)*tripod.com/
> >>>>
> >>>> I am new to Nutch so I make no guarantee... :-)
> >>>>
> >>>> Neal
> >>>>
> >>>>
> >>>> At 05:23 AM 1/20/2006, you wrote:
> >>>>
> >>>>> Hello,
> >>>>>
> >>>>> We are trying to implement Nutch on an intranet and have setup a
> >>>>> special page which has links to all the other pages of the site,
> >>>>> since many are not linked together.
> >>>>> We will start with this special page and then go from there to
> >>>>> all the other pages, but we would like to not index the first
> >>>>> page (so that it doesn't show up in search results), just use it
> >>>>> for its links.
> >>>>> Is it possible easily?
> >>>>>
> >>>>> Thank you.
>
> ---------------------------------------------------------------
> company: http://www.media-style.com
> forum: http://www.text-mining.org
> blog: http://www.find23.net
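
To make the regex-file idea from the top of this message a little more concrete, here is a rough Python sketch of the selection step only: given a list of indexed URLs and a file of regexes, decide which URLs would be pruned. This is not the tool mentioned above and it does not use PruneIndexTool or any Nutch API. The function names, the one-regex-per-line file layout with `#` comments, and the intranet.example.com URLs are all assumptions made up for illustration.

```python
import re


def load_patterns(lines):
    """Compile one regex per line, skipping blank lines and '#' comments.

    The file layout is an assumption for this sketch, not a Nutch format.
    """
    patterns = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        patterns.append(re.compile(line))
    return patterns


def split_urls(urls, patterns):
    """Return (keep, prune); a URL is pruned if any pattern matches it."""
    keep, prune = [], []
    for url in urls:
        if any(p.search(url) for p in patterns):
            prune.append(url)
        else:
            keep.append(url)
    return keep, prune


if __name__ == "__main__":
    # Hypothetical rule file contents: prune only the link hub page.
    rules = [
        "# start page that exists only to link to everything else",
        r"^http://intranet\.example\.com/all-links\.html$",
    ]
    # Hypothetical URLs that were fetched and indexed.
    indexed = [
        "http://intranet.example.com/all-links.html",
        "http://intranet.example.com/hr/policy.html",
    ]
    keep, prune = split_urls(indexed, load_patterns(rules))
    print("prune:", prune)
    print("keep:", keep)
```

The point of the sketch is that the start page is still fetched and parsed for its links, and is only dropped from the index afterwards. Actually deleting the matching documents from a Nutch index is a separate step that this code does not attempt.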
