Otis,

Check out the purge tool (bin/nutch purge). It's easy to remove URLs individually or based on regular expressions, but you'll need to learn Lucene syntax to do it. It will remove certain pages from the index, but it won't exclude them from being recrawled the next time around. For that you'll need to change your filters in your conf directory.

----- Original Message -----
From: <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Friday, July 07, 2006 9:50 AM
Subject: Re: [Nutch-general] Link db (traversal + modification)

> Thanks Stefan.
> So one has to iterate and re-write the whole graph, and there is no way to
> just modify it on the fly by, for example, removing specific links/pages?
>
> Thanks,
> Otis
>
> ----- Original Message ----
> From: Stefan Groschupf <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, July 7, 2006 1:52:24 AM
> Subject: Re: [Nutch-general] Link db (traversal + modification)
>
> Hi Otis,
>
> the link graph lives in the linkdb.
> I suggest writing a small map reduce tool that reads the existing
> linkDb, filters out the pages you want to remove, and writes the result
> back to disk.
> This will be just a couple of lines of code.
> The hadoop package comes with some nice map reduce examples.
>
> Stefan
>
>
> On 06.07.2006, at 22:47, <[EMAIL PROTECTED]> wrote:
>
>> Hi,
>>
>> What's the best way to traverse the graph of all fetched pages and
>> optionally modify it (e.g. remove a page because you know it's spam)?
>> I looked at various Nutch classes, and only LinksDbReader looks
>> like it lets you iterate through all links (and for each link get
>> its inlinks). Is this right?
>>
>> But how would one go about modifying the links db?
>> Perhaps I should be asking about where/how the links db is stored
>> on disk, and whether one should just access and modify that data
>> directly on disk?
>>
>> Thanks,
>> Otis
>
> _______________________________________________
> Nutch-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-general
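The read, filter, and write-back approach Stefan describes in the thread can be illustrated with a rough sketch. This is plain Python over an in-memory dictionary, not Nutch or Hadoop code: the `link_graph` layout, the `unwanted` patterns, and the function names are all hypothetical, and a real tool would read and write the linkdb on disk through a map reduce job instead.

```python
import re

# Hypothetical in-memory stand-in for the linkdb: page URL -> list of inlink URLs.
# The real linkdb is stored on disk by Nutch; this layout is an assumption.
link_graph = {
    "http://example.org/": ["http://example.org/about", "http://spam.example.com/x"],
    "http://example.org/about": ["http://example.org/"],
    "http://spam.example.com/x": ["http://example.org/"],
}

# Hypothetical patterns describing the pages to drop (e.g. known spam).
unwanted = [re.compile(r"^http://spam\.example\.com/")]


def is_unwanted(url, patterns):
    """Return True if any pattern matches the URL."""
    return any(p.search(url) for p in patterns)


def filter_link_graph(graph, patterns):
    """Build a new graph without unwanted pages, as targets or as inlinks."""
    filtered = {}
    for url, inlinks in graph.items():
        if is_unwanted(url, patterns):
            continue  # drop the page itself
        # keep only inlinks that come from pages we are not removing
        filtered[url] = [src for src in inlinks if not is_unwanted(src, patterns)]
    return filtered


# The original graph is left untouched; the result is a fresh copy,
# mirroring the "rewrite the whole graph" approach from the thread.
cleaned = filter_link_graph(link_graph, unwanted)
for url, inlinks in sorted(cleaned.items()):
    print(url, "<-", inlinks)
```

Because the filter builds a new graph instead of editing in place, it matches the point made in the thread that the whole graph is rewritten. As noted above, pages removed this way can still be recrawled unless the URL filters in the conf directory are changed as well.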
