Hi Enis, Right, I can easily delete the page from the Lucene index, though I'd prefer to follow the Nutch protocol and avoid messing something up by touching the index directly. However, I don't want that page to re-appear in one of the subsequent fetches. Well, it won't re-appear, because it will remain missing, but it would be great to be able to tell Nutch to "forget it" "from everywhere". Is that doable? I could read and re-write the *Db Maps, but that's a lot of IO... just to get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch flags a given page as "forget this page as soon as possible" and it just happens later on.
Thanks,
Otis
. . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
Simpy -- http://www.simpy.com/ - Tag - Search - Share

----- Original Message ----
From: Enis Soztutar <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thursday, April 5, 2007 3:29:55 AM
Subject: Re: [Nutch-general] Removing pages from index immediately

Since Hadoop's map files are write-once, it is not possible to delete individual URLs from the crawldb and linkdb. The only thing you can do is re-create the map files without the deleted URLs. But running the crawl once more, as you suggested, seems more appropriate. Deleting documents from the index is just Lucene stuff. In your case it seems that every once in a while you crawl the whole site and create the indexes and dbs, then just throw the old ones out. Between two crawls you can delete the URLs from the index.

[EMAIL PROTECTED] wrote:
> Hi,
>
> I'd like to be able to immediately remove certain pages from Nutch (index,
> crawldb, linkdb...).
> The scenario is that I'm using Nutch to index a single site or a set of
> internal sites. Once in a while editors of the site remove a page from the
> site. When that happens, I want to update at least the index and ideally
> the crawldb and linkdb, so that people searching the index don't get the
> missing page in results and end up going there, hitting a 404.
>
> I don't think there is a "direct" way to do this with Nutch, is there?
> If there really is no direct way to do this, I was thinking I'd just put
> the URL of the recently removed page into the next fetchlist and then
> somehow get Nutch to immediately remove that page/URL once it hits a 404.
> How does that sound?
>
> Is there a way to configure Nutch to delete a page after it gets a 404 for
> it even just once? I thought I saw the setting for that somewhere a few
> weeks ago, but now I can't find it.
>
> Thanks,
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/ - Tag - Search - Share

-------------------------------------------------------------------------
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT & business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
_______________________________________________
Nutch-general mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-general
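[Editor's note: the "just Lucene stuff" deletion that Enis mentions, done between two crawls, might look roughly like the sketch below. This is a minimal, hedged example assuming the Lucene 2.x API that shipped with Nutch 0.8/0.9, and assuming the page URL can be matched as an exact term in the index's "url" field; the index path and URL are placeholders, not values from this thread.]

```java
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;

/**
 * Sketch: delete a single URL from a Nutch/Lucene index between two
 * full crawls. Assumes Lucene 2.x (IndexReader.open(String) and
 * deleteDocuments(Term) existed in that API); the index directory
 * and URL below are hypothetical placeholders.
 */
public class DeleteUrlFromIndex {
    public static void main(String[] args) throws Exception {
        String indexDir = "crawl/index";  // placeholder index location
        String url = "http://www.example.com/removed-page.html";

        // IndexReader, opened read-write, can delete documents in Lucene 2.x.
        IndexReader reader = IndexReader.open(indexDir);
        try {
            // Assumes the URL is queryable as an exact term in the "url"
            // field; deleteDocuments(Term) returns the number deleted.
            int deleted = reader.deleteDocuments(new Term("url", url));
            System.out.println("Deleted " + deleted + " document(s)");
        } finally {
            reader.close();  // closing the reader commits the deletions
        }
    }
}
```

As Enis notes, this only cleans the index; the crawldb and linkdb entries remain until the map files are rebuilt or the next full crawl replaces them.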
