Hi Enis,

This is Franklin. I am currently using Nutch 0.7.2 for crawling and indexing for my search engine. I read in your message that you can delete a particular index entry directly. If so, how is that possible? I have been desperately searching for a clue on how to do this. My requirement is to delete the entries for porn sites from the index built from my crawled data. Your help is badly needed.

I hope you can help me with this. Thanks in advance.

Franklin.S

ogjunk-nutch wrote:
>
> Hi Enis,
>
> Right, I can easily delete the page from the Lucene index, though I'd
> prefer to follow the Nutch protocol and avoid messing something up by
> touching the index directly. However, I don't want that page to re-appear
> in one of the subsequent fetches. Well, it won't re-appear, because it
> will remain missing, but it would be great to be able to tell Nutch to
> "forget it" "from everywhere". Is that doable?
> I could read and re-write the *Db Maps, but that's a lot of IO... just to
> get a couple of URLs erased. I'd prefer a friendly persuasion where Nutch
> flags a given page as "forget this page as soon as possible" and it just
> happens later on.
>
> Thanks,
> Otis
> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> Simpy -- http://www.simpy.com/ - Tag - Search - Share
>
> ----- Original Message ----
> From: Enis Soztutar <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: Thursday, April 5, 2007 3:29:55 AM
> Subject: Re: [Nutch-general] Removing pages from index immediately
>
> Since hadoop's map files are write once, it is not possible to delete
> some urls from the crawldb and linkdb. The only thing you can do is to
> create the map files once again without the deleted urls. But running
> the crawl once more as you suggested seems more appropriate. Deleting
> documents from the index is just lucene stuff.
>
> In your case it seems that every once in a while, you crawl the whole
> site, and create the indexes and db's and then just throw the old one
> out. And between two crawls you can delete the urls from the index.
>
> [EMAIL PROTECTED] wrote:
>> Hi,
>>
>> I'd like to be able to immediately remove certain pages from Nutch
>> (index, crawldb, linkdb...).
>> The scenario is that I'm using Nutch to index a single site or a set of
>> internal sites. Once in a while editors of the site remove a page from
>> the site.
>> When that happens, I want to update at least the index and
>> ideally crawldb, linkdb, so that people searching the index don't get the
>> missing page in results and end up going there, hitting the 404.
>>
>> I don't think there is a "direct" way to do this with Nutch, is there?
>> If there really is no direct way to do this, I was thinking I'd just put
>> the URL of the recently removed page into the first next fetchlist and
>> then somehow get Nutch to immediately remove that page/URL once it hits a
>> 404. How does that sound?
>>
>> Is there a way to configure Nutch to delete the page after it gets a 404
>> for it even just once? I thought I saw the setting for that somewhere a
>> few weeks ago, but now I can't find it.
>>
>> Thanks,
>> Otis
>
> _______________________________________________
> Nutch-general mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/nutch-general

--
View this message in context: http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
Sent from the Nutch - User mailing list archive at Nabble.com.
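Enis's point in the quoted thread, that write-once map files cannot be edited in place and have to be created again without the unwanted URLs, can be illustrated with a small sketch. This is plain Python and not the Nutch or Hadoop API: the names `BLOCKED_HOSTS`, `is_blocked` and `rewrite_without` are hypothetical, the host names are placeholders, and a real crawldb or linkdb holds much richer records than the strings used here. The same URL test could in principle be used to pick which documents to delete from the Lucene index.

```python
from urllib.parse import urlsplit

# Hypothetical blocklist; these host names are placeholders, not from the thread.
BLOCKED_HOSTS = {"blocked.example.com"}


def is_blocked(url, blocked_hosts=BLOCKED_HOSTS):
    """Return True if the URL's host is a blocked host or a subdomain of one."""
    host = (urlsplit(url).hostname or "").lower()
    return any(host == b or host.endswith("." + b) for b in blocked_hosts)


def rewrite_without(entries, blocked_hosts=BLOCKED_HOSTS):
    """Build a new sorted list of (url, value) pairs; the input is left untouched.

    This mimics rewriting a write-once map file: nothing is deleted in place,
    a fresh copy is produced that simply omits the unwanted keys.
    """
    return sorted((u, v) for u, v in entries if not is_blocked(u, blocked_hosts))


# Toy stand-in for the old db: (url, status) pairs.
old_db = [
    ("http://blocked.example.com/a", "fetched"),
    ("http://www.example.org/", "fetched"),
    ("http://sub.blocked.example.com/x", "unfetched"),
]
new_db = rewrite_without(old_db)
```

Here `new_db` keeps only the `www.example.org` entry and `old_db` is unchanged, which is why Otis describes this route as a lot of IO for a couple of URLs: every surviving entry has to be copied.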
