Here is the link to the docs: http://lucene.apache.org/nutch/apidocs/index.html

You would then need to maintain a filter of 'pruned' URLs so that they
are ignored if they are discovered again.  This list can get quite
large, but I don't know of another way to do it.  It would be nice if
we could hack the crawldb (or the webdb, which I believe is what your
version uses) to carry a 'good/bad' flag for each URL.
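
To make the 'pruned' list idea concrete, here is a rough sketch of what
such a filter could look like.  It is only an illustration: the class
name PrunedUrlFilter and its normalize() helper are made up for this
example, and it does not implement the real Nutch plugin interface.  It
only borrows the convention, as I understand it, that a URL filter
returns null to reject a URL and returns the URL itself to accept it.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical sketch of a 'pruned URL' filter. Not a real Nutch class.
 * Assumption: returning null means "ignore this URL", returning the URL
 * means "keep it".
 */
final class PrunedUrlFilter {
    // In-memory set of pruned URLs; a large list may need a disk-backed store.
    private final Set<String> pruned = new HashSet<>();

    PrunedUrlFilter(Iterable<String> prunedUrls) {
        for (String url : prunedUrls) {
            pruned.add(normalize(url));
        }
    }

    /** Returns null if the URL was pruned, otherwise the URL unchanged. */
    String filter(String url) {
        return pruned.contains(normalize(url)) ? null : url;
    }

    // Minimal normalization (an assumption): trim whitespace and drop one
    // trailing slash so that "http://host/a/" and "http://host/a" match.
    private static String normalize(String url) {
        String trimmed = url.trim();
        return trimmed.endsWith("/")
                ? trimmed.substring(0, trimmed.length() - 1)
                : trimmed;
    }
}
```

The set would be loaded from wherever the pruned URLs are recorded (for
example, a text file with one URL per line), and filter() would be
called on every newly discovered URL before it is added to the db.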


On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote:
Isn't this what you are looking for?

org.apache.nutch.tools.PruneIndexTool.



On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote:
>
> Hi Enis,
> This is Franklin. I am currently using Nutch 0.7.2 for crawling and
> indexing for my search engine.
> I read in your message that you can delete particular pages from the
> index directly. If so, how is that possible? I have been searching for
> a clue on how to do this.
> My requirement is to delete porn sites from the index of my crawled data.
> Your help would be much appreciated.
>
> Thanks in advance,
> Franklin.S
>
>
> ogjunk-nutch wrote:
> >
> > Hi Enis,
> >
> > Right, I can easily delete the page from the Lucene index, though I'd
> > prefer to follow the Nutch protocol and avoid messing something up by
> > touching the index directly.  However, I don't want that page to re-appear
> > in one of the subsequent fetches.  Well, it won't re-appear, because it
> > will remain missing, but it would be great to be able to tell Nutch to
> > "forget it" "from everywhere".  Is that doable?
> > I could read and re-write the *Db Maps, but that's a lot of IO... just to
> > get a couple of URLs erased.  I'd prefer a friendly persuasion where Nutch
> > flags a given page as "forget this page as soon as possible" and it just
> > happens later on.
> >
> > Thanks,
> > Otis
> >  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> >
> > ----- Original Message ----
> > From: Enis Soztutar <[EMAIL PROTECTED]>
> > To: [email protected]
> > Sent: Thursday, April 5, 2007 3:29:55 AM
> > Subject: Re: [Nutch-general] Removing pages from index immediately
> >
> > Since Hadoop's map files are write-once, it is not possible to delete
> > individual URLs from the crawldb and linkdb in place. The only thing you
> > can do is to create the map files once again without the deleted URLs.
> > But running the crawl once more, as you suggested, seems more appropriate.
> > Deleting documents from the index is just Lucene stuff.
> >
> > In your case it seems that every once in a while you crawl the whole
> > site, create the indexes and dbs, and then just throw the old ones
> > out. Between two crawls you can delete the URLs from the index.
> >
> > [EMAIL PROTECTED] wrote:
> >> Hi,
> >>
> >> I'd like to be able to immediately remove certain pages from Nutch
> >> (index, crawldb, linkdb...).
> >> The scenario is that I'm using Nutch to index a single site or a set of
> >> internal sites.  Once in a while editors of the site remove a page from
> >> the site.  When that happens, I want to update at least the index and
> >> ideally crawldb, linkdb, so that people searching the index don't get the
> >> missing page in results and end up going there, hitting the 404.
> >>
> >> I don't think there is a "direct" way to do this with Nutch, is there?
> >> If there really is no direct way to do this, I was thinking I'd just put
> >> the URL of the recently removed page into the very next fetchlist and
> >> then somehow get Nutch to immediately remove that page/URL once it hits a
> >> 404.  How does that sound?
> >>
> >> Is there a way to configure Nutch to delete the page after it gets a 404
> >> for it even just once?  I thought I saw the setting for that somewhere a
> >> few weeks ago, but now I can't find it.
> >>
> >> Thanks,
> >> Otis
> >>
> >>
> >>
> >>
> >
> >
>


--
"Conscious decisions by conscious minds are what make reality real"
