Well, it looks like the link I sent you goes to the 0.9 version of the
Nutch API docs.  There is a link error on the Nutch project site: the
0.7.2 doc link points to the 0.9 docs.



On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote:
> Here is the link to the docs: 
> http://lucene.apache.org/nutch/apidocs/index.html
>
> You would then need to create a filter of 'pruned' urls to ignore if
> they are discovered again.  This list can get quite large, but I
> really don't know how else to do it.  It would be cool if we could
> hack the crawldb (or webdb I believe in your version) to include a
> flag of 'good/bad' or something.
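
The 'pruned' URL filter described above could look roughly like the following. This is a minimal stand-alone sketch, not Nutch code: the class name PrunedUrlFilter and its normalize helper are hypothetical, and it only imitates the "return null to reject a URL" convention rather than implementing any Nutch interface. It also assumes the pruned list fits in memory, which may not hold if the list gets as large as feared.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-alone class (not part of Nutch): remembers pruned URLs
// and rejects them if they are discovered again.
class PrunedUrlFilter {
    private final Set<String> pruned = new HashSet<>();

    PrunedUrlFilter(Iterable<String> prunedUrls) {
        for (String url : prunedUrls) {
            pruned.add(normalize(url));
        }
    }

    // Very rough normalization (an assumption for this sketch): trim
    // whitespace and drop one trailing slash so "http://a/b" and
    // "http://a/b/" compare equal.
    static String normalize(String url) {
        String u = url.trim();
        if (u.endsWith("/")) {
            u = u.substring(0, u.length() - 1);
        }
        return u;
    }

    // Returns the URL unchanged if it may be crawled, or null if it was pruned.
    String filter(String url) {
        return pruned.contains(normalize(url)) ? null : url;
    }
}

class PrunedUrlFilterDemo {
    public static void main(String[] args) {
        PrunedUrlFilter f = new PrunedUrlFilter(
                Arrays.asList("http://example.com/removed.html"));
        // A pruned URL is rejected (null); any other URL passes through.
        System.out.println(f.filter("http://example.com/removed.html"));
        System.out.println(f.filter("http://example.com/index.html"));
    }
}
```

A hash set keeps each lookup cheap even when the list is long; if the list outgrows memory, it would have to be swapped for an on-disk structure.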
>
>
> On 4/27/07, Briggs <[EMAIL PROTECTED]> wrote:
> > Isn't this what you are looking for?
> >
> > org.apache.nutch.tools.PruneIndexTool.
> >
> >
> >
> > On 4/27/07, franklinb4u <[EMAIL PROTECTED]> wrote:
> > >
> > > > Hi Enis,
> > > > This is Franklin. I am currently using Nutch 0.7.2 for crawling and
> > > > indexing for my search engine.
> > > > I read in your message that you can delete a particular page from the
> > > > index directly. If so, how is that possible? I have been searching hard
> > > > for a clue on how to do this.
> > > > My requirement is to delete the porn sites' entries from the index of
> > > > my crawled data.
> > > > Your help would be much appreciated.
> > > >
> > > > Thanks in advance,
> > > Franklin.S
> > >
> > >
> > > ogjunk-nutch wrote:
> > > >
> > > > Hi Enis,
> > > >
> > > > > Right, I can easily delete the page from the Lucene index, though I'd
> > > > > prefer to follow the Nutch protocol and avoid messing something up by
> > > > > touching the index directly.  However, I don't want that page to
> > > > > re-appear in one of the subsequent fetches.  Well, it won't re-appear,
> > > > > because it will remain missing, but it would be great to be able to
> > > > > tell Nutch to "forget it" "from everywhere".  Is that doable?
> > > > > I could read and re-write the *Db Maps, but that's a lot of IO... just
> > > > > to get a couple of URLs erased.  I'd prefer a friendly persuasion where
> > > > > Nutch flags a given page as "forget this page as soon as possible" and
> > > > > it just happens later on.
> > > >
> > > > Thanks,
> > > > Otis
> > > >  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > > > Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> > > >
> > > > ----- Original Message ----
> > > > From: Enis Soztutar <[EMAIL PROTECTED]>
> > > > To: [EMAIL PROTECTED]
> > > > Sent: Thursday, April 5, 2007 3:29:55 AM
> > > > Subject: Re: [Nutch-general] Removing pages from index immediately
> > > >
> > > > > Since Hadoop's map files are write once, it is not possible to delete
> > > > > some URLs from the crawldb and linkdb. The only thing you can do is to
> > > > > create the map files once again without the deleted URLs. But running
> > > > > the crawl once more as you suggested seems more appropriate. Deleting
> > > > > documents from the index is just Lucene stuff.
> > > > >
> > > > > In your case it seems that every once in a while, you crawl the whole
> > > > > site, and create the indexes and dbs and then just throw the old ones
> > > > > out. And between two crawls you can delete the URLs from the index.
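
The "create the map files once again without the deleted URLs" step can be pictured as a copy-and-skip pass over a sorted, read-only store. The sketch below is only an illustration under stated assumptions: an in-memory SortedMap stands in for an on-disk map file, the class and method names (CrawlDbRewrite, rewriteWithout) are hypothetical, and the String values are placeholders for whatever per-URL record the real db holds. It does not use any Hadoop or Nutch API.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

// Hypothetical illustration (no Hadoop/Nutch classes involved): a SortedMap
// stands in for a write-once, key-sorted map file.
class CrawlDbRewrite {
    // Copies every entry of the old store into a fresh one, in key order,
    // skipping keys listed in deletedUrls. The old store is never modified,
    // mirroring the write-once constraint.
    static SortedMap<String, String> rewriteWithout(
            SortedMap<String, String> oldDb, Set<String> deletedUrls) {
        SortedMap<String, String> newDb = new TreeMap<>();
        for (Map.Entry<String, String> entry : oldDb.entrySet()) {
            if (!deletedUrls.contains(entry.getKey())) {
                newDb.put(entry.getKey(), entry.getValue());
            }
        }
        return newDb;
    }
}
```

The cost is a full read and a full write of the store no matter how few URLs are dropped, which is the IO overhead raised earlier in the thread.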
> > > >
> > > > [EMAIL PROTECTED] wrote:
> > > >> Hi,
> > > >>
> > > >> I'd like to be able to immediately remove certain pages from Nutch
> > > >> (index, crawldb, linkdb...).
> > > > >> The scenario is that I'm using Nutch to index a single site or a set of
> > > > >> internal sites.  Once in a while editors of the site remove a page from
> > > > >> the site.  When that happens, I want to update at least the index and
> > > > >> ideally crawldb, linkdb, so that people searching the index don't get
> > > > >> the missing page in results and end up going there, hitting the 404.
> > > > >>
> > > > >> I don't think there is a "direct" way to do this with Nutch, is there?
> > > > >> If there really is no direct way to do this, I was thinking I'd just
> > > > >> put the URL of the recently removed page into the first next fetchlist
> > > > >> and then somehow get Nutch to immediately remove that page/URL once it
> > > > >> hits a 404.  How does that sound?
> > > > >>
> > > > >> Is there a way to configure Nutch to delete the page after it gets a
> > > > >> 404 for it even just once?  I thought I saw the setting for that
> > > > >> somewhere a few weeks ago, but now I can't find it.
> > > >>
> > > >> Thanks,
> > > >> Otis
> > > >>  . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .
> > > >> Simpy -- http://www.simpy.com/  -  Tag  -  Search  -  Share
> > > >>
> > > >>
> > > >>
> > > >>
> > > >
> > > >
> > > > -------------------------------------------------------------------------
> > > > Take Surveys. Earn Cash. Influence the Future of IT
> > > > Join SourceForge.net's Techsay panel and you'll get the chance to share
> > > > your
> > > > opinions on IT & business topics through brief surveys-and earn cash
> > > > http://www.techsay.com/default.php?page=join.php&p=sourceforge&CID=DEVDEV
> > > > _______________________________________________
> > > > Nutch-general mailing list
> > > > [email protected]
> > > > https://lists.sourceforge.net/lists/listinfo/nutch-general
> > > >
> > > >
> > > >
> > > >
> > > >
> > >
> > > --
> > > View this message in context: 
> > > http://www.nabble.com/Re%3A--Nutch-general--Removing-pages-from-index-immediately-tf3530204.html#a10218273
> > > Sent from the Nutch - User mailing list archive at Nabble.com.
> > >
> > >
> >
> >
> > --
> > "Conscious decisions by conscious minds are what make reality real"
> >
>
>
> --
> "Conscious decisions by conscious minds are what make reality real"
>


-- 
"Conscious decisions by conscious minds are what make reality real"
