Hi, I am trying to index a website. That website has <meta name='ROBOTS' content='NOINDEX, NOFOLLOW'> in its HTML pages.
If they want to remove this tag, they will have to remove it from all their pages, and they don't want to regenerate those pages from the database. I have already crawled this website. Is there any way I can make Nutch ignore the tag above and index the pages?

One way I can think of is:
a) Retrieve the HTML from the segments
b) Remove that line
c) Write it back
d) Re-index

Does anyone have a better solution? Can I use PruneIndexTool? If the above is the way to go, how do I do it? I mean, what commands do I need to issue, and which classes do I need to call and modify?

Any help is appreciated.

Thanks.
Karthik
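
To make step (b) concrete, here is a rough sketch of the tag removal on its own, written in plain Python rather than against any Nutch class. The function name strip_robots_meta and the sample page are made up for illustration, and the sketch does not read from or write to Nutch segments; it only shows the string edit I have in mind.

```python
import re

# Matches a <meta ...> tag whose name attribute is "robots",
# in any letter case and with single, double, or no quotes.
ROBOTS_META = re.compile(
    r"""<meta\b[^>]*\bname\s*=\s*["']?robots\b["']?[^>]*>""",
    re.IGNORECASE,
)


def strip_robots_meta(html: str) -> str:
    """Return the HTML with every robots meta tag removed (hypothetical helper)."""
    return ROBOTS_META.sub("", html)


# Sample page invented for illustration; real content would come from the crawl.
page = (
    "<html><head>"
    "<meta name='ROBOTS' content='NOINDEX, NOFOLLOW'>"
    "<title>t</title>"
    "</head></html>"
)
cleaned = strip_robots_meta(page)
```

Other meta tags, such as a description tag, are left alone because the pattern only matches when the name attribute is robots.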
