Nutch respects robots.txt files. See http://www.robotstxt.org/wc/robots.html
On 27.12.2005 at 15:59, Jeff Breidenbach wrote:
Hi all,

Another open source search engine, HtDig, allows web page authors to mark up a page such that some sections are not indexed. The syntax looks like the following:

<!--htdig_noindex-->
... material inside is not indexed ...
<!--/htdig_noindex-->

Does a similar feature exist in Nutch? If the answer is "write a plugin", does anyone have tips on where to start? Also, how hard is something like this for a Nutch newbie who doesn't know anything about HTML parsing?

I have a bunch of documents already marked up with the htdig syntax, and in the interests of interoperability I'm tempted to follow the syntax exactly.

-Jeff
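
For illustration only, the filtering step described above can be sketched in a few lines of Python. This is not a Nutch plugin and uses no Nutch API; it only shows the text transformation such a plugin would presumably apply to the raw HTML before parsing. The helper name `strip_noindex` is hypothetical, and the sketch assumes the markers appear exactly as written in the message (no extra whitespace, lowercase) and that an unclosed opening marker is left untouched, which may differ from what HtDig itself does.

```python
import re

# Matches one excluded region, including both marker comments.
# re.DOTALL lets the region span several lines; the non-greedy ".*?"
# stops at the first closing marker, so separate regions stay separate.
NOINDEX_RE = re.compile(
    r"<!--htdig_noindex-->.*?<!--/htdig_noindex-->",
    re.DOTALL,
)


def strip_noindex(html: str) -> str:
    """Return html with every htdig_noindex region removed.

    Hypothetical helper for illustration. An opening marker with no
    matching closing marker is left in place (an assumption).
    """
    return NOINDEX_RE.sub("", html)


page = (
    "<p>keep</p>"
    "<!--htdig_noindex--><p>skip</p><!--/htdig_noindex-->"
    "<p>also keep</p>"
)
print(strip_noindex(page))
```

Removing the regions before any HTML parsing keeps the sketch independent of the parser, at the cost of treating the page as plain text.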
