It's sometimes hard to define what a blog is, but you could check for the presence of an RSS/Atom feed:

<link rel="alternate" type="application/rss+xml" title="RSS"

or just check for common signatures of blogging software. I imagine it would require some type of custom parser.

----- Original Message -----
From: "Armando Gonçalves" <[email protected]>
To: [email protected]
Sent: Wednesday, February 4, 2009 9:02:24 PM GMT -08:00 US/Canada Pacific
Subject: Fetch only Blogs.
Can anyone tell me if there is a way to make Nutch fetch only blogs during the crawl process? My current application uses a whitelist of domains; is there a better approach?

--
Armando Gonçalves
C.C 2005-2
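
The two checks suggested in the reply (a feed autodiscovery link, or a blogging-software signature) could be sketched roughly as follows. This is a standalone Python illustration, not Nutch code; in Nutch the same logic would presumably have to live in a custom plugin. The names `BlogSignalParser`, `looks_like_blog`, and the `BLOG_GENERATORS` list are hypothetical, and the generator list is an illustrative assumption, not an exhaustive one.

```python
from html.parser import HTMLParser

# MIME types commonly used for feed autodiscovery links.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}

# Assumption: a small, non-exhaustive sample of generator signatures.
BLOG_GENERATORS = ("wordpress", "blogger", "movable type", "typepad")


class BlogSignalParser(HTMLParser):
    """Collects two weak blog signals: a feed link and a generator meta tag."""

    def __init__(self):
        super().__init__()
        self.has_feed = False
        self.generator = ""

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attrs = {name: (value or "") for name, value in attrs}
        if tag == "link":
            rels = attrs.get("rel", "").lower().split()
            if "alternate" in rels and attrs.get("type", "").lower() in FEED_TYPES:
                self.has_feed = True
        elif tag == "meta" and attrs.get("name", "").lower() == "generator":
            self.generator = attrs.get("content", "")


def looks_like_blog(html):
    """Return True if the page advertises a feed or a known blog generator."""
    parser = BlogSignalParser()
    parser.feed(html)
    generator = parser.generator.lower()
    return parser.has_feed or any(name in generator for name in BLOG_GENERATORS)
```

Keep in mind that many non-blog sites (news sites, for example) also advertise feeds, so this is only a heuristic and may still need to be combined with a whitelist.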
