On 9 December 2013 19:37, Debamitro Chakraborti <[email protected]> wrote:
> Any way to crawl the back issues of prominent Indian newspapers like The
> Hindu, TOI, Indian Express, Hindustan Times etc?
> I was part of a team which needed to analyse news reports from a time frame
> and we hacked together a TOI crawler (which still has limitations) and were
> working on a The Hindu crawler -- would love to know about something simpler
> that is already available.

The newsrack.in recommendation is a good one, but Newsrack is intended as much more than a simple crawler. If a crawler is what you need, you should look into something like Nutch ( http://nutch.apache.org/ ). If you prefer to write your own for simple, non-generic needs, we have happily used Scrapy in the Python world ( http://scrapy.org/ ).

Regards,
Gora
