Hi Gora,

The problem with using Scrapy (or simply BeautifulSoup) is that some newspapers generate their content dynamically with JavaScript, so the article text is not in the HTML that these tools download. The possible solutions we found were PhantomJS and Goose (a Python library). If Nutch can handle content generated through JavaScript (which it doesn't appear to), then we'll use it.
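
To illustrate the issue, here is a rough sketch of how one might guess whether a page needs a JavaScript-capable fetcher before choosing a tool. It uses only the Python standard library and counts the text that sits outside script and style tags in the raw HTML. The names `VisibleTextCounter` and `looks_js_rendered`, and the 200-character threshold, are made up for this example and are not part of Scrapy, BeautifulSoup, Goose or Nutch.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts characters of text found outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0      # > 0 while inside <script> or <style>
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_js_rendered(html, min_chars=200):
    """Heuristic (an assumption, not a rule): very little visible text in the
    raw HTML suggests the content is filled in later by JavaScript."""
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()  # flush any buffered text
    return counter.visible_chars < min_chars
```

A page that passes this check can probably be handled by a plain HTML scraper; one that fails it is a candidate for a headless browser such as PhantomJS. The threshold would need tuning per newspaper site.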

Debamitro

On Tuesday, 10 December 2013 11:41:03 UTC+5:30, Gora Mohanty wrote:
> On 9 December 2013 19:37, Debamitro Chakraborti wrote:
> > Any way to crawl the back issues of prominent Indian newspapers like The
> > Hindu, TOI, Indian Express, Hindustan Times etc?
> > I was part of a team which needed to analyse news reports from a time frame
> > and we hacked together a TOI crawler (which still has limitations) and were
> > working on a The Hindu crawler -- would love to know about something simpler
> > that is already available.
>
> The newsrack.in recommendation is a good one, but Newsrack
> is intended as much more than a simple crawler. If a crawler is
> what you need, you should look into something like Nutch
> ( http://nutch.apache.org/ ). If you prefer to write your own for
> simple, non-generic needs, we have happily used Scrapy in
> the Python world ( http://scrapy.org/ ).
>
> Regards,
> Gora
