Re: [datameet] how to crawl Indian newspaper sites

Debamitro Chakraborti Tue, 10 Dec 2013 01:57:55 -0800

Hi Gora,

The problem with using Scrapy (or simply BeautifulSoup) is that some 
newspapers generate content dynamically, using JavaScript. The possible 
solutions we found were PhantomJS or Goose (a Python library). If 
Nutch can handle content generated through JavaScript (which it doesn't 
appear to), then we'll use it.
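
To illustrate the problem, here is a minimal sketch using only the Python
standard library. It extracts the text a static parser can see and flags
pages where almost nothing is visible until a script runs. The names
(VisibleTextParser, visible_text, looks_js_rendered), the two sample pages,
and the 200-character threshold are all made up for illustration; this is a
rough heuristic, not how Scrapy, Nutch, or any newspaper site actually works.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text a static parser can see, skipping <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-blank text that is outside script/style elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html):
    """Return the text present in the raw HTML, without running any scripts."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def looks_js_rendered(html, min_chars=200):
    """Heuristic (assumed threshold): very little static text suggests the
    article body is filled in by JavaScript after the page loads."""
    return len(visible_text(html)) < min_chars


# Hypothetical page whose article text is already in the HTML.
STATIC_PAGE = (
    "<html><body><div id='story'><p>"
    + "Parliament passed the bill today. " * 10
    + "</p></div></body></html>"
)

# Hypothetical page whose article text only appears once the script runs.
DYNAMIC_PAGE = (
    "<html><body><div id='story'></div>"
    "<script>document.getElementById('story').textContent = "
    "'Parliament passed the bill today.';</script>"
    "</body></html>"
)

if __name__ == "__main__":
    print(looks_js_rendered(STATIC_PAGE))
    print(looks_js_rendered(DYNAMIC_PAGE))
```

A page flagged this way would need something that actually executes the
scripts (a headless browser such as PhantomJS) before the text can be
extracted; a page that passes can go straight to a plain HTML parser.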


Debamitro
On Tuesday, 10 December 2013 11:41:03 UTC+5:30, Gora Mohanty wrote:
>
> On 9 December 2013 19:37, Debamitro Chakraborti 
> <[email protected]> wrote: 
> > Any way to crawl the back issues of prominent Indian newspapers like The 
> > Hindu, TOI, Indian Express, Hindustan Times etc? 
> > I was part of a team which needed to analyse news reports from a time frame 
> > and we hacked together a TOI crawler (which still has limitations) and were 
> > working on a The Hindu crawler -- would love to know about something simpler 
> > that is already available. 
>
> The newsrack.in recommendation is a good one, but Newsrack 
> is intended as much more than a simple crawler. If a crawler is 
> what you need, you should look into something like Nutch 
> ( http://nutch.apache.org/ ). If you prefer to write your own for 
> simple, non-generic needs, we have happily used Scrapy in 
> the Python world ( http://scrapy.org/ ). 
>
> Regards, 
> Gora 
>

