Hi,

That problem is common to whatever crawler you use. Nutch will extract
links from JavaScript, but it will not execute the scripts. I would use
Rhino (Mozilla's JavaScript engine for Java) and a custom HtmlParser
plugin for Nutch. This is admittedly non-trivial, but I know of no
open source tool that already does this.

Regards,
Gora
Hi Gora,

The problem with using Scrapy (or simply BeautifulSoup) is that some
newspapers generate content dynamically, using JavaScript. The possible
solutions we found were PhantomJS and Goose (a Python library). If
Nutch can handle content generated through JavaScript (which it doesn't
appear to), then we'll use it.
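
To make the limitation concrete, here is a minimal sketch of what a
static parser sees on a page whose article text is filled in by a
script. The sample page, the class name VisibleTextParser and the
function static_text are all made up for illustration; it uses only the
Python standard library, not Scrapy or BeautifulSoup, though a static
parser of either kind would see the same served markup.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collect text outside <script>/<style>, as a static scraper would."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not script or style source.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def static_text(html):
    """Return the text present in the markup as served (no JS is run)."""
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page: the story body only exists after the script runs.
SAMPLE_PAGE = """<html><body>
<h1>Archive</h1>
<div id="story"></div>
<script>
document.getElementById("story").textContent = "Full article body";
</script>
</body></html>"""

print(static_text(SAMPLE_PAGE))
```

The heading comes through, but the story text does not, because nothing
has executed the script. That is the gap a JavaScript engine or a
headless browser would have to fill.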

Debamitro
On Tuesday, 10 December 2013 11:41:03 UTC+5:30, Gora Mohanty wrote:
>
> On 9 December 2013 19:37, Debamitro Chakraborti <[email protected]>
> wrote:
> > Any way to crawl the back issues of prominent Indian newspapers like The
> > Hindu, TOI, Indian Express, Hindustan Times etc?
> > I was part of a team which needed to analyse news reports from a time frame
> > and we hacked together a TOI crawler (which still has limitations) and were
> > working on a The Hindu crawler -- would love to know about something simpler
> > that is already available.
>
> The newsrack.in recommendation is a good one, but Newsrack
> is intended as much more than a simple crawler. If a crawler is
> what you need, you should look into something like Nutch
> ( http://nutch.apache.org/ ). If you prefer to write your own for
> simple, non-generic needs, we have happily used Scrapy in
> the Python world ( http://scrapy.org/ ).
>
> Regards,
> Gora
>
 --
For more details about this list
http://datameet.org/discussions/
---
You received this message because you are subscribed to the Google Groups
"datameet" group.
To unsubscribe from this group and stop receiving emails from it, send an
email to [email protected].
For more options, visit https://groups.google.com/groups/opt_out.
