Hi Arvind,

Yes, I am interested. What we need is historical data, so we need to crawl 
archives. Unfortunately, not all Indian newspapers have good archive pages. 
The Hindu has systematic URLs for its archive, but it generates the links 
for each day dynamically using JavaScript. TOI has static archive pages 
and is the easiest to crawl. We couldn't find archives for HT. Our 
application needs the dates of the reports, so it is best to crawl date-wise. 
I'll contact you separately.
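
As a rough illustration of date-wise crawling (a minimal sketch, not our 
actual crawler), the snippet below only generates one archive URL per day 
for a date range. The URL template `ARCHIVE_TEMPLATE` and the helper names 
`daterange` and `archive_urls` are assumptions made up for illustration; 
each newspaper's real archive layout would have to be checked by hand, and 
fetching and parsing the pages is left out.

```python
from datetime import date, timedelta

# Hypothetical URL pattern, for illustration only. No real newspaper's
# archive is claimed to follow this layout.
ARCHIVE_TEMPLATE = "https://example.com/archive/{year}/{month:02d}/{day:02d}/"


def daterange(start, end):
    """Yield every date from start to end, inclusive."""
    for offset in range((end - start).days + 1):
        yield start + timedelta(days=offset)


def archive_urls(start, end, template=ARCHIVE_TEMPLATE):
    """Pair each date with the archive URL built from the template."""
    for day in daterange(start, end):
        yield day, template.format(year=day.year, month=day.month, day=day.day)


if __name__ == "__main__":
    # Each report found on a page can be stamped with the page's date.
    for day, url in archive_urls(date(2013, 12, 1), date(2013, 12, 3)):
        print(day.isoformat(), url)
```

Keeping the date next to each URL means every report scraped from that page 
can be tagged with its publication date without parsing it out of the article.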

Debamitro

On Monday, 9 December 2013 23:36:34 UTC+5:30, Arvind Batra wrote:
>
> Hi Debamitro,
>
> A couple of months ago, a few of my friends and I built a media monitoring 
> tool to track what traditional media was writing about the Aam Aadmi Party. 
> Our work can be seen here: http://aap.mediatrack.in 
>
> As part of the process, we wrote a crawler that crawls The Hindu, TOI, HT 
> and three other sources. To keep the scale manageable, we crawl to a depth 
> of 2, starting from the daily sitemap page of each of these news sites. I 
> can share our crawler code. We also have the last two months of crawl data 
> from these sources, and we will be happy to share that as well. 
>
> Please do let me know if you are interested. 
>
>
> Thanks,
> arvind
>
> On Mon, Dec 9, 2013 at 11:24 PM, Debamitro Chakraborti 
> <[email protected]> wrote:
>
>> I know of newsrack (in fact I created the NREGA topic on the site long 
>> long ago) but what I am looking for is a crawler of past records which I 
>> can use for my own research. Maybe the code behind newsrack can be reused 
>> to build such a crawler -- but I didn't see it anywhere on the site.
>> Anyway, thanks.
>>
>> Debamitro
>>
>>
>> On Monday, 9 December 2013 19:48:32 UTC+5:30, Meera K wrote:
>>
>>> See if newsrack.in fits your needs; it uses RSS feeds, though. But it 
>>> allows programming, so it is more powerful than Google News.
>>>
>>>  Regds, Meera
>>> ~ Bangalore's own interactive newsmagazine at www.citizenmatters.in ~
>>>
>>>
>>> On Mon, Dec 9, 2013 at 7:37 PM, Debamitro Chakraborti <[email protected]> 
>>> wrote:
>>>
>>>> Is there any way to crawl the back issues of prominent Indian newspapers 
>>>> like The Hindu, TOI, Indian Express, Hindustan Times, etc.?
>>>> I was part of a team which needed to analyse news reports from a given 
>>>> time frame. We hacked together a TOI crawler (which still has limitations) 
>>>> and were working on a crawler for The Hindu. I would love to know about 
>>>> something simpler that is already available.
>>>>
>>>> Debamitro
>>>>
>>>> -- 
>>>> For more details about this list
>>>> http://datameet.org/discussions/
>>>> --- 
>>>> You received this message because you are subscribed to the Google 
>>>> Groups "datameet" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send 
>>>> an email to [email protected].
>>>>
>>>> For more options, visit https://groups.google.com/groups/opt_out.
>>>>
>>>
>>
>
>
