Hi friends, I've created some shell scripts to aggregate the data from downloaded 7/12 records (HTML files) into two CSVs. Sharing a GitHub link with the code and instructions: https://github.com/answerquest/mahabhulekh-7-12-aggregating
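For anyone curious what the html-table-to-csv step looks like: the repo above uses shell scripts plus the StackOverflow Python script linked below, but here's a minimal, stdlib-only illustrative sketch of the same idea (not the repo's actual code):

```python
import csv
import io
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Collect the cells of every <table> in an HTML document into rows."""
    def __init__(self):
        super().__init__()
        self.rows = []       # completed rows
        self._row = []       # cells of the row currently being read
        self._cell = []      # text fragments of the current cell
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell, self._cell = True, []
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

def html_tables_to_csv(html_text):
    """Return the table cells of html_text rendered as CSV text."""
    parser = TableToCSV()
    parser.feed(html_text)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()
```

Usage: run `html_tables_to_csv(open("record.html").read())` per downloaded record and concatenate the results. Real 7/12 pages have multiple tables and merged cells, so a production version needs more care than this.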
Still no luck with automated scraping from the site, but this aggregation was the next step and has really simplified the process of inspecting multiple records at once.

-Nikhil

On 10/27/16, Nikhil VJ <nikhil...@gmail.com> wrote:
> Hi Ankit,
>
> Thanks for the R lead! I checked it out.. I'm already doing something
> like it using some quick shell/bash commands and a python script that
> converts any html table to csv (http://stackoverflow.com/a/16697784).
> Once we have the data down in HTMLs it's fairly straightforward. This
> part comes after the scraping.
>
> The data in this case is not in permanent HTMLs that we can just save
> in batch. It's being generated server-side on the Mahabhulekh server
> depending on form inputs in an authenticated user session, and then
> rendered as HTML at one constant URL. So what I'm looking for is
> something that would simulate / automate the calls to the Mahabhulekh
> server (with due time intervals between each call, of course; we must
> not overload the server) and capture the output it returns.
>
> So far I'm not able to programmatically capture the HTML coming in the
> popup window it generates. The POST request returns a generic null
> response or the site's main webpage in all the wget and curl commands
> I've tried. Folks who have done some scraping earlier might be able to
> help.
>
> Another track worth exploring might be iMacros or other ways to
> automate browser sessions. Folks working in the testing departments of
> ticketing / booking sites etc. might know and could help, so please
> share this with your friends working on such projects!
>
> I've read in some places that R can be used to simulate this.. so yes,
> it'll be worth continuing to explore, but I know shell scripting
> better, so I'm hoping something comes of that.
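The "due time intervals between each call" point generalizes to any scraper. A minimal sketch of that pacing in Python; `fetch` here is a placeholder for whatever call eventually works against the site, and the 5-second default is an arbitrary guess, not a number from Mahabhulekh:

```python
import time

def fetch_all(items, fetch, delay_seconds=5.0):
    """Call fetch(item) for each item, sleeping between calls so we
    don't overload the server. Returns the list of results."""
    results = []
    for i, item in enumerate(items):
        if i > 0:
            time.sleep(delay_seconds)  # polite gap between requests
        results.append(fetch(item))
    return results
```

In practice you'd also want to save each result to disk as it arrives and to back off or stop on errors, so a failed run doesn't hammer the server.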
>
> --
> Cheers,
> Nikhil
> +91-966-583-1250
> Pune, India
> Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
> Blog <http://nikhilsheth.blogspot.in> | Contribute <https://www.payumoney.com/webfronts/#/index/NikhilVJ>

On 10/25/16, Ankit Gaur <gauran...@gmail.com> wrote:
>> Though I am not very well versed in data science and web scraping, we
>> had a recent DataKind meetup in Bangalore
>> (https://www.meetup.com/DataKind-Bangalore/events/234855978/), where
>> Bargava talked about using R's rvest library
>> <https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/>.
>> We were able to do some basic scraping on Goodreads with this. See if
>> this fits your needs.
>>
>> Thanks,
>> Ankit
>>
>> On Mon, Oct 24, 2016 at 10:09 PM, Nikhil VJ <nikhil...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm looking at Maharashtra's land records portal:
>>> https://mahabhulekh.maharashtra.gov.in
>>>
>>> .. and wondering if it's possible to scrape data from here?
>>>
>>> Here's the workflow:
>>> choose 7/12 (७/१२) > select any जिल्हा (district) > तालुका (taluka) > गाव (village)
>>> select शोध (search): सर्वे नंबर / गट नंबर (survey number / gat number, the first option)
>>> type 1 in the text box and press the "शोधा" (search) button
>>> Then we get a dropdown with options like 1/1, 1/2, 1/3 etc.
>>>
>>> On selecting any option and clicking "७/१२ पहा" (view 7/12), a new
>>> window/tab opens up (you have to enable popups), having static HTML
>>> content (some tables). I need to capture this content.
>>>
>>> The URL is always the same:
>>> https://mahabhulekh.maharashtra.gov.in/Konkan/pg712.aspx
>>> ..but the content changes depending on the options chosen.
>>>
>>> Using the browser's "Inspect Element" > Network tab and clicking the
>>> final button, there is a request to this URL:
>>>
>>> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712
>>>
>>> and the request params / payload looks like:
>>>
>>> {'sno':'1','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>>
>>> When you change the survey/gat number to 1/10, the params change like so:
>>> {'sno':'1#10','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>>
>>> For 1/1अ:
>>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>>
>>> I tried some wget and curl commands but no luck so far. Do let me know
>>> if you can make some headway.
>>>
>>> Also, it would be great to learn how to extract the list of districts,
>>> the talukas (subdistricts) in each district, and the villages in each
>>> taluka.
>>>
>>> I'm dumping other info at the bottom in case it helps.
>>>
>>> Why do this:
>>> At present it's just an exploration following on from our work on
>>> village shapefiles. The district > taluka > village mapping from
>>> official land records data could serve as a good source for
>>> triangulation. Then, while I don't see myself going deeper into this
>>> right now, I am aware that land records / ownership have major
>>> corruption, entanglement and other issues precisely because of the
>>> lack of transparency. The Mahabhulekh website itself is a significant
>>> step forward in making this sector a little more transparent, and more
>>> push in this direction would probably do more good, IMHO. At some
>>> point GIS / lat-long info might come in, and it would be good to bring
>>> the data to a level that is ready for it.
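One untested guess about the null responses: the page method may only answer when the POST carries a live ASP.NET session cookie, i.e. you may need to GET Home.aspx first and reuse its cookie. A Python sketch of that idea, using only the stdlib; the payload keys and the `/`-to-`#` encoding of sno come from the captures in this thread, while the session hypothesis, the double-quoted JSON (the dev-tools capture shows single quotes), and the assumption that one plain GET is enough to set up the session are all unverified:

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE = "https://mahabhulekh.maharashtra.gov.in/Konkan"

def make_payload(survey_no, vid, dn, tn, vn, tc, dc, did, tid):
    """Build the call712 payload. The site encodes sub-numbers like
    1/10 as 1#10 in the sno field, per the captured requests."""
    return json.dumps({
        "sno": survey_no.replace("/", "#"),
        "vid": vid, "dn": dn, "tn": tn, "vn": vn,
        "tc": tc, "dc": dc, "did": did, "tid": tid,
    }, ensure_ascii=False)

def fetch_712(payload):
    """POST the payload to call712, reusing the cookies (including
    ASP.NET_SessionId) set by a prior GET of Home.aspx."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.open(BASE + "/Home.aspx")  # establish the session cookie
    req = urllib.request.Request(
        BASE + "/Home.aspx/call712",
        data=payload.encode("utf-8"),
        headers={
            "Content-Type": "application/json;charset=utf-8",
            "Referer": BASE + "/Home.aspx",
        },
        method="POST",
    )
    with opener.open(req) as resp:
        return resp.read().decode("utf-8")
```

If the server ties the session to the district/taluka/village form selections made in the browser, a bare GET won't be enough and this will still return {"d":null}; in that case browser automation is probably the safer track.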
>>>
>>> Data dump:
>>> When we press the button to fetch the 7/12 (saatbarah) record, the
>>> console records a POST with these parameters:
>>>
>>> Copy as cURL:
>>> curl 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712' \
>>>   -H 'Host: mahabhulekh.maharashtra.gov.in' \
>>>   -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0' \
>>>   -H 'Accept: application/json, text/plain, */*' \
>>>   -H 'Accept-Language: en-US,en;q=0.5' \
>>>   --compressed \
>>>   -H 'Content-Type: application/json;charset=utf-8' \
>>>   -H 'Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx' \
>>>   -H 'Content-Length: 170' \
>>>   -H 'Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc' \
>>>   -H 'Connection: keep-alive' \
>>>   -H 'Pragma: no-cache' \
>>>   -H 'Cache-Control: no-cache'
>>>
>>> Copy POST data:
>>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>>
>>> Request headers:
>>> POST /Konkan/Home.aspx/call712 HTTP/1.1
>>> Host: mahabhulekh.maharashtra.gov.in
>>> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0
>>> Accept: application/json, text/plain, */*
>>> Accept-Language: en-US,en;q=0.5
>>> Accept-Encoding: gzip, deflate
>>> Content-Type: application/json;charset=utf-8
>>> Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx
>>> Content-Length: 170
>>> Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc
>>> Connection: keep-alive
>>> Pragma: no-cache
>>> Cache-Control: no-cache
>>>
>>> Response headers:
>>> HTTP/1.1 200 OK
>>> Cache-Control: private, max-age=0
>>> Content-Type: application/json; charset=utf-8
>>> Server: Microsoft-IIS/8.0
>>> X-Powered-By: ASP.NET
>>> Date: Mon, 24 Oct 2016 15:31:40 GMT
>>> Content-Length: 10
>>>
>>> Copy Response:
>>> {"d":null}
>>>
>>> --
>>> Datameet is a community of Data Science enthusiasts in India. Know more
>>> about us by visiting http://datameet.org
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "datameet" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to datameet+unsubscr...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.