I am not well versed in data science or web scraping, but at a recent
DataKind meetup in Bangalore
https://www.meetup.com/DataKind-Bangalore/events/234855978/
Bargava talked about using R's rvest library
<https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/>.
We were able to do some basic scraping of Goodreads with it. See if it
fits your needs.
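
In case it helps, here is a rough Python sketch of how I would try
replaying the request you captured. I have not run it against the site.
My guess (an assumption, not verified) is that call712 only stores the
selection in the server-side session, since it returns {"d":null}, and
that pg712.aspx then renders from that session. If so, wget/curl would
need to send the same ASP.NET_SessionId cookie for both requests. The
names to_sno, build_payload and fetch_712 are my own, and the code sends
standard double-quoted JSON whereas your capture shows single quotes, so
the body may need adjusting.

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE = "https://mahabhulekh.maharashtra.gov.in/Konkan"

# Field values copied from the payload in the quoted mail below.
VILLAGE = {"vid": "273200030398260000", "dn": "रत्नागिरी", "tn": "खेड",
           "vn": "वाळंजवाडी", "tc": "3", "dc": "32", "did": "32", "tid": "3"}


def to_sno(label):
    """Convert a dropdown label such as '1/10' to the 'sno' form '1#10'."""
    return label.strip().replace("/", "#")


def build_payload(label, village):
    """Return a copy of the village fields with the survey number added."""
    payload = dict(village)
    payload["sno"] = to_sno(label)
    return payload


def fetch_712(label, village):
    """Assumed flow: get a session cookie, POST the selection, fetch the page.

    Untested against the live site; the three-step order is a guess.
    """
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(CookieJar()))
    # Step 1: load the home page so the server issues a session cookie.
    opener.open(BASE + "/Home.aspx").read()
    # Step 2: POST the selection with the same cookie jar.
    body = json.dumps(build_payload(label, village), ensure_ascii=False)
    request = urllib.request.Request(
        BASE + "/Home.aspx/call712",
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json;charset=utf-8",
                 "Referer": BASE + "/Home.aspx"},
    )
    opener.open(request).read()
    # Step 3: fetch the record page, still within the same session.
    page = opener.open(BASE + "/pg712.aspx").read()
    return page.decode("utf-8", "replace")
```

Calling fetch_712("1/10", VILLAGE) should return the HTML of the record
page if the guess about the session is right; the tables could then be
pulled out with any HTML parser (rvest in R, or similar).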

Thanks,
Ankit

On Mon, Oct 24, 2016 at 10:09 PM, Nikhil VJ <[email protected]> wrote:

> Hi,
>
> I'm looking at Maharashtra's land records portal:
> https://mahabhulekh.maharashtra.gov.in
>
> and wondering whether it's possible to scrape data from it.
>
> Here is the workflow:
> choose 7/12 (७/१२) > select any जिल्हा (district) > तालुका (taluka) > गाव (village)
> select शोध (search): सर्वे नंबर / गट नंबर (survey number / gat number; the first option)
> type 1 in the text box and press the "शोधा" (search) button
> Then we get a dropdown with options like 1/1, 1/2, 1/3, etc.
>
> On selecting any option and clicking "७/१२ पहा" (view 7/12),
> a new window/tab opens (you have to enable popups) with static
> HTML content (some tables). I need to capture this content.
>
> The URL is always the same:
> https://mahabhulekh.maharashtra.gov.in/Konkan/pg712.aspx
> but the content changes depending on the options chosen.
>
> In the browser's "Inspect Element" > Network tab, clicking the
> final button shows a request to this URL:
>
> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712
>
> and the request params/payload look like this:
>
> {'sno':'1','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>
> when you change the survey/gat number to 1/10, the params change like so:
> {'sno':'1#10','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>
> for 1/1अ:
> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>
> I tried some wget and curl commands but no luck so far. Do let me know
> if you can make some headway.
>
> Also, it would be great to learn how to extract the list of
> districts, the talukas (subdistricts) in each district, and the
> villages in each taluka.
>
> Dumping other info at the bottom in case it helps.
>
> Why do this:
> At present it's just an exploration following on from our work on
> village shapefiles.
> The district > taluka > village mapping data from official Land
> Records data could serve as a good source for triangulation.
> Then, while I don't see myself going deeper into this right now, I am
> aware that land records / ownership has major corruption,
> entanglements and other issues precisely because of the lack of
> transparency. The mahabhulekh website itself is a significant step
> forward in making this sector a little more transparent, and more push
> in this direction would probably do more good IMHO. At some point
> GIS/lat-long info might come in, and it would be good to bring the
> data to a level that is ready for it.
>
>
> Data dump:
> When we press the button to fetch the 7/12 (saatbarah) record, the
> console records a POST with these parameters:
>
> Copy as cURL:
> curl 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712' \
>   -H 'Host: mahabhulekh.maharashtra.gov.in' \
>   -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0' \
>   -H 'Accept: application/json, text/plain, */*' \
>   -H 'Accept-Language: en-US,en;q=0.5' \
>   --compressed \
>   -H 'Content-Type: application/json;charset=utf-8' \
>   -H 'Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx' \
>   -H 'Content-Length: 170' \
>   -H 'Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc' \
>   -H 'Connection: keep-alive' \
>   -H 'Pragma: no-cache' \
>   -H 'Cache-Control: no-cache'
>
> Copy POST data:
> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>
> request headers:
> POST /Konkan/Home.aspx/call712 HTTP/1.1
> Host: mahabhulekh.maharashtra.gov.in
> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0
> Accept: application/json, text/plain, */*
> Accept-Language: en-US,en;q=0.5
> Accept-Encoding: gzip, deflate
> Content-Type: application/json;charset=utf-8
> Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx
> Content-Length: 170
> Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc
> Connection: keep-alive
> Pragma: no-cache
> Cache-Control: no-cache
>
> response headers:
> HTTP/1.1 200 OK
> Cache-Control: private, max-age=0
> Content-Type: application/json; charset=utf-8
> Server: Microsoft-IIS/8.0
> X-Powered-By: ASP.NET
> Date: Mon, 24 Oct 2016 15:31:40 GMT
> Content-Length: 10
>
> Copy Response:
> {"d":null}
>
>
> --
> --
> Cheers,
> Nikhil
> +91-966-583-1250
> Pune, India
> Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
> Blog <http://nikhilsheth.blogspot.in> | Contribute
> <https://www.payumoney.com/webfronts/#/index/NikhilVJ>
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
>
