I'm not well versed in data science or web scraping, but at a recent DataKind meetup in Bangalore <https://www.meetup.com/DataKind-Bangalore/events/234855978/>, Bargava talked about using R's rvest library <https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/>. We managed some basic scraping of Goodreads with it. See if it fits your needs.
Thanks,
Ankit

On Mon, Oct 24, 2016 at 10:09 PM, Nikhil VJ <[email protected]> wrote:

> Hi,
>
> I'm looking at Maharashtra's land records portal:
> https://mahabhulekh.maharashtra.gov.in
>
> .. and wondering if it's possible to scrape data from here?
>
> Will share a workflow:
> choose 7/12 (७/१२)
> select any जिल्हा (district) > तालुका (taluka) > गाव (village)
> select शोध (search): सर्वे नंबर / गट नंबर (survey number / gat number; first option)
> type 1 in the text box and press the "शोधा" (search) button
> Then we get a dropdown with options like 1/1, 1/2, 1/3 etc.
>
> On selecting any and clicking "७/१२ पहा" (view 7/12),
> a new window/tab opens up (you have to enable popups), having static
> HTML content (some tables). I need to capture this content.
>
> The URL is always the same:
> https://mahabhulekh.maharashtra.gov.in/Konkan/pg712.aspx
> ..but the content changes depending on the options chosen.
>
> On using the browser's "Inspect Element" > Network and clicking the
> final button, there is a request to this URL:
>
> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712
>
> and the request Params / Payload is like:
>
> {'sno':'1','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>
> when you change the survey/gat number to 1/10, the params change like so:
> {'sno':'1#10','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>
> for 1/1अ:
> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>
> I tried some wget and curl commands but no luck so far. Do let me know
> if you can make some headway.
>
> Also, it would be great to learn how to extract the list of
> districts, talukas (subdistricts) in each district, and villages in
> each taluka.
>
> dumping other info at bottom if it helps.
>
> Why do this:
> At present it's just an exploration following on from our work on
> village shapefiles.
> The district > taluka > village mapping data from official Land
> Records data could serve as a good source for triangulation.
> Then, while I don't see myself going deeper into this right now, I am
> aware that land records / ownership has major corruption,
> entanglements and other issues precisely because of the lack of
> transparency. The mahabhulekh website itself is a significant step
> forward in making this sector a little more transparent, and more push
> in this direction would probably do more good IMHO. At some point
> GIS/lat-long info might come in, and it would be good to bring the
> data to a level that is ready for it.
>
>
> Data dump:
> When we press the button to fetch the 7/12 (saatbarah) record, the
> console records a POST with these parameters:
>
> Copy as cURL:
> curl 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712' \
>   -H 'Host: mahabhulekh.maharashtra.gov.in' \
>   -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0' \
>   -H 'Accept: application/json, text/plain, */*' \
>   -H 'Accept-Language: en-US,en;q=0.5' \
>   --compressed \
>   -H 'Content-Type: application/json;charset=utf-8' \
>   -H 'Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx' \
>   -H 'Content-Length: 170' \
>   -H 'Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc' \
>   -H 'Connection: keep-alive' \
>   -H 'Pragma: no-cache' \
>   -H 'Cache-Control: no-cache'
>
> Copy POST data:
> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>
> request headers:
> POST /Konkan/Home.aspx/call712 HTTP/1.1
> Host: mahabhulekh.maharashtra.gov.in
> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0
> Accept: application/json, text/plain, */*
> Accept-Language: en-US,en;q=0.5
> Accept-Encoding: gzip, deflate
> Content-Type: application/json;charset=utf-8
> Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx
> Content-Length: 170
> Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc
> Connection: keep-alive
> Pragma: no-cache
> Cache-Control: no-cache
>
> response headers:
> HTTP/1.1 200 OK
> Cache-Control: private, max-age=0
> Content-Type: application/json; charset=utf-8
> Server: Microsoft-IIS/8.0
> X-Powered-By: ASP.NET
> Date: Mon, 24 Oct 2016 15:31:40 GMT
> Content-Length: 10
>
> Copy Response:
> {"d":null}
>
>
> --
> Cheers,
> Nikhil
> +91-966-583-1250
> Pune, India
> Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
> Blog <http://nikhilsheth.blogspot.in> | Contribute
> <https://www.payumoney.com/webfronts/#/index/NikhilVJ>
>
> --
> Datameet is a community of Data Science enthusiasts in India. Know more
> about us by visiting http://datameet.org
> ---
> You received this message because you are subscribed to the Google Groups
> "datameet" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> For more options, visit https://groups.google.com/d/optout.
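
For the call712 request quoted above, here is a minimal Python sketch using only the standard library. It has not been run against the live site. It rebuilds the single-quoted POST body in the same field order as the capture; for the 1/1अ example the body comes to 170 bytes in UTF-8, which matches the Content-Length in the dump. The captured curl command carries that Content-Length header but no request body, which may be part of why it got nowhere. The names to_sno, build_payload, build_request and fetch_712 are made up for illustration. Two things are assumptions, not confirmed behaviour: that the "/" in the dropdown value always becomes "#" in sno (it does in the three examples given), and that the server keeps the selection in the ASP.NET session, so that a GET of pg712.aspx with the same cookie returns the tables. The {"d":null} response is at least consistent with that. A fresh session id would presumably have to come from loading Home.aspx first.

```python
import urllib.request

BASE = "https://mahabhulekh.maharashtra.gov.in/Konkan"

# Village fields copied from the captured POST data.
VILLAGE = {
    "vid": "273200030398260000",
    "dn": "रत्नागिरी",
    "tn": "खेड",
    "vn": "वाळंजवाडी",
    "tc": "3",
    "dc": "32",
    "did": "32",
    "tid": "3",
}


def to_sno(dropdown_value):
    """Map a dropdown value such as '1/10' to the 'sno' form '1#10'.

    Assumed rule, inferred from the three captured examples.
    """
    return dropdown_value.replace("/", "#")


def build_payload(dropdown_value, village=VILLAGE):
    """Return the POST body as UTF-8 bytes, single-quoted like the capture."""
    fields = {"sno": to_sno(dropdown_value), **village}
    body = "{" + ",".join(f"'{k}':'{v}'" for k, v in fields.items()) + "}"
    return body.encode("utf-8")


def build_request(dropdown_value, session_id, village=VILLAGE):
    """Build (but do not send) the POST to Home.aspx/call712."""
    return urllib.request.Request(
        f"{BASE}/Home.aspx/call712",
        data=build_payload(dropdown_value, village),
        method="POST",
        headers={
            "Content-Type": "application/json;charset=utf-8",
            "Accept": "application/json, text/plain, */*",
            "Referer": f"{BASE}/Home.aspx",
            "Cookie": f"ASP.NET_SessionId={session_id}",
        },
    )


def fetch_712(dropdown_value, session_id, village=VILLAGE):
    """POST the selection, then GET pg712.aspx with the same session cookie.

    Assumption: the server remembers the selection per session, so the
    second request returns the HTML tables for that survey/gat number.
    """
    with urllib.request.urlopen(build_request(dropdown_value, session_id, village)) as resp:
        resp.read()  # the capture shows {"d":null} here
    page = urllib.request.Request(
        f"{BASE}/pg712.aspx",
        headers={"Cookie": f"ASP.NET_SessionId={session_id}"},
    )
    with urllib.request.urlopen(page) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

Something like fetch_712("1/10", session_id) would then be the call to try, with session_id taken from a current browser session. If the second request comes back empty, the session assumption is probably wrong and the popup page needs a closer look in the Network tab.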
