Hi friends, I've created some shell scripts to aggregate the data from downloaded 7/12 records (HTML files) into two CSVs. Sharing a GitHub link with the code and instructions: https://github.com/answerquest/mahabhulekh-7-12-aggregating
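For anyone curious what the html-table-to-csv step looks like: the repo above uses shell scripts plus the StackOverflow Python script linked below, but here's a minimal, stdlib-only illustrative sketch of the same idea (not the repo's actual code):

```python
import csv
import io
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Collect the cells of every <table> in an HTML document into rows."""
    def __init__(self):
        super().__init__()
        self.rows = []       # completed rows
        self._row = []       # cells of the row currently being read
        self._cell = []      # text fragments of the current cell
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell, self._cell = True, []
        elif tag == "tr":
            self._row = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
            self._row.append("".join(self._cell).strip())
        elif tag == "tr" and self._row:
            self.rows.append(self._row)

    def handle_data(self, data):
        if self._in_cell:
            self._cell.append(data)

def html_tables_to_csv(html_text):
    """Return the table cells of html_text rendered as CSV text."""
    parser = TableToCSV()
    parser.feed(html_text)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()
```

Usage: run `html_tables_to_csv(open("record.html").read())` per downloaded record and concatenate the results. Real 7/12 pages have multiple tables and merged cells, so a production version needs more care than this.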
Still no luck with automated scraping from the site, but this aggregation was the next step and has really simplified the process of inspecting multiple records at once.

-Nikhil

On 10/27/16, Nikhil VJ <nikhil...@gmail.com> wrote:
> Hi Ankit,
>
> Thanks for the R lead! I checked it out.. I'm already doing something
> like it using some quick shell/bash commands and a python script that
> converts any html table to csv (http://stackoverflow.com/a/16697784).
> Once we have the data down in HTMLs it's fairly straightforward. This
> part comes after the scraping.
>
> The data in this case is not in permanent HTMLs that we can just save
> in batch. It's being generated server-side on the Mahabhulekh server
> depending on form inputs in an authenticated user session, and then
> rendered as HTML at one constant URL. So what I'm looking for is
> something that would simulate / automate the calls to the Mahabhulekh
> server (with due time intervals between each call, of course; we must
> not overload the server) and capture the output it returns.
>
> So far I'm not able to programmatically capture the HTML coming in the
> popup window it generates. The POST request returns a generic null
> response or the site's main webpage in all the wget and curl commands
> I've tried. Folks who have done some scraping earlier might be able to
> help.
>
> Another track worth exploring might be iMacros or other ways to
> automate browser sessions. Folks working in the testing departments of
> ticketing / booking sites etc. might know and could help, so please
> share this with your friends working on such projects!
>
> I've read in some places that R can be used to simulate this.. so yes,
> it'll be worth continuing to explore, but I know shell scripting
> better, so I'm hoping something comes of that.
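The "due time intervals between each call" point generalizes to any scraper. A minimal sketch of that pacing in Python; `fetch` here is a placeholder for whatever call eventually works against the site, and the 5-second default is an arbitrary guess, not a number from Mahabhulekh:

```python
import time

def fetch_all(items, fetch, delay_seconds=5.0):
    """Call fetch(item) for each item, sleeping between calls so we
    don't overload the server. Returns the list of results."""
    results = []
    for i, item in enumerate(items):
        if i > 0:
            time.sleep(delay_seconds)  # polite gap between requests
        results.append(fetch(item))
    return results
```

In practice you'd also want to save each result to disk as it arrives and to back off or stop on errors, so a failed run doesn't hammer the server.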
>
> --
> Cheers,
> Nikhil
> +91-966-583-1250
> Pune, India
> Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
> Blog <http://nikhilsheth.blogspot.in> | Contribute <https://www.payumoney.com/webfronts/#/index/NikhilVJ>

On 10/25/16, Ankit Gaur <gauran...@gmail.com> wrote:
>> Though I am not very well versed in data science and web scraping, we
>> had a recent DataKind meetup in Bangalore
>> (https://www.meetup.com/DataKind-Bangalore/events/234855978/), where
>> Bargava talked about using R's rvest library
>> <https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/>.
>> We were able to do some basic scraping on Goodreads with this. See if
>> this fits your needs.
>>
>> Thanks,
>> Ankit
>>
>> On Mon, Oct 24, 2016 at 10:09 PM, Nikhil VJ <nikhil...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> I'm looking at Maharashtra's land records portal:
>>> https://mahabhulekh.maharashtra.gov.in
>>>
>>> .. and wondering if it's possible to scrape data from here?
>>>
>>> Here's the workflow:
>>> choose 7/12 (७/१२) > select any जिल्हा (district) > तालुका (taluka) > गाव (village)
>>> select शोध (search): सर्वे नंबर / गट नंबर (survey number / gat number, the first option)
>>> type 1 in the text box and press the "शोधा" (search) button
>>> Then we get a dropdown with options like 1/1, 1/2, 1/3 etc.
>>>
>>> On selecting any option and clicking "७/१२ पहा" (view 7/12), a new
>>> window/tab opens up (you have to enable popups), having static HTML
>>> content (some tables). I need to capture this content.
>>>
>>> The URL is always the same:
>>> https://mahabhulekh.maharashtra.gov.in/Konkan/pg712.aspx
>>> ..but the content changes depending on the options chosen.
>>>
>>> Using the browser's "Inspect Element" > Network tab and clicking the
>>> final button, there is a request to this URL:
>>>
>>> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712
>>>
>>> and the request params / payload looks like:
>>>
>>> {'sno':'1','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>>
>>> When you change the survey/gat number to 1/10, the params change like so:
>>> {'sno':'1#10','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>>
>>> For 1/1अ:
>>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>>
>>> I tried some wget and curl commands but no luck so far. Do let me know
>>> if you can make some headway.
>>>
>>> Also, it would be great to learn how to extract the list of districts,
>>> the talukas (subdistricts) in each district, and the villages in each
>>> taluka.
>>>
>>> I'm dumping other info at the bottom in case it helps.
>>>
>>> Why do this:
>>> At present it's just an exploration following on from our work on
>>> village shapefiles. The district > taluka > village mapping from
>>> official land records data could serve as a good source for
>>> triangulation. Then, while I don't see myself going deeper into this
>>> right now, I am aware that land records / ownership have major
>>> corruption, entanglement and other issues precisely because of the
>>> lack of transparency. The Mahabhulekh website itself is a significant
>>> step forward in making this sector a little more transparent, and more
>>> push in this direction would probably do more good, IMHO. At some
>>> point GIS / lat-long info might come in, and it would be good to bring
>>> the data to a level that is ready for it.
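One untested guess about the null responses: the page method may only answer when the POST carries a live ASP.NET session cookie, i.e. you may need to GET Home.aspx first and reuse its cookie. A Python sketch of that idea, using only the stdlib; the payload keys and the `/`-to-`#` encoding of sno come from the captures in this thread, while the session hypothesis, the double-quoted JSON (the dev-tools capture shows single quotes), and the assumption that one plain GET is enough to set up the session are all unverified:

```python
import json
import urllib.request
from http.cookiejar import CookieJar

BASE = "https://mahabhulekh.maharashtra.gov.in/Konkan"

def make_payload(survey_no, vid, dn, tn, vn, tc, dc, did, tid):
    """Build the call712 payload. The site encodes sub-numbers like
    1/10 as 1#10 in the sno field, per the captured requests."""
    return json.dumps({
        "sno": survey_no.replace("/", "#"),
        "vid": vid, "dn": dn, "tn": tn, "vn": vn,
        "tc": tc, "dc": dc, "did": did, "tid": tid,
    }, ensure_ascii=False)

def fetch_712(payload):
    """POST the payload to call712, reusing the cookies (including
    ASP.NET_SessionId) set by a prior GET of Home.aspx."""
    jar = CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    opener.open(BASE + "/Home.aspx")  # establish the session cookie
    req = urllib.request.Request(
        BASE + "/Home.aspx/call712",
        data=payload.encode("utf-8"),
        headers={
            "Content-Type": "application/json;charset=utf-8",
            "Referer": BASE + "/Home.aspx",
        },
        method="POST",
    )
    with opener.open(req) as resp:
        return resp.read().decode("utf-8")
```

If the server ties the session to the district/taluka/village form selections made in the browser, a bare GET won't be enough and this will still return {"d":null}; in that case browser automation is probably the safer track.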
>>>
>>> Data dump:
>>> When we press the button to fetch the 7/12 (saatbarah) record, the
>>> console records a POST with these parameters:
>>>
>>> Copy as cURL:
>>> curl 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712' \
>>>   -H 'Host: mahabhulekh.maharashtra.gov.in' \
>>>   -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0' \
>>>   -H 'Accept: application/json, text/plain, */*' \
>>>   -H 'Accept-Language: en-US,en;q=0.5' \
>>>   --compressed \
>>>   -H 'Content-Type: application/json;charset=utf-8' \
>>>   -H 'Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx' \
>>>   -H 'Content-Length: 170' \
>>>   -H 'Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc' \
>>>   -H 'Connection: keep-alive' \
>>>   -H 'Pragma: no-cache' \
>>>   -H 'Cache-Control: no-cache'
>>>
>>> Copy POST data:
>>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>>
>>> Request headers:
>>> POST /Konkan/Home.aspx/call712 HTTP/1.1
>>> Host: mahabhulekh.maharashtra.gov.in
>>> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0
>>> Accept: application/json, text/plain, */*
>>> Accept-Language: en-US,en;q=0.5
>>> Accept-Encoding: gzip, deflate
>>> Content-Type: application/json;charset=utf-8
>>> Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx
>>> Content-Length: 170
>>> Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc
>>> Connection: keep-alive
>>> Pragma: no-cache
>>> Cache-Control: no-cache
>>>
>>> Response headers:
>>> HTTP/1.1 200 OK
>>> Cache-Control: private, max-age=0
>>> Content-Type: application/json; charset=utf-8
>>> Server: Microsoft-IIS/8.0
>>> X-Powered-By: ASP.NET
>>> Date: Mon, 24 Oct 2016 15:31:40 GMT
>>> Content-Length: 10
>>>
>>> Copy Response:
>>> {"d":null}
>>>
>>> --
>>> Datameet is a community of Data Science enthusiasts in India. Know more
>>> about us by visiting http://datameet.org
>>> ---
>>> You received this message because you are subscribed to the Google
>>> Groups "datameet" group.
>>> To unsubscribe from this group and stop receiving emails from it, send
>>> an email to datameet+unsubscr...@googlegroups.com.
>>> For more options, visit https://groups.google.com/d/optout.