Hi Pradeep,

My aim is more that people can use snippets from the scripts to devise
their own stuff.
I've put the repo under the GPL license; I would prefer to share freely
and have others take it forward. I don't think commercial use of this
kind of data would be permitted, but feel free to check.

-Nikhil

On 11/14/16, Pradeep Bhatt <bhatt.prad...@gmail.com> wrote:
> This is very interesting.
>
> Can this be used for commercial purposes? Where can I read about the
> data policy on this?
>
> Regards,
> Pradeep
>
> On Mon, Nov 14, 2016 at 9:21 AM, Nikhil VJ <nikhil...@gmail.com> wrote:
>
>> Hi friends,
>>
>> I've created some shell scripts to aggregate the data from downloaded
>> 7/12 records (HTML files) into two CSVs. Sharing a GitHub link with
>> the code and instructions:
>> https://github.com/answerquest/mahabhulekh-7-12-aggregating
>>
>> Still no luck with automated scraping from the site, but this
>> aggregation was the next step and has really simplified the process
>> of inspecting multiple records at once.
>>
>> -Nikhil
>>
>> On 10/27/16, Nikhil VJ <nikhil...@gmail.com> wrote:
>> > Hi Ankit,
>> >
>> > Thanks for the R lead! I checked it out. I'm already doing
>> > something like that using some quick shell/bash commands and a
>> > Python script that converts any HTML table to CSV
>> > (http://stackoverflow.com/a/16697784). Once we have the data down
>> > in HTML files it's fairly straightforward. This part comes after
>> > the scraping.
>> >
>> > The data in this case is not in permanent HTML pages that we can
>> > just save in batch. It is generated server-side on the Mahabhulekh
>> > server, depending on form inputs in an authenticated user session,
>> > and then rendered as HTML at one constant URL. So what I'm looking
>> > for is something that would simulate / automate the calls to the
>> > Mahabhulekh server (with due time intervals between each call, of
>> > course; we must not overload the server) and capture the output it
>> > returns.
>> >
>> > So far I'm not able to programmatically capture the HTML coming in
>> > the popup window it generates.
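The HTML-table-to-CSV step mentioned above can be sketched with just
the Python standard library (this is not the linked script; the class
and function names here are illustrative):

```python
# Minimal sketch: extract the cells of the <table> elements in an HTML
# page and emit them as CSV, using only the standard library.
import csv
import io
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Collect cell text from the tables found in an HTML page."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], [], []
        self.in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self.in_cell = True
            self.cell = []

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self.in_cell = False
            self.row.append("".join(self.cell).strip())
        elif tag == "tr" and self.row:
            self.rows.append(self.row)
            self.row = []

    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

def html_table_to_csv(html_text):
    """Return the table cells of html_text as a CSV string."""
    parser = TableToCSV()
    parser.feed(html_text)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()
```

Once each 7/12 record is saved as an HTML file, a shell loop over
`html_table_to_csv` would get the per-record tables into a form that
can be concatenated into the two aggregate CSVs.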
>> > The POST request returns a generic null response, or the site's
>> > main webpage, in all the wget and curl commands I've tried. Folks
>> > who have done some scraping earlier might be able to help.
>> >
>> > Another track worth exploring might be iMacros or other ways to
>> > automate browser sessions. Folks working in the testing departments
>> > of ticketing / booking sites etc. might know and could help, so
>> > please share this with your friends working on such projects!
>> >
>> > I've read in a few places that R can be used to simulate this, so
>> > it'll be worth continuing to explore; but I know shell scripting
>> > better, so I'm hoping something comes of that.
>> >
>> > --
>> > Cheers,
>> > Nikhil
>> > +91-966-583-1250
>> > Pune, India
>> > Self-designed learner at Swaraj University
>> > <http://www.swarajuniversity.org>
>> > Blog <http://nikhilsheth.blogspot.in> | Contribute
>> > <https://www.payumoney.com/webfronts/#/index/NikhilVJ>
>> >
>> > On 10/25/16, Ankit Gaur <gauran...@gmail.com> wrote:
>> >> Though I am not very well versed in data science and web scraping,
>> >> we had a recent DataKind meetup in Bangalore
>> >> (https://www.meetup.com/DataKind-Bangalore/events/234855978/),
>> >> where Bargava talked about using R's rvest library
>> >> <https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/>.
>> >> We were able to do some basic scraping on Goodreads with this. See
>> >> if this fits your needs.
>> >>
>> >> Thanks,
>> >> Ankit
>> >>
>> >> On Mon, Oct 24, 2016 at 10:09 PM, Nikhil VJ <nikhil...@gmail.com> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> I'm looking at Maharashtra's land records portal:
>> >>> https://mahabhulekh.maharashtra.gov.in
>> >>> .. and wondering if it's possible to scrape data from here?
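The browser-session automation track mentioned above could be sketched
with Selenium as a stand-in for iMacros. This is untested against the
site, and the form-filling selectors would have to be read off the live
page; the function name is illustrative:

```python
# Sketch: drive a real browser through the form, then grab the HTML of
# the popup window. Untested against the live site.
def capture_popup_html(start_url):
    """Open the portal, navigate the form, return the popup's HTML."""
    from selenium import webdriver  # third-party: pip install selenium
    driver = webdriver.Firefox()
    try:
        driver.get(start_url)
        # ... select district/taluka/village, type the survey number
        # and click the view button here; element selectors omitted,
        # they must be read off the live page ...
        driver.switch_to.window(driver.window_handles[-1])  # the popup
        return driver.page_source  # the static tables we want
    finally:
        driver.quit()
```

Because the browser itself maintains the authenticated session and
executes the site's JavaScript, this sidesteps the null-response
problem seen with raw wget/curl, at the cost of being slower.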
>> >>>
>> >>> Will share the workflow:
>> >>> choose 7/12 (७/१२) > select any जिल्हा (district) > तालुका (taluka) > गाव (village)
>> >>> select शोध : सर्वे नंबर / गट नंबर (search by survey number / gat number; the first option)
>> >>> type 1 in the text box and press the "शोधा" (search) button.
>> >>> Then we get a dropdown with options like 1/1, 1/2, 1/3 etc.
>> >>>
>> >>> On selecting any of them and clicking "७/१२ पहा" (view 7/12),
>> >>> a new window/tab opens (you have to enable popups) with static
>> >>> HTML content (some tables). I need to capture this content.
>> >>>
>> >>> The URL is always the same:
>> >>> https://mahabhulekh.maharashtra.gov.in/Konkan/pg712.aspx
>> >>> ..but the content changes depending on the options chosen.
>> >>>
>> >>> Using the browser's "Inspect Element" > Network tab and clicking
>> >>> the final button, there is a request to this URL:
>> >>> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712
>> >>>
>> >>> and the request params / payload look like:
>> >>> {'sno':'1','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>> >>>
>> >>> When you change the survey/gat number to 1/10, the params change
>> >>> like so:
>> >>> {'sno':'1#10','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>> >>>
>> >>> For 1/1अ:
>> >>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>> >>>
>> >>> I tried some wget and curl commands but no luck so far. Do let me
>> >>> know if you can make some headway.
>> >>>
>> >>> Also, it would be great to learn how to extract the list of
>> >>> districts, the talukas (subdistricts) in each district, and the
>> >>> villages in each taluka.
>> >>>
>> >>> I'm dumping other info at the bottom in case it helps.
>> >>>
>> >>> Why do this:
>> >>> At present it's just an exploration following on from our work on
>> >>> village shapefiles.
>> >>> The district > taluka > village mapping data from the official
>> >>> Land Records data could serve as a good source for triangulation.
>> >>> Then, while I don't see myself going deeper into this right now,
>> >>> I am aware that land records / ownership have major corruption,
>> >>> entanglements and other issues precisely because of the lack of
>> >>> transparency. The Mahabhulekh website itself is a significant
>> >>> step forward in making this sector a little more transparent, and
>> >>> more of a push in this direction would probably do more good,
>> >>> IMHO. At some point GIS / lat-long info might come in, and it
>> >>> would be good to bring the data to a level that is ready for it.
>> >>>
>> >>> Data dump:
>> >>> When we press the button to fetch the 7/12 (saatbarah) record,
>> >>> the console records a POST with these parameters:
>> >>>
>> >>> Copy as cURL:
>> >>> curl 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712'
>> >>> -H 'Host: mahabhulekh.maharashtra.gov.in'
>> >>> -H 'User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0'
>> >>> -H 'Accept: application/json, text/plain, */*'
>> >>> -H 'Accept-Language: en-US,en;q=0.5'
>> >>> --compressed
>> >>> -H 'Content-Type: application/json;charset=utf-8'
>> >>> -H 'Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx'
>> >>> -H 'Content-Length: 170'
>> >>> -H 'Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc'
>> >>> -H 'Connection: keep-alive'
>> >>> -H 'Pragma: no-cache'
>> >>> -H 'Cache-Control: no-cache'
>> >>>
>> >>> Copy POST data:
>> >>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>> >>>
>> >>> Request headers:
>> >>> POST /Konkan/Home.aspx/call712 HTTP/1.1
>> >>> Host: mahabhulekh.maharashtra.gov.in
>> >>> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0
>> >>> Accept: application/json, text/plain, */*
>> >>> Accept-Language: en-US,en;q=0.5
>> >>> Accept-Encoding: gzip, deflate
>> >>> Content-Type: application/json;charset=utf-8
>> >>> Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx
>> >>> Content-Length: 170
>> >>> Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc
>> >>> Connection: keep-alive
>> >>> Pragma: no-cache
>> >>> Cache-Control: no-cache
>> >>>
>> >>> Response headers:
>> >>> HTTP/1.1 200 OK
>> >>> Cache-Control: private, max-age=0
>> >>> Content-Type: application/json; charset=utf-8
>> >>> Server: Microsoft-IIS/8.0
>> >>> X-Powered-By: ASP.NET
>> >>> Date: Mon, 24 Oct 2016 15:31:40 GMT
>> >>> Content-Length: 10
>> >>>
>> >>> Copy Response:
>> >>> {"d":null}
>> >>>
>> >>> --
>> >>> Datameet is a community of Data Science enthusiasts in India.
>> >>> Know more about us by visiting http://datameet.org
>> >>> ---
>> >>> You received this message because you are subscribed to the
>> >>> Google Groups "datameet" group.
>> >>> To unsubscribe from this group and stop receiving emails from it,
>> >>> send an email to datameet+unsubscr...@googlegroups.com.
>> >>> For more options, visit https://groups.google.com/d/optout.
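In the captured payloads, the only field that varies with the
survey/gat number is 'sno', with '/' written as '#'. A sketch of
rebuilding and replaying this POST from Python instead of curl follows
(unverified against the live site; the function names are my own). One
thing worth checking: the "Copy POST data" shown is single-quoted,
which is not strict JSON, whereas the browser sends the double-quoted
form that json.dumps produces. The session id must come from a live
authenticated browser session:

```python
# Sketch: rebuild and replay the captured call712 POST.
# Unverified against the live site; helper names are illustrative.
import json

URL = "https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712"

def build_payload(survey_no):
    """Strict-JSON body for a survey/gat number; in the captures only
    'sno' varies, with '/' written as '#' (1/10 -> '1#10')."""
    return json.dumps({
        "sno": survey_no.replace("/", "#"),
        "vid": "273200030398260000",
        "dn": "रत्नागिरी", "tn": "खेड", "vn": "वाळंजवाडी",
        "tc": "3", "dc": "32", "did": "32", "tid": "3",
    }, ensure_ascii=False)

def fetch_712(survey_no, session_id):
    """POST with the captured headers; return the parsed response."""
    import requests  # third-party: pip install requests
    headers = {
        "Content-Type": "application/json;charset=utf-8",
        "Accept": "application/json, text/plain, */*",
        "Referer": "https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx",
        "Cookie": "ASP.NET_SessionId=" + session_id,
    }
    r = requests.post(URL, data=build_payload(survey_no).encode("utf-8"),
                      headers=headers, timeout=30)
    return r.json()  # so far this has come back as {"d": null}

if __name__ == "__main__":
    print(fetch_712("1/1अ", "3ozsnwd3nhh4py4hmiqcjeoc"))
```

The 'vid' and district/taluka/village fields are the captured ones for
this particular village; scraping other villages would mean first
discovering their ids, which is the open question above.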