Hi Pradeep,

My aim is more that people can use snippets from the scripts to devise
their own stuff.

I've put the repo under the GPL license; I'd prefer to share freely
and have others take it forward. I don't think commercial use of this
kind of data would be permitted, but feel free to check.

-Nikhil


On 11/14/16, Pradeep Bhatt <bhatt.prad...@gmail.com> wrote:
> This is very interesting.
>
> Can this be used for commercial purposes? Where can I read about the
> data policy on this?
>
> Regards,
> Pradeep
>
> On Mon, Nov 14, 2016 at 9:21 AM, Nikhil VJ <nikhil...@gmail.com> wrote:
>
>> Hi friends,
>>
>> I've created some shell scripts to aggregate the data from downloaded
>> 7/12 records (html files) into two CSVs. Sharing a GitHub link with
>> the code and instructions:
>> https://github.com/answerquest/mahabhulekh-7-12-aggregating
>>
>> Still no luck on automated scraping from the site, but this
>> aggregating was the next step and has really simplified the process of
>> inspecting multiple records at once.
>>
>> -Nikhil
>>
>> On 10/27/16, Nikhil VJ <nikhil...@gmail.com> wrote:
>> > Hi Ankit,
>> >
>> > Thanks for the R lead! I checked it out; I'm already doing something
>> > like it using some quick shell/bash commands and a python script that
>> > converts any html table to csv (http://stackoverflow.com/a/16697784).
>> > Once we have the data down in HTML files it's fairly straightforward.
>> > This part comes after the scraping.
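For illustration, a minimal standard-library sketch of that HTML-table-to-CSV step (the linked answer uses a similar parser-based idea; this simplified version is my own and only handles plain `td`/`th` cells, no nested tables):

```python
import csv
import io
from html.parser import HTMLParser

class TableToRows(HTMLParser):
    """Collect td/th cell text from an HTML table into rows."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell, self.in_cell = [], [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ('td', 'th'):
            self.in_cell, self.cell = True, []
        elif tag == 'tr':
            self.row = []

    def handle_endtag(self, tag):
        if tag in ('td', 'th'):
            self.in_cell = False
            self.row.append(''.join(self.cell).strip())
        elif tag == 'tr' and self.row:
            self.rows.append(self.row)

    def handle_data(self, data):
        if self.in_cell:
            self.cell.append(data)

def html_table_to_csv(html):
    """Convert the first table's cells in `html` to CSV text."""
    parser = TableToRows()
    parser.feed(html)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()
```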
>> >
>> > The data in this case is not in permanent HTML pages that we can just
>> > save in batch. It's generated server-side on the Mahabhulekh server
>> > depending on form inputs in an authenticated user session, and then
>> > rendered as HTML at one constant URL. So what I'm looking for is
>> > something that would simulate / automate the calls to the Mahabhulekh
>> > server (with due time intervals between each call, of course; we must
>> > not overload the server), and capture the output it returns.
>> >
>> > So far I haven't been able to programmatically capture the HTML
>> > appearing in the popup window it generates. The POST request returns
>> > a generic null response or the site's main webpage in all the wget
>> > and curl commands I've tried. Folks who have done some scraping
>> > earlier might be able to help.
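Once a working request method is found, the polite-delay part could be as simple as this sketch (`fetch_one` here is a hypothetical placeholder for whatever actually performs and captures the call):

```python
import time

def fetch_all(survey_numbers, fetch_one, delay_s=5, sleep=time.sleep):
    """Fetch each record in turn, pausing between calls so we don't
    overload the server. `fetch_one` is a placeholder for whatever
    actually performs the request; `sleep` is injectable for testing."""
    results = {}
    for i, sno in enumerate(survey_numbers):
        if i > 0:
            sleep(delay_s)  # due time interval between each call
        results[sno] = fetch_one(sno)
    return results
```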
>> >
>> > Another track worth exploring might be iMacros or other ways to
>> > automate browser sessions. Folks working in the testing departments
>> > of ticketing / booking sites etc. might know and could help, so
>> > please share this with your friends working on such projects!
>> >
>> > I've read in some places that R can be used to simulate this, so
>> > it'll be worth exploring further; but I know shell scripting better,
>> > so I'm hoping something works out on that front.
>> >
>> > --
>> > --
>> > Cheers,
>> > Nikhil
>> > +91-966-583-1250
>> > Pune, India
>> > Self-designed learner at Swaraj University
>> > <http://www.swarajuniversity.org>
>> > Blog <http://nikhilsheth.blogspot.in> | Contribute
>> > <https://www.payumoney.com/webfronts/#/index/NikhilVJ>
>> >
>> >
>> >
>> > On 10/25/16, Ankit Gaur <gauran...@gmail.com> wrote:
>> >> Though I am not very well conversant with Data Science and web
>> >> scraping, we had a recent DataKind meetup in Bangalore
>> >> (https://www.meetup.com/DataKind-Bangalore/events/234855978/), where
>> >> Bargava talked about using R's rvest library
>> >> <https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/>.
>> >> We were able to do some basic scraping on goodreads with this. See if
>> >> this fits your needs.
>> >>
>> >> Thanks,
>> >> Ankit
>> >>
>> >> On Mon, Oct 24, 2016 at 10:09 PM, Nikhil VJ <nikhil...@gmail.com> wrote:
>> >>
>> >>> Hi,
>> >>>
>> >>> I'm looking at Maharashtra's land records portal :
>> >>> https://mahabhulekh.maharashtra.gov.in
>> >>>
>> >>> .. and wondering if it's possible to scrape data from here?
>> >>>
>> >>> I'll share a workflow:
>> >>> choose 7/12 (७/१२) > select any जिल्हा (district) > तालुका (taluka) > गाव (village)
>> >>> for शोध (search), select सर्वे नंबर / गट नंबर (survey number / gat number, the first option)
>> >>> type 1 in the text box and press the "शोधा" (search) button
>> >>> Then we get a dropdown with options like 1/1, 1/2, 1/3 etc.
>> >>>
>> >>> On selecting any option and clicking "७/१२ पहा" (view 7/12),
>> >>> a new window/tab opens up (you have to enable popups) containing
>> >>> static HTML content (some tables). I need to capture this content.
>> >>>
>> >>> The URL is always the same:
>> >>> https://mahabhulekh.maharashtra.gov.in/Konkan/pg712.aspx
>> >>> ..but the content changes depending on the options chosen.
>> >>>
>> >>> Using the browser's "Inspect Element" > Network tab and clicking the
>> >>> final button, there is a request to this URL:
>> >>>
>> >>> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712
>> >>>
>> >>> and the request Params / Payload is like:
>> >>>
>> >>> {'sno':'1','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>> >>>
>> >>> when you change the survey/gat number to 1/10, the params change
>> >>> like so:
>> >>> {'sno':'1#10','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>> >>>
>> >>> for 1/1अ:
>> >>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
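From the three captures above, the only field that varies with the survey/gat number is `sno`, with `/` apparently encoded as `#`. A small sketch of building that payload (the `vid` and district/taluka/village values below are just the ones captured for this one village, not general; the capture shows single-quoted keys but I'm emitting standard double-quoted JSON here):

```python
import json

# Values captured above for one village -- illustrative only.
VILLAGE = {'vid': '273200030398260000', 'dn': 'रत्नागिरी', 'tn': 'खेड',
           'vn': 'वाळंजवाडी', 'tc': '3', 'dc': '32', 'did': '32', 'tid': '3'}

def build_payload(survey_no, village=VILLAGE):
    """Build the call712 payload; '/' in the survey/gat number appears
    to be sent as '#' (e.g. 1/10 -> '1#10')."""
    return json.dumps({'sno': survey_no.replace('/', '#'), **village},
                      ensure_ascii=False)
```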
>> >>>
>> >>> I tried some wget and curl commands but no luck so far. Do let me
>> >>> know if you can make some headway.
>> >>>
>> >>> Also, it would be great to learn how to extract the list of
>> >>> districts, the talukas (subdistricts) in each district, and the
>> >>> villages in each taluka.
>> >>>
>> >>> I'm dumping other info at the bottom in case it helps.
>> >>>
>> >>> Why do this:
>> >>> At present it's just an exploration following on from our work on
>> >>> village shapefiles.
>> >>> The district > taluka > village mapping data from official Land
>> >>> Records data could serve as a good source for triangulation.
>> >>> Then, while I don't see myself going deeper into this right now, I
>> >>> am aware that land records / ownership suffer from major corruption,
>> >>> entanglements and other issues precisely because of the lack of
>> >>> transparency. The mahabhulekh website itself is a significant step
>> >>> forward in making this sector a little more transparent, and more
>> >>> push in this direction would probably do more good IMHO. At some
>> >>> point GIS/lat-long info might come in, and it would be good to
>> >>> bring the data to a level that is ready for it.
>> >>>
>> >>>
>> >>> Data dump:
>> >>> When we press the button to fetch the 7/12 (saatbarah) record, the
>> >>> console records a POST with these parameters:
>> >>>
>> >>> Copy as cURL:
>> >>> curl
>> >>> 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712'
>> >>> -H 'Host: mahabhulekh.maharashtra.gov.in' -H 'User-Agent: Mozilla/5.0
>> >>> (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0' -H
>> >>> 'Accept: application/json, text/plain, */*' -H 'Accept-Language:
>> >>> en-US,en;q=0.5' --compressed -H 'Content-Type:
>> >>> application/json;charset=utf-8' -H 'Referer:
>> >>> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx' -H
>> >>> 'Content-Length: 170' -H 'Cookie:
>> >>> ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc' -H 'Connection:
>> >>> keep-alive' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'
>> >>>
>> >>> Copy POST data:
>> >>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>> >>>
>> >>> request headers:
>> >>> POST /Konkan/Home.aspx/call712 HTTP/1.1
>> >>> Host: mahabhulekh.maharashtra.gov.in
>> >>> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0)
>> >>> Gecko/20100101 Firefox/42.0
>> >>> Accept: application/json, text/plain, */*
>> >>> Accept-Language: en-US,en;q=0.5
>> >>> Accept-Encoding: gzip, deflate
>> >>> Content-Type: application/json;charset=utf-8
>> >>> Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx
>> >>> Content-Length: 170
>> >>> Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc
>> >>> Connection: keep-alive
>> >>> Pragma: no-cache
>> >>> Cache-Control: no-cache
>> >>>
>> >>> response headers:
>> >>> HTTP/1.1 200 OK
>> >>> Cache-Control: private, max-age=0
>> >>> Content-Type: application/json; charset=utf-8
>> >>> Server: Microsoft-IIS/8.0
>> >>> X-Powered-By: ASP.NET
>> >>> Date: Mon, 24 Oct 2016 15:31:40 GMT
>> >>> Content-Length: 10
>> >>>
>> >>> Copy Response:
>> >>> {"d":null}
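For anyone retrying this, the curl call above translates to roughly the following Python (standard library only; a sketch, not a working scraper). The session cookie and payload in the dump will have expired, which is presumably part of why replayed calls only return {"d":null}:

```python
import urllib.request

URL = 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712'

def make_call712_request(payload_json, session_id):
    """Build the POST seen in the browser console. The ASP.NET session
    cookie comes from an authenticated browser session."""
    return urllib.request.Request(
        URL,
        data=payload_json.encode('utf-8'),  # a data body makes this a POST
        headers={
            'Content-Type': 'application/json;charset=utf-8',
            'Accept': 'application/json, text/plain, */*',
            'Referer': 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx',
            'Cookie': 'ASP.NET_SessionId=' + session_id,
        },
    )

# urllib.request.urlopen(make_call712_request(...)) would then send it.
```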
>> >>>
>> >>>
>> >>>
>> >>> --
>> >>> Datameet is a community of Data Science enthusiasts in India. Know
>> >>> more about us by visiting http://datameet.org
>> >>> ---
>> >>> You received this message because you are subscribed to the Google
>> >>> Groups "datameet" group.
>> >>> To unsubscribe from this group and stop receiving emails from it,
>> >>> send an email to datameet+unsubscr...@googlegroups.com.
>> >>> For more options, visit https://groups.google.com/d/optout.
>> >>>
>> >>
>> >>
>> >
>> >
>>
>>
>>
>>
>
>


