Hi Ankit,

Thanks for the R lead! I checked it out. I'm already doing something
similar with some quick shell/bash commands plus a python script that
converts any HTML table to CSV (http://stackoverflow.com/a/16697784).
Once we have the data saved as HTML files, that part is fairly
straightforward; it comes after the scraping.
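For reference, here's a stdlib-only sketch of what my table-to-CSV step does (not the exact StackOverflow script; it assumes simple, non-nested tables):

```python
import csv
import io
from html.parser import HTMLParser

class TableToCSV(HTMLParser):
    """Collect the text of every <td>/<th> cell, one row per <tr>."""
    def __init__(self):
        super().__init__()
        self.rows, self._row, self._in_cell = [], [], False

    def handle_starttag(self, tag, attrs):
        if tag in ("td", "th"):
            self._in_cell = True
            self._row.append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False
        elif tag == "tr" and self._row:
            self.rows.append(self._row)
            self._row = []

    def handle_data(self, data):
        # Accumulate text only while inside a cell
        if self._in_cell:
            self._row[-1] += data.strip()

def html_table_to_csv(html_text):
    parser = TableToCSV()
    parser.feed(html_text)
    out = io.StringIO()
    csv.writer(out).writerows(parser.rows)
    return out.getvalue()
```

Once the 7/12 pages are saved locally, running each file through something like this gives one CSV row per table row.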

The data in this case is not in permanent HTML pages that we can just
save in batch. It is generated server-side on the Mahabhulekh server
from form inputs in an authenticated user session, and then rendered
as HTML at one constant URL. So what I'm looking for is something that
would simulate / automate the calls to the Mahabhulekh server (with
due time intervals between each call, of course; we must not overload
the server) and capture the output it returns.
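To make the intent concrete: if the request format gets cracked, the loop I have in mind would look roughly like this (python, stdlib only; the field names are the ones visible in the browser's Network tab, while the session cookie, the village values and the 5-second delay are placeholders, not something tested against the live server):

```python
import json
import time
import urllib.request

API = "https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712"

def build_payload(sno, vid, dn, tn, vn, tc, dc, did, tid):
    """Mirror the parameter names captured in the Network tab."""
    return json.dumps(
        {"sno": sno, "vid": vid, "dn": dn, "vn": vn, "tn": tn,
         "tc": tc, "dc": dc, "did": did, "tid": tid},
        ensure_ascii=False)

def fetch_all(survey_numbers, session_cookie, delay=5, **village):
    """POST once per survey number, pausing `delay` seconds between
    calls so as not to overload the server."""
    for sno in survey_numbers:
        body = build_payload(sno=sno, **village).encode("utf-8")
        req = urllib.request.Request(
            API, data=body, method="POST",
            headers={"Content-Type": "application/json;charset=utf-8",
                     "Cookie": "ASP.NET_SessionId=" + session_cookie})
        with urllib.request.urlopen(req) as resp:
            yield sno, resp.read().decode("utf-8")
        time.sleep(delay)  # be gentle with the server
```

Going by the captured payloads, the '#' in sno seems to stand in for '/' (e.g. survey number 1/10 becomes '1#10').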

So far I haven't been able to programmatically capture the HTML that
comes up in the popup window. In all the wget and curl commands I've
tried, the POST request returns either a generic null response or the
site's main webpage. Folks who have done some scraping earlier might
be able to help.
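One guess about why curl returns null: ASP.NET sites like this usually hand out a session cookie when you first load Home.aspx, and the call712 endpoint may refuse to answer without it. Here's a sketch of carrying cookies across requests with python's stdlib (an untested hypothesis, not a confirmed fix):

```python
import http.cookiejar
import urllib.request

HOME = "https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx"

def make_session_opener():
    """Build an opener that remembers cookies (e.g. ASP.NET_SessionId)
    across requests, the way a browser does."""
    jar = http.cookiejar.CookieJar()
    opener = urllib.request.build_opener(
        urllib.request.HTTPCookieProcessor(jar))
    return opener, jar

# Hypothetical usage: GET the form page first so the server issues a
# session cookie, then reuse the same opener for the call712 POST.
# opener, jar = make_session_opener()
# opener.open(HOME)
# print([c.name for c in jar])  # ASP.NET_SessionId should appear here
```

If the server also checks hidden form state (viewstate etc.), a cookie alone may still not be enough; that would point back to browser automation.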

Another track worth exploring might be iMacros or other ways to
automate browser sessions. Folks working in the testing departments of
ticketing / booking sites etc. might know this area and could help, so
please share this with your friends working on such projects!

I've read in a few places that R can be used to simulate this, so yes,
it's worth exploring further. But I know shell scripting better, so
I'm hoping something turns up on that front.

--
Cheers,
Nikhil
+91-966-583-1250
Pune, India
Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
Blog <http://nikhilsheth.blogspot.in> | Contribute
<https://www.payumoney.com/webfronts/#/index/NikhilVJ>



On 10/25/16, Ankit Gaur <gauran...@gmail.com> wrote:
> Though I am not very well conversant with Data Sciences and web scraping,
> we had a recent DataKind meetup
> https://www.meetup.com/DataKind-Bangalore/events/234855978/ in Bangalore,
> where Bargava talked about using R's rvest library
> <https://blog.rstudio.org/2014/11/24/rvest-easy-web-scraping-with-r/>. We
> were able to do some basic scraping on goodreads with this. See if this
> fits your needs.
>
> Thanks,
> Ankit
>
> On Mon, Oct 24, 2016 at 10:09 PM, Nikhil VJ <nikhil...@gmail.com> wrote:
>
>> Hi,
>>
>> I'm looking at Maharashtra's land records portal :
>> https://mahabhulekh.maharashtra.gov.in
>>
>> .. and wondering if it's possible to scrape data from here?
>>
>> Will share a workflow:
>> choose 7/12 (७/१२) > select any जिल्हा > तालुका > गाव
>> select शोध :  सर्वे नंबर / गट नंबर (first option)
>> type 1 in the text box and press the "शोधा" button
>> Then we get a dropdown with options like 1/1 , 1/2, 1/3 etc.
>>
>> On selecting any and clicking "७/१२ पहा",
>> a new window/tab opens up (you have to enable popups), having static
>> HTML content (some tables). I need to capture this content.
>>
>> The URL is always the same:
>> https://mahabhulekh.maharashtra.gov.in/Konkan/pg712.aspx
>> ..but the content changes depending on the options chosen.
>>
>> On using the browser's "Inspect Element"> Network and clicking the
>> final button, there is a request to this URL:
>>
>> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712
>>
>> and the request Params / Payload is like:
>>
>> {'sno':'1','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>
>> when you change the survey/gat number to 1/10, the params change like so:
>> {'sno':'1#10','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>
>> for 1/1अ:
>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>
>> I tried some wget and curl commands but no luck so far. Do let me know
>> if you can make some headway.
>>
>> Also, it would be great to learn how to extract the list of
>> districts, the talukas (subdistricts) in each district, and the
>> villages in each taluka.
>>
>> I'm dumping other info at the bottom in case it helps.
>>
>> Why do this:
>> At present it's just an exploration following on from our work on
>> village shapefiles.
>> The district > taluka > village mapping data from official Land
>> Records data could serve as a good source for triangulation.
>> Then, while I don't see myself going deeper into this right now, I am
>> aware that land records / ownership has major corruption,
>> entanglements and other issues precisely because of the lack of
>> transparency. The mahabhulekh website itself is a significant step
>> forward in making this sector a little more transparent, and more push
>> in this direction would probably do more good IMHO. At some point
>> GIS/lat-long info might come in, and it would be good to bring the
>> data to a level that is ready for it.
>>
>>
>> Data dump:
>> When we press the button to fetch the 7/12 (saatbarah) record, the
>> console records a POST with these parameters:
>>
>> Copy as cURL:
>> curl 'https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx/call712'
>> -H 'Host: mahabhulekh.maharashtra.gov.in' -H 'User-Agent: Mozilla/5.0
>> (X11; Ubuntu; Linux i686; rv:42.0) Gecko/20100101 Firefox/42.0' -H
>> 'Accept: application/json, text/plain, */*' -H 'Accept-Language:
>> en-US,en;q=0.5' --compressed -H 'Content-Type:
>> application/json;charset=utf-8' -H 'Referer:
>> https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx' -H
>> 'Content-Length: 170' -H 'Cookie:
>> ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc' -H 'Connection:
>> keep-alive' -H 'Pragma: no-cache' -H 'Cache-Control: no-cache'
>>
>> Copy POST data:
>> {'sno':'1#1अ','vid':'273200030398260000','dn':'रत्नागिरी','tn':'खेड','vn':'वाळंजवाडी','tc':'3','dc':'32','did':'32','tid':'3'}
>>
>> request headers:
>> POST /Konkan/Home.aspx/call712 HTTP/1.1
>> Host: mahabhulekh.maharashtra.gov.in
>> User-Agent: Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:42.0)
>> Gecko/20100101 Firefox/42.0
>> Accept: application/json, text/plain, */*
>> Accept-Language: en-US,en;q=0.5
>> Accept-Encoding: gzip, deflate
>> Content-Type: application/json;charset=utf-8
>> Referer: https://mahabhulekh.maharashtra.gov.in/Konkan/Home.aspx
>> Content-Length: 170
>> Cookie: ASP.NET_SessionId=3ozsnwd3nhh4py4hmiqcjeoc
>> Connection: keep-alive
>> Pragma: no-cache
>> Cache-Control: no-cache
>>
>> response headers:
>> HTTP/1.1 200 OK
>> Cache-Control: private, max-age=0
>> Content-Type: application/json; charset=utf-8
>> Server: Microsoft-IIS/8.0
>> X-Powered-By: ASP.NET
>> Date: Mon, 24 Oct 2016 15:31:40 GMT
>> Content-Length: 10
>>
>> Copy Response:
>> {"d":null}
>>
>>
>> --
>> Cheers,
>> Nikhil
>> +91-966-583-1250
>> Pune, India
>> Self-designed learner at Swaraj University <http://www.swarajuniversity.org>
>> Blog <http://nikhilsheth.blogspot.in> | Contribute
>> <https://www.payumoney.com/webfronts/#/index/NikhilVJ>
>>
>> --
>> Datameet is a community of Data Science enthusiasts in India. Know more
>> about us by visiting http://datameet.org
>> ---
>> You received this message because you are subscribed to the Google Groups
>> "datameet" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to datameet+unsubscr...@googlegroups.com.
>> For more options, visit https://groups.google.com/d/optout.
>>
>



