Jim Sheppard wrote:
> 1. Select a location: [Stogursey]
>    1. Grab some basic background
> 2. Choose an Administrative Unit [Stogursey AP/CP]
> 3. Select a Theme [Population]
> 4. Choose [Total Population]
>    1. Select Table View
>    2. Grab data
> 5. Choose [Area (acres)]
>    1. Select Table View
>    2. Grab data
>
> Ideally I would like to be able to automate the procedure by feeding the
> parish names (Location) to the scraper from a database table and return the
> collected data to the same table.
>
> In theory all things are possible, but is this a practical exercise to
> attempt with Piggy Bank and Solvent?
>   
I would say it's possible. But if you do it the way you are proposing, 
it will take a long time, particularly if you have lots of parish 
names, and the scraper would be fairly complicated to write.
What you really want is to be able to go from here:
http://www.visionofbritain.org.uk/place/multi_place_page.jsp?st=Stogursey
where you have to disambiguate which place you mean (which would seem 
to require either a human decision or visiting all of them), to 
here:
http://www.visionofbritain.org.uk/place/place_page.jsp?p_id=13246&st=Stogursey
[where you grab the basic info]
Now you either have to choose a parish (which again would seem to 
require a human decision), or go to all of them.
A typical parish link looks like this:
http://www.visionofbritain.org.uk/unit_page.jsp?u_id=10442140
And your population link looks like this:
http://www.visionofbritain.org.uk/data_cube_table_page.jsp?data_theme=T_POP&data_cube=N_TPop&u_id=10442140&c_id=10107260&add=N
and the table for crop area looks like:
http://www.visionofbritain.org.uk/data_cube_table_page.jsp?data_theme=T_LAND&data_cube=N_CROP1801_PAR_AREA&u_id=10442140&c_id=10001043&add=N

The thing to notice in the unit and table links is the u_id parameter
(u_id=10442140).

So what you want is to find out the u_id of each of the parishes you are 
interested in; then you can swap in that u_id value and go straight to 
the pages you need.
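
As a rough sketch of that substitution in Python (just one option; a
Solvent scraper would do the same thing in JavaScript), the two table
URLs above can be rebuilt for any u_id. The names table_url and TABLES
are made up for illustration, and the sketch assumes the c_id values
from the examples stay valid when the u_id changes, which would need
checking against the site.

```python
from urllib.parse import urlencode

BASE = "http://www.visionofbritain.org.uk/data_cube_table_page.jsp"

# Parameters copied from the two example URLs above. ASSUMPTION: c_id
# stays the same for other units; this has not been verified.
TABLES = {
    "population": {"data_theme": "T_POP", "data_cube": "N_TPop",
                   "c_id": "10107260"},
    "crop_area": {"data_theme": "T_LAND", "data_cube": "N_CROP1801_PAR_AREA",
                  "c_id": "10001043"},
}


def table_url(table, u_id):
    """Build a table-page URL for one unit, keeping the example's parameter order."""
    p = TABLES[table]
    query = urlencode([
        ("data_theme", p["data_theme"]),
        ("data_cube", p["data_cube"]),
        ("u_id", u_id),
        ("c_id", p["c_id"]),
        ("add", "N"),
    ])
    return BASE + "?" + query
```

Calling table_url("population", 10442140) gives back the population
link shown above.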

So to recap: get a list of the parish names from your database, output 
in the form of a JavaScript array that you can paste inline into your 
script (e.g. var parishes = 
['Stogursey','Westminster','Hull','Newcastle']).

You can then iterate through the names, constructing URLs like 
http://www.visionofbritain.org.uk/place/multi_place_page.jsp?st=*ParishName*
Go to that address, get the links to all the administrative units, and 
use a regex to pull out each u_id, with which you can then construct 
the URLs of your table pages.
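
The URL-building and regex steps might look something like this in
Python. It is only a sketch: search_url and unit_ids are hypothetical
helper names, and the pattern assumes the unit links in the page
source look like the unit_page.jsp?u_id=... example above, which may
not hold for every page.

```python
import re
from urllib.parse import urlencode

SEARCH = "http://www.visionofbritain.org.uk/place/multi_place_page.jsp"

# ASSUMPTION: unit links appear in the HTML as unit_page.jsp?u_id=<digits>
U_ID_RE = re.compile(r"unit_page\.jsp\?u_id=(\d+)")


def search_url(parish):
    """URL of the place-search page for one parish name."""
    return SEARCH + "?" + urlencode({"st": parish})


def unit_ids(html):
    """Return the distinct u_id values linked from a page, in order of appearance."""
    seen = []
    for u_id in U_ID_RE.findall(html):
        if u_id not in seen:
            seen.append(u_id)
    return seen


# The same list as the JavaScript array above.
parishes = ["Stogursey", "Westminster", "Hull", "Newcastle"]
urls = [search_url(name) for name in parishes]
```

Fetching the pages is left out on purpose; unit_ids just takes whatever
HTML text you hand it.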

It will still take a long time, I think, so you might want to split it 
into several stages. Maybe use Python or something to do just enough 
scraping to gather a list of the URLs of all the table pages you need.
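
One way to stage it, sketched in Python under some assumptions: the
first stage writes the table-page URLs to a text file, and a second
stage works through that file and saves each page to disk. The
function names, the NNNN.html file naming, and the fetch argument (any
callable that takes a URL and returns the page text) are all invented
for this example.

```python
import time
from pathlib import Path


def save_url_list(urls, path):
    """Stage 1 output: one URL per line, duplicates dropped, order kept."""
    unique = list(dict.fromkeys(urls))
    Path(path).write_text("\n".join(unique) + "\n", encoding="utf-8")
    return unique


def download_all(list_path, out_dir, fetch, delay=1.0):
    """Stage 2: fetch every listed URL and save it as NNNN.html in out_dir.

    Pages already on disk are skipped, so an interrupted run can be
    restarted without fetching everything again.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    urls = Path(list_path).read_text(encoding="utf-8").split()
    saved = []
    for n, url in enumerate(urls):
        target = out / f"{n:04d}.html"
        if not target.exists():
            target.write_text(fetch(url), encoding="utf-8")
            time.sleep(delay)  # pause between requests to go easy on the server
        saved.append(target)
    return saved
```

Keeping the URL list in a file means the slow part can be rerun or
resumed without repeating the first stage.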

If you don't want the data in RDF/XML though, it might be simpler to 
scrape with perl/python/php/ruby/something and write directly into 
your preferred data format (or maybe just save all the HTML table pages 
as they are, and use a script or spreadsheet app to convert them to CSV).
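
For the convert-to-CSV route, here is a minimal sketch using only the
Python standard library. It assumes the saved pages contain simple
tables with properly closed tr/td/th tags and no nested tables; I have
not checked the real table pages against it, and the TableRows and
html_to_csv names are just for illustration.

```python
import csv
import io
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None outside a row
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            # Join the fragments and collapse runs of whitespace.
            self._row.append(" ".join("".join(self._cell).split()))
            self._cell = None
        elif tag == "tr" and self._row is not None:
            if self._row:
                self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_to_csv(html):
    """Return all table rows found in the HTML text as CSV."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(parser.rows)
    return buf.getvalue()
```

Every table on the page ends up in the same CSV, so navigation or
layout tables may need filtering out afterwards.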

HTH

Keith

> Thanks
> Jim
> Ottawa, ON. Canada
> http://www.stoneyburn.ca/ 
> Cornwall OPC Antony/Torpoint 
> http://www.secornwallopc.stoneyburn.ca 
> Coordinator Somerset OPC Project
>   don't just visit join us
> http://www.wsom-opc.org.uk/ 
>
>
>
> _______________________________________________
> General mailing list
> [email protected]
> http://simile.mit.edu/mailman/listinfo/general
>
>   

