Jim Sheppard wrote:
> 1. Select a location: [Stogursey]
>    1. Grab some basic background
>    2. Choose an Administrative Unit [Stogursey AP/CP]
>    3. Select a Theme [Population]
>    4. Choose [Total Population]
>       1. Select Table View
>       2. Grab data
>    5. Choose [Area (acres)]
>       1. Select Table View
>       2. Grab data
>
> Ideally I would like to be able to automate the procedure by feeding the
> parish names (Location) to the scraper from a database table and return the
> collected data to the same table
>
> In theory all things are possible, but is this a practical exercise to
> attempt with Piggy Bank and Solvent?
>

I would say it's possible. But if you do it the way you are proposing, then it will take a long time, particularly if you have lots of parish names, and would be a fairly complicated scraper to write.

What you really want is to be able to go from here:

http://www.visionofbritain.org.uk/place/multi_place_page.jsp?st=Stogursey

where you have to disambiguate to pick which place you mean (which would seem to require a human decision, or go to all of 'em), and go to here:

http://www.visionofbritain.org.uk/place/place_page.jsp?p_id=13246&st=Stogursey

[where you grab the basic info]

Now you either have to choose a parish (which would seem to require a human decision), or go to all of 'em. A typical parish link looks like

http://www.visionofbritain.org.uk/unit_page.jsp?u_id=10442140

And your population link looks like this:

http://www.visionofbritain.org.uk/data_cube_table_page.jsp?data_theme=T_POP&data_cube=N_TPop&u_id=10442140&c_id=10107260&add=N

and the table for crop area looks like:

http://www.visionofbritain.org.uk/data_cube_table_page.jsp?data_theme=T_LAND&data_cube=N_CROP1801_PAR_AREA&u_id=10442140&c_id=10001043&add=N
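Here is a rough sketch, in Python, of how the URLs above can be taken apart and rebuilt. It does no fetching at all; it only builds the search URL for a name, pulls unit ids out of a chunk of HTML, and rebuilds the two table URLs. The function names and the sample HTML are made up for illustration, the real page markup may well differ, and I am assuming (without having checked) that the c_id values go with the table rather than with the parish.

```python
import re
from urllib.parse import urlencode

BASE = "http://www.visionofbritain.org.uk"

# Parameters copied from the two example table URLs.
# Assumption: c_id identifies the table, not the parish, so it can be
# reused unchanged for other u_id values. This has not been verified.
TABLES = {
    "population": {"data_theme": "T_POP", "data_cube": "N_TPop",
                   "c_id": "10107260"},
    "crop_area": {"data_theme": "T_LAND", "data_cube": "N_CROP1801_PAR_AREA",
                  "c_id": "10001043"},
}

# Matches links of the form unit_page.jsp?u_id=10442140
UNIT_LINK = re.compile(r"unit_page\.jsp\?u_id=(\d+)")


def search_url(parish_name):
    """Build the place-search URL for one parish name."""
    return BASE + "/place/multi_place_page.jsp?" + urlencode({"st": parish_name})


def find_unit_ids(html):
    """Return the distinct u_id values found in html, in order of appearance."""
    seen = []
    for u_id in UNIT_LINK.findall(html):
        if u_id not in seen:
            seen.append(u_id)
    return seen


def table_url(table, u_id):
    """Build a table-page URL for the given table name and unit id."""
    params = TABLES[table]
    query = urlencode([
        ("data_theme", params["data_theme"]),
        ("data_cube", params["data_cube"]),
        ("u_id", u_id),
        ("c_id", params["c_id"]),
        ("add", "N"),
    ])
    return BASE + "/data_cube_table_page.jsp?" + query


if __name__ == "__main__":
    # Hypothetical fragment of a place page; real markup is an assumption.
    sample_html = '<a href="/unit_page.jsp?u_id=10442140">Stogursey AP/CP</a>'
    print(search_url("Stogursey"))
    for u_id in find_unit_ids(sample_html):
        for table in TABLES:
            print(table_url(table, u_id))
```

Keeping the URL building separate from the fetching means you can generate the whole list of table URLs first and download them later at whatever pace suits you.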

The thing to notice in all of them is the u_id parameter (u_id=10442140). So what you want is to find out the u_id of each of the parishes you are interested in; then you can replace the u_id value to go straight to the pages you are interested in.

So to recap: get a list of the parish names from your database, output in the form of a JavaScript array that you can paste inline into your script (e.g. var parishes = ['Stogursey', 'Westminster', 'Hull', 'Newcastle']). You can then iterate through it, constructing URLs like

http://www.visionofbritain.org.uk/place/multi_place_page.jsp?st=ParishName

with ParishName replaced by each name in turn. Go to that address, get the links to all the administrative units, and use a regex to pull out each u_id, with which you can then construct the URLs for your table pages.

It will still take a long time, I think, so you might want to split it into several stages, maybe using Python or something to do enough scraping to gather a list of the URLs of all the table pages you need to visit.

If you don't want the data in RDF/XML, though, it might be simpler to scrape with Perl/Python/PHP/Ruby/something and write directly into your preferred data format (maybe just save all the HTML table pages as they are, and use a script or spreadsheet app to convert them to CSV).

HTH
Keith

> Thanks
> Jim
> Ottawa, ON. Canada
> http://www.stoneyburn.ca/
> Cornwall OPC Antony/Torpoint
> http://www.secornwallopc.stoneyburn.ca
> Coordinator Somerset OPC Project
> don't just visit join us
> http://www.wsom-opc.org.uk/
>
> _______________________________________________
> General mailing list
> [email protected]
> http://simile.mit.edu/mailman/listinfo/general
