On 2/11/07, Keith Alexander <[EMAIL PROTECTED]> wrote:
> Jim Sheppard wrote:
> > 1. Select a location: [Stogursey]
> >    1. Grab some basic background
> > 2. Choose an Administrative Unit [Stogursey AP/CP]
> > 3. Select a Theme [Population]
> > 4. Choose [Total Population]
> >    1. Select Table View
> >    2. Grab data
> > 5. Choose [Area (acres)]
> >    1. Select Table View
> >    2. Grab data
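Scripted, that recipe is essentially a loop over (theme, table) pairs
for the chosen area. A minimal Python sketch of the fetch-and-save
stage (the URL pattern, host and parameter names below are invented;
substitute whatever the real site actually uses):

import urllib.parse
import urllib.request

# Hypothetical URL pattern: the real site's paths and query
# parameters will differ, so adjust before running.
BASE = "http://stats.example.gov.uk/table?area=%s&theme=%s&table=%s"
AREA = "Stogursey AP/CP"

# The (theme, table) pairs from the recipe above.
TABLES = [
    ("Population", "Total Population"),
    ("Population", "Area (acres)"),
]

for theme, table in TABLES:
    quoted = [urllib.parse.quote_plus(s) for s in (AREA, theme, table)]
    url = BASE % tuple(quoted)
    html = urllib.request.urlopen(url).read()
    # Save the raw table page for the CSV conversion step later on.
    fname = "%s-%s.html" % (theme, table.replace(" ", "_"))
    with open(fname, "wb") as f:
        f.write(html)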
[...]
> It will still take a long time I think, so you might want to split it
> into several stages; maybe use python or something to do enough
> scraping to gather a list of all the urls of table pages you need to
> go to.
>
> If you don't want the data in RDF/XML, though, it might be simpler to
> scrape with perl/python/php/ruby/something, and write directly into
> your preferred data format (maybe just save all the html table pages
> as they are, and use a script or spreadsheet app to convert to CSV).

Should you opt for Ruby, you will most likely get plenty of use out of
this article:

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails-episode1/

Its approach to scraping is in part similar to that of Sifter
(http://simile.mit.edu/wiki/Sifter), in its aided XPath generation, and
in part similar to Chickenfoot
(http://groups.csail.mit.edu/uid/chickenfoot/quickstart.html), in that
it expresses the actions in near-human language, much along the lines
of your recipe above.
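For the table-pages-to-CSV step, a rough sketch using only Python's
standard library; it naively collects every <td>/<th> cell of every
row on each saved page, which you would tune to the real markup:

import csv
import glob
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect the cell texts of every table row on a page."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th") and self.row is not None:
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data.strip())

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append(" ".join(filter(None, self.cell)))
            self.cell = None
        elif tag == "tr" and self.row is not None:
            self.rows.append(self.row)
            self.row = None

# Convert every page saved by the fetch loop into a .csv next to it.
for page in glob.glob("*.html"):
    parser = TableRows()
    with open(page, encoding="utf-8", errors="replace") as f:
        parser.feed(f.read())
    with open(page[:-5] + ".csv", "w", newline="") as out:
        csv.writer(out).writerows(parser.rows)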
--
/ Johan Sundström, http://ecmanaut.blogspot.com/

_______________________________________________
General mailing list
[email protected]
http://simile.mit.edu/mailman/listinfo/general