On 2/11/07, Keith Alexander <[EMAIL PROTECTED]> wrote:
> Jim Sheppard wrote:
> > 1. Select a location: [Stogursey]
> >    1. Grab some basic background
> >    2. Choose an Administrative Unit [Stogursey AP/CP]
> >    3. Select a Theme [Population]
> >    4. Choose [Total Population]
> >       1. Select Table View
> >       2. Grab data
> >    5. Choose [Area (acres)]
> >       1. Select Table View
> >       2. Grab data

[...]

> It will still take a long time, I think, so you might want to split it
> into several stages: maybe use Python or something to do enough scraping
> to gather a list of all the URLs of the table pages you need to visit.
>
> If you don't want the data in RDF/XML, though, it might be simpler to
> scrape with Perl/Python/PHP/Ruby/something and write directly into
> your preferred data format (maybe just save all the HTML table pages as
> they are, and use a script or spreadsheet app to convert them to CSV).
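
To make that concrete, here is a rough Python sketch of the two stages
(the base URL and the link pattern are made-up placeholders, and I'm
assuming the third-party requests and BeautifulSoup libraries; the real
selectors depend on the site's markup):

import csv
import re
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin

BASE = "http://example.gov.uk/statistics"  # placeholder, not the real site

# Stage 1: collect the URLs of the table-view pages from an index page.
index = BeautifulSoup(requests.get(BASE).text, "html.parser")
table_urls = [urljoin(BASE, a["href"])
              for a in index.find_all("a", href=True)
              if re.search(r"tableView", a["href"])]  # hypothetical pattern

# Stage 2: visit each table page and dump its first HTML table to CSV.
for url in table_urls:
    page = BeautifulSoup(requests.get(url).text, "html.parser")
    table = page.find("table")
    if table is None:
        continue
    with open(url.rstrip("/").split("/")[-1] + ".csv", "w", newline="") as out:
        writer = csv.writer(out)
        for row in table.find_all("tr"):
            writer.writerow(cell.get_text(strip=True)
                            for cell in row.find_all(["td", "th"]))

As Keith says, saving the raw HTML pages as well (stage 2 could simply
write each response to disk) keeps a re-parseable copy around if the
CSV conversion needs tweaking later.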

Should you opt for Ruby, you will most likely find plenty of use for
this article:

http://www.rubyrailways.com/data-extraction-for-web-20-screen-scraping-in-rubyrails-episode1/

which takes an approach to scraping that is in part similar to Sifter's
(http://simile.mit.edu/wiki/Sifter) aided XPath generation, and in part
similar to Chickenfoot's
(http://groups.csail.mit.edu/uid/chickenfoot/quickstart.html) way of
expressing the actions in near-human language, along the same lines as
your recipe above.
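
The aided-XPath idea essentially boils down to recording one XPath
expression per field and replaying it against each page. A minimal
sketch of that replay step (in Python with lxml just for illustration;
the XPath expressions and URL below are invented, and the real ones
depend on the page's markup):

import requests
from lxml import html

# Hypothetical XPaths of the kind a tool like Sifter helps you generate
# by example.
FIELDS = {
    "total_population": "//tr[th='Total Population']/td/text()",
    "area_acres": "//tr[th='Area (acres)']/td/text()",
}

def extract(url):
    # Fetch the page, parse it, and pull out the first match per field.
    tree = html.fromstring(requests.get(url).content)
    return {name: (tree.xpath(xp) or [None])[0]
            for name, xp in FIELDS.items()}

print(extract("http://example.gov.uk/statistics/stogursey"))  # placeholder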

-- 
 / Johan Sundström, http://ecmanaut.blogspot.com/
