> Wondered whether I should make this NF or not, but seeing how 
> it'll be done from VFP, I figured "yeah, it's on topic."
> 
> There's been talk recently and in the past about 
> screen-scraping web pages.  Does anyone have a "best 
> practice" way of doing this?


I've wrestled with this one, to pull a list of RV sites in the US into
VFP. 

At first it looked like a cakewalk, because the name, address, telephone, and
contact info were arranged vertically on each (long) page, separated by blank
lines. I think it was provided as one state per page.

What I did was, one page/state at a time, copy the page to the clipboard
and then run a VFP process that read the clipboard and parsed its
contents into individual records.
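In outline, the parse was: split the clipboard text into blocks on blank
lines, then map each block's lines to fields. Here's a rough sketch in Python
rather than VFP (the four-field layout and field order are assumptions, not
the actual page format):

```python
# Sketch only: split blank-line-separated text into records.
# The name/address/phone/contact layout is an assumed example.
def parse_sites(text):
    records = []
    for block in text.split("\n\n"):  # naive: exactly one blank line between records
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        records.append({
            "name":    lines[0],
            "address": lines[1] if len(lines) > 1 else "",
            "phone":   lines[2] if len(lines) > 2 else "",
            "contact": lines[3] if len(lines) > 3 else "",
        })
    return records
```

The naive single-blank-line split is exactly where this approach started to
break down, as described below.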

After about 100 problems, I finally got it to work, more or less. The
trouble is that the data on these pages is not necessarily structured the
way it appears to be. In the cases I ran, sometimes there would be one
blank line between records, sometimes several. Fine, fix that; then I
discovered the data had never been validated, so it was incomplete, with
missing or transposed fields, and basically required manual cleaning
afterwards.
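A more defensive split treats any run of blank lines as a single separator
and sets aside records that don't have the expected field count for manual
review. Again a Python sketch, and the assumed four-field layout is
illustrative:

```python
import re

EXPECTED_FIELDS = 4  # name, address, phone, contact -- assumed layout

def split_records(text):
    """Split on runs of one or more blank lines; separate well-formed
    blocks from those that need manual cleanup."""
    good, suspect = [], []
    for block in re.split(r"\n\s*\n", text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if not lines:
            continue
        (good if len(lines) == EXPECTED_FIELDS else suspect).append(lines)
    return good, suspect
```

Flagging the suspect blocks at least narrows the manual cleaning down to
the records that actually need it.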

And this was a case where the data appeared to be structured and
amenable to screen scraping. 

I suppose the flip side is where the data IS properly structured,
formatted, and validated, perhaps by an organization that intends to
distribute data this way -- but then you'd think they would support some
other way to get it than screen scraping.


Bill


 
> tia,
> --Michael
> 



_______________________________________________
Post Messages to: [email protected]
Subscription Maintenance: http://leafe.com/mailman/listinfo/profox
OT-free version of this list: http://leafe.com/mailman/listinfo/profoxtech
** All postings, unless explicitly stated otherwise, are the opinions of the 
author, and do not constitute legal or medical advice. This statement is added 
to the messages for those lawyers who are too stupid to see the obvious.
