On 29 Sep. 2016 13:43, "Edward Ned Harvey (lopser)" <
lop...@nedharvey.com> wrote:
>
> > From: discuss-boun...@lists.lopsa.org [mailto:discuss-
> > boun...@lists.lopsa.org] On Behalf Of Ski Kacoroski
> >
> > So do any of you have any great ideas, wonderful software, etc that can
> > scrape a website on a regular basis so I could at least have provided
> > the content back to the teacher.  I will need to get the pages
> > (including pages buried behind javascript and ajax buttons and menus)
> > along with attached files).
>
> If some of the relevant content is behind javascript/ajax, etc, then your
> usual crawlers (curl, wget, etc) aren't going to cut it for you...
> You'll probably need a "real" web scraping solution, like using selenium
> and writing your own custom scraping app.
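A minimal sketch of what that selenium approach could look like (this is an assumption-laden illustration, not a tested tool: it assumes the `selenium` package plus a chromedriver install, and the fixed sleep is a crude stand-in for proper waits):

```python
from urllib.parse import urljoin, urlparse

def same_site(base, url):
    """True if url lives on the same host as base (keeps the crawl in scope)."""
    return urlparse(urljoin(base, url)).netloc == urlparse(base).netloc

def scrape(start_url, out_dir="site-archive"):
    # selenium imported lazily so same_site() works without it installed
    from selenium import webdriver
    from selenium.webdriver.common.by import By
    import os, time

    os.makedirs(out_dir, exist_ok=True)
    driver = webdriver.Chrome()          # needs chromedriver on PATH
    seen, queue = set(), [start_url]
    while queue:
        url = queue.pop()
        if url in seen:
            continue
        seen.add(url)
        driver.get(url)
        time.sleep(2)                    # crude wait for ajax content to render
        # page_source now includes the javascript-generated markup
        name = urlparse(url).path.strip("/").replace("/", "_") or "index"
        with open(os.path.join(out_dir, name + ".html"), "w") as f:
            f.write(driver.page_source)
        for a in driver.find_elements(By.TAG_NAME, "a"):
            href = a.get_attribute("href")
            if href and same_site(start_url, href):
                queue.append(href)
    driver.quit()
```

Saving `page_source` after the browser has executed the scripts is the whole point here: you capture what the teacher actually saw, not the bare HTML the server sent.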

After thinking it through, I found this to be a surprisingly perplexing problem.

AFAIK the two options are a (custom) crawler and the CMS backups. Both have
their own drawbacks (thinking long term here; for the relatively short term,
plain backups rule).

The problem with a CMS backup is that you'd need a second instance to
restore the backup to (or risk losing other changes), which can be a
problem when you need to match exact versions/extensions/configuration...

A (custom) webcrawler mostly solves the above problem, as the results are
(probably) ready-to-go HTML. The tricky point here is how to keep the
archive accessible. Some content may change often, while other content
changes perhaps only every couple of years.
Another point is the load crawling creates on the webserver.
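The server-load point is usually handled by throttling; a rough stdlib-only sketch of a polite crawl loop (the delay value is arbitrary, and the `fetch` callable is a made-up injection point standing in for a real urllib wrapper):

```python
import time
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href targets from anchor tags, resolved against a base URL."""
    def __init__(self, base):
        super().__init__()
        self.base = base
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(urljoin(self.base, value))

def extract_links(html, base):
    parser = LinkExtractor(base)
    parser.feed(html)
    return parser.links

def crawl(start_url, fetch, delay=1.0):
    """Breadth-first crawl that sleeps `delay` seconds between fetches.

    `fetch` is any callable returning page HTML for a URL (e.g. a urllib
    wrapper); injecting it keeps this sketch testable without a network.
    """
    seen, queue, pages = set(), [start_url], {}
    while queue:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        pages[url] = html = fetch(url)
        queue.extend(u for u in extract_links(html, url) if u not in seen)
        if queue:
            time.sleep(delay)            # be polite to the webserver
    return pages
```

A fixed delay won't match what wget's `--wait` does exactly, but the idea is the same: the crawl takes longer and the server barely notices it.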

I'm sure both are long-ago solved problems; I just never thought about them
before. We usually "just" create backups and store them according to
policy. With some luck we hardly ever touch them after creation, and even
then only to restore a few documents.

Retention is tricky too...
I found that educational institutions have some surprising schedules. In
business it's not unusual to store backups for a fixed number of
days/weeks/whatever. In education, some vacations (is that the correct
word?) are longer than that...

Kind regards, Guus Snijders
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
 http://lopsa.org/
