On 29 Sep 2016 13:43, "Edward Ned Harvey (lopser)" <lop...@nedharvey.com> wrote:
>
> > From: discuss-boun...@lists.lopsa.org [mailto:discuss-
> > boun...@lists.lopsa.org] On Behalf Of Ski Kacoroski
> >
> > So do any of you have any great ideas, wonderful software, etc. that can
> > scrape a website on a regular basis, so I could at least provide the
> > content back to the teacher? I will need to get the pages (including
> > pages buried behind JavaScript and AJAX buttons and menus), along with
> > attached files.
>
> If some of the relevant content is behind JavaScript/AJAX, etc., then the usual crawlers (curl, wget, etc.) aren't going to cut it for you. You'll probably need a "real" web scraping solution, like using Selenium and writing your own custom scraping app.
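A minimal sketch of that Selenium approach, for the archives. Everything here is illustrative rather than from the thread: the helper name, the selector, and the example URL are all placeholders, and it assumes the `selenium` package plus a matching browser driver (e.g. geckodriver) are installed.

```python
# Hypothetical sketch: let a real browser render the page (so JavaScript
# and AJAX run), then harvest links from the rendered DOM. Names and the
# selector are illustrative, not from the thread.
from urllib.parse import urljoin

def collect_links(driver, url):
    """Load `url` in the browser and return the absolute hrefs of all
    <a> tags present after the page has rendered."""
    driver.get(url)
    hrefs = []
    # "css selector" is the locator strategy string Selenium's
    # By.CSS_SELECTOR expands to.
    for a in driver.find_elements("css selector", "a[href]"):
        hrefs.append(urljoin(url, a.get_attribute("href")))
    return hrefs

# Real usage would look roughly like:
#   from selenium import webdriver
#   driver = webdriver.Firefox()          # needs geckodriver on PATH
#   try:
#       for link in collect_links(driver, "https://example.org/"):
#           print(link)
#   finally:
#       driver.quit()
```

From there a crawler could save `driver.page_source` for each visited page and fetch attachment URLs separately; the queue/dedup logic is the custom part Edward refers to.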
After thinking it through, I found this a surprisingly perplexing problem. AFAIK the two options are a (custom) crawler or the CMS's own backups, and both have their drawbacks (thinking long term here; for the relatively short term, plain backups rule).

The problem with a CMS backup is that you'd need a second instance to restore the backup to (or risk losing other changes), which can be a problem with matching the exact versions/extensions/configuration.

A (custom) webcrawler mostly avoids that problem, as the results are (probably) ready-to-go HTML. The tricky point there is how to keep the output accessible: some content may change often, while other content changes perhaps once every couple of years. Another point is the load crawling puts on the webserver.

I'm sure both are long-solved problems; I just never thought about them before. We usually "just" create backups and store them according to policy. With some luck we hardly ever touch them after creation, and even then only to restore a few documents.

Retention is tricky too; I found that educational institutions have some surprising schedules. In business it's not unusual to store backups for a fixed number of days/weeks/whatever. In education, some vacations (is that the correct word?) are longer than that...

Kind regards,
Guus Snijders
_______________________________________________
Discuss mailing list
Discuss@lists.lopsa.org
https://lists.lopsa.org/cgi-bin/mailman/listinfo/discuss
This list provided by the League of Professional System Administrators
http://lopsa.org/