Hi Alex,

This is doable. How to proceed will depend on how authentication is performed.
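In wget terms, the options boil down to something like this. This is an untested sketch: the URLs, form-field names, and credentials are all placeholders you'd replace with the site's real ones (look at the login form's HTML for the actual field names):

```shell
# All URLs, field names, and credentials below are placeholders.

# 1) Old-style HTTP Basic authentication:
wget --user=alex --password=secret https://example.com/protected/page.html

# 2) Form-based login: POST the form once, saving the session cookie...
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data 'username=alex&password=secret' \
     -O /dev/null https://example.com/login

# ...then crawl the protected pages with that cookie:
wget --load-cookies cookies.txt --mirror https://example.com/members/
```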
Most sites used to have simple authentication over HTTP. You could just do http://user:p...@example.com, and wget has flags for login/pass that do this.

However, these days many sites don't work that way. A common pattern is logging in through a form, after which a cookie is saved identifying you as a particular logged-in user. You then run your offline browser with that cookie. On some sites you can even do this manually: log in with your browser, get the new cookie, and give it to your crawler tool.

Some tools let you submit a form (POST the filled-out values); wget and curl both do. I don't know the dedicated offline browsers well enough to recommend one there. This Stack Overflow thread looks like a good example for wget:
https://stackoverflow.com/questions/4272770/wget-with-authentication

In practice this may be easier said than done. For instance, you might need to have the scraper identify itself as the browser you saved the cookie with. Redirects can be a pain too.

Let us know how it goes.

Regards,
Tom

On Fri, Jan 13, 2017 at 9:12 AM, Alex Armstrong <armstr...@amicalnet.org> wrote:
> Interesting project! But not what I had in mind. I'm looking to archive the
> actual pages, so I can refer to them (and possibly extract information from
> them).
>
> Alex
>
> On 13 January 2017 at 15:25:43, Schmitz Fuhrig, Lynda (schmitzfuhr...@si.edu) wrote:
>
> Check out https://webrecorder.io/
>
> Lynda Schmitz Fuhrig
> Electronic Records Archivist
> Digital Services Division
> Smithsonian Institution Archives
> siarchives.si.edu | @SmithsonianArch
>
> On 1/13/17, 2:43 AM, "Code for Libraries on behalf of Alex Armstrong"
> <CODE4LIB@LISTS.CLIR.ORG on behalf of armstr...@amicalnet.org> wrote:
>
>> Has anyone had to archive selected pages from a login-protected site? How
>> did you do it?
>>
>> I've used the CLI tool httrack in the past for archiving sites. But in this
>> case, accessing the pages requires logging in. There's some vague
>> documentation about how to do this with httrack, but I haven't cracked it
>> yet. (The instructions are better for the Windows version of the
>> application, but I only have ready access to a Mac.)
>>
>> Before I go on a wild goose chase, any help would be much appreciated.
>>
>> Alex
>>
>> --
>> Alex Armstrong
>> Web Developer & Digital Strategist, AMICAL Consortium
>> armstr...@amicalnet.org
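P.S. If you end up reaching for curl instead, the same cookie dance looks roughly like this. Again an untested sketch with placeholder URLs, field names, and cookie values; the browser-cookie variant assumes you've copied the session cookie's name and value out of your browser's developer tools:

```shell
# All URLs, field names, and cookie values below are placeholders.

# Log in via the form once, storing the session cookie:
curl -c cookies.txt -d 'username=alex&password=secret' https://example.com/login

# Fetch a protected page with the saved cookies, following redirects
# and identifying as a regular browser:
curl -b cookies.txt -L -A 'Mozilla/5.0' https://example.com/members/page.html

# Or skip the form entirely: copy the session cookie out of your browser
# and hand it to wget directly:
wget --header 'Cookie: sessionid=PASTE_VALUE_HERE' \
     --user-agent 'Mozilla/5.0' https://example.com/members/page.html
```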