Hi Alex,

This is doable. How to proceed will depend on how authentication is performed.
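In wget terms, the options boil down to something like this. This is an untested sketch: the URLs, form-field names, and credentials are all placeholders you'd replace with the site's real ones (look at the login form's HTML for the actual field names):

```shell
# All URLs, field names, and credentials below are placeholders.

# 1) Old-style HTTP Basic authentication:
wget --user=alex --password=secret https://example.com/protected/page.html

# 2) Form-based login: POST the form once, saving the session cookie...
wget --save-cookies cookies.txt --keep-session-cookies \
     --post-data 'username=alex&password=secret' \
     -O /dev/null https://example.com/login

# ...then crawl the protected pages with that cookie:
wget --load-cookies cookies.txt --mirror https://example.com/members/
```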
Most sites used to have simple authentication over HTTP. You could just do http://user:p...@example.com, and wget has flags for login/pass that do this.

However, these days many sites don't work that way. A common pattern is logging in through a form, after which a cookie is saved identifying you as a particular logged-in user. You then run your offline browser with that cookie. On some sites you can even do this manually: log in with your browser, get the new cookie, and give it to your crawler tool.

Some tools let you submit a form (POST the filled-out values); wget and curl both do. I don't know the dedicated offline browsers well enough to recommend one there. This Stack Overflow thread looks like a good example for wget:
https://stackoverflow.com/questions/4272770/wget-with-authentication

In practice this may be easier said than done. For instance, you might need to have the scraper identify itself as the browser you saved the cookie with. Redirects can be a pain too.

Let us know how it goes.

Regards,
Tom

On Fri, Jan 13, 2017 at 9:12 AM, Alex Armstrong <armstr...@amicalnet.org> wrote:
> Interesting project! But not what I had in mind. I'm looking to archive the
> actual pages, so I can refer to them (and possibly extract information from
> them).
>
> Alex
>
> On 13 January 2017 at 15:25:43, Schmitz Fuhrig, Lynda (schmitzfuhr...@si.edu) wrote:
>
> Check out https://webrecorder.io/
>
> Lynda Schmitz Fuhrig
> Electronic Records Archivist
> Digital Services Division
> Smithsonian Institution Archives
> siarchives.si.edu | @SmithsonianArch
>
> On 1/13/17, 2:43 AM, "Code for Libraries on behalf of Alex Armstrong"
> <CODE4LIB@LISTS.CLIR.ORG on behalf of armstr...@amicalnet.org> wrote:
>
>> Has anyone had to archive selected pages from a login-protected site? How
>> did you do it?
>>
>> I've used the CLI tool httrack in the past for archiving sites. But in this
>> case, accessing the pages requires logging in. There's some vague
>> documentation about how to do this with httrack, but I haven't cracked it
>> yet. (The instructions are better for the Windows version of the
>> application, but I only have ready access to a Mac.)
>>
>> Before I go on a wild goose chase, any help would be much appreciated.
>>
>> Alex
>>
>> --
>> Alex Armstrong
>> Web Developer & Digital Strategist, AMICAL Consortium
>> armstr...@amicalnet.org
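P.S. If you end up reaching for curl instead, the same cookie dance looks roughly like this. Again an untested sketch with placeholder URLs, field names, and cookie values; the browser-cookie variant assumes you've copied the session cookie's name and value out of your browser's developer tools:

```shell
# All URLs, field names, and cookie values below are placeholders.

# Log in via the form once, storing the session cookie:
curl -c cookies.txt -d 'username=alex&password=secret' https://example.com/login

# Fetch a protected page with the saved cookies, following redirects
# and identifying as a regular browser:
curl -b cookies.txt -L -A 'Mozilla/5.0' https://example.com/members/page.html

# Or skip the form entirely: copy the session cookie out of your browser
# and hand it to wget directly:
wget --header 'Cookie: sessionid=PASTE_VALUE_HERE' \
     --user-agent 'Mozilla/5.0' https://example.com/members/page.html
```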