Ángel,

The Sharepoint view offers a "next 100" page button. Upon further reflection - as always, soon after posting - it became apparent that it was a pretty tall order to expect wget to discern such a thing in the HTML it received from the site. So of course it was only ever going to be able to see what the HTML linked to and no more.
The site _does_ offer a so-called "Explorer View" which does indeed show _all_ the directories/files in the traditional scrolled rather than paged view, but when I fed the URL displayed by IE (which had the form http://server/top directory/Forms/WebFldr.aspx?RootFolder=top_directory) to wget, all I got was a mess of HTML/JS files. Oh well.

<< You could make a similar mount in the Unix server (if it's eg. available through smb) >>

Alas, HP CIFS knows nought about the wonders of Sharepoint; it only deals with Windows shares. In the end I just used my desktop to trawl the site for the filenames (dir /s /b "\\server\top directory\*.pdf") and, with a bit of massaging, presented that file list to wget with no directory tree walking (a rough sketch of that massaging step appears after the quoted message below). It was all a pretty tacky kludge, but it got the job done in the end. Thanks anyway.

Rocket J. Squirrel: "... we're going to have to think!"
Bullwinkle J. Moose: "There must be an easier way than that."

HOWARD BRYDEN
Senior Unix Administrator
Data Centre
Information and Communication Systems
Corporate Support Division
Department of Community Safety
PHONE: 07 3635 3087
POSTAL: GPO Box 1425, Brisbane, QLD 4001 | EMAIL: [email protected]
Please consider the environment before printing this email - then print it

-----Original Message-----
From: Ángel González [mailto:[email protected]]
Sent: Sunday, 29 April 2012 12:02 AM
To: Howard Bryden
Cc: bug-wget
Subject: Re: [Bug-wget] wget

On 27/04/12 06:25, Howard Bryden wrote:
> Folks,
>
> I'm using wget 1.13.4 to attempt to recursively download a Sharepoint site.
> The commandline is just the wget command verb; the contents of ~/.wgetrc are:
>
>
>
> Initially all appeared to work as expected yet it turns out I'm
> receiving only a subset of the filespace, namely
>
> a) only the first 100 directories are visited, and
> b) only the first 100 files from each directory are actually downloaded.
>
> This pretty much corresponds to the Internet Explorer view, which presents
> the site in pages of 100 items (directories and files within directories).

How are the next pages accessed?
Can you view those "next pages" if you disable javascript in your browser?
(wget doesn't parse javascript)
I think the problem lies in the way those next pages are linked, so such a
page would be more helpful than the full list of files.

Also, if you can view the full site as mounted on the computer, do you
really need to crawl it with wget? You could make a similar mount in the
Unix server (if it's eg. available through smb) or simply zip everything
locally and transfer that to the HP server.
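For anyone wanting to reproduce the kludge described above, a rough sketch of the "massaging" step might look like the following. It assumes the document library is also reachable over plain HTTP at http://server/top directory/ and that filelist.txt holds the output of the dir /s /b command; both names are placeholders, not the actual paths from this thread.

    # Turn the UNC paths from "dir /s /b" into HTTP URLs wget can fetch:
    # strip DOS line endings, flip backslashes to forward slashes,
    # prepend the scheme, and percent-encode the spaces that SharePoint
    # folder names tend to contain.
    sed -e 's/\r$//' \
        -e 's|\\|/|g' \
        -e 's|^//server|http://server|' \
        -e 's| |%20|g' filelist.txt > urls.txt

    # Fetch the whole list without recursing and without recreating
    # the remote directory tree locally.
    wget --no-directories --input-file=urls.txt

Whether this works as-is depends on the SharePoint site serving the PDFs directly over HTTP; if it demands credentials, wget's --user and --password options would also be needed.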
