Fwiw, I played a little bit with some approaches, unsuccessfully, though the problem might yield under a little more pressure. What I eventually ran into and gave up on was that: a) the structure of their site isn't consistent; and b) there are links with embedded spaces or something. This *might* be 75% of a solution or it might be a dead end:
for d in $(curl http://ph-public-data.com/ | grep href | grep document | cut -d\' -f2) ; do
  for f in $(curl http://ph-public-data.com$d | grep href | grep download | cut -d\" -f2) ; do
    echo http://ph-public-data.com/$f
  done
done

I kind of agree with Keith. Either just ask the site administrator for all
the data in one blob or start clicking. I don't have any better solutions.

--
Russell Senior
[email protected]

On Fri, Nov 17, 2023 at 1:44 PM Bill Barry <[email protected]> wrote:
> On Fri, Nov 17, 2023 at 3:17 PM Rich Shepard <[email protected]> wrote:
> >
> > On Fri, 17 Nov 2023, Michael Barnes wrote:
> >
> > > I have used this command string successfully in the past to download
> > > complete websites.
> > >
> > > $ wget --recursive --no-clobber --page-requisites --html-extension
> > >   --convert-links --restrict-file-names=windows --domains website.com
> > >   --no-parent website.com
> >
> > Michael,
> >
> > This returned only the site's index page:
> >
> > $ wget -r --no-clobber --page-requisites --html-extension --convert-links
> >   --restrict-file-names=windows --domains http://ph-public-data.com
> >   --no-parent ph-public-data.com
> > Both --no-clobber and --convert-links were specified, only --convert-links will be used.
> > --2023-11-17 13:16:41--  http://ph-public-data.com/
> > Resolving ph-public-data.com... 138.68.58.192
> > Connecting to ph-public-data.com|138.68.58.192|:80... connected.
> > HTTP request sent, awaiting response... 200 OK
> > Length: 19486 (19K) [text/html]
> > Saving to: ‘ph-public-data.com/index.html’
> >
> > ph-public-data.com/index 100%[==================================>]  19.03K  --.-KB/s  in 0.02s
> >
> > 2023-11-17 13:16:41 (904 KB/s) - ‘ph-public-data.com/index.html’ saved [19486/19486]
> >
> > FINISHED --2023-11-17 13:16:41--
> > Total wall clock time: 0.2s
> > Downloaded: 1 files, 19K in 0.02s (904 KB/s)
> > Converting links in ph-public-data.com/index.html... 0-52
> > Converted links in 1 files in 0.001 seconds.
> >
> > Regards,
> >
> > Rich
>
> Limiting how deep to recurse is helpful. You may want just the page
> you start with and one level down from that.
>
>   --level=depth
>
> --level=1 would be a good place to start.
>
> Bill
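
P.S. If anyone wants to push on this, I suspect the embedded-space trouble is
just word splitting: for x in $(...) breaks on whitespace, so an href with a
space in it turns into two bogus URLs. Something along these lines might fare
better; it's untested and assumes the same quoting conventions my cut -d\' and
cut -d\" guesses rely on:

  #!/bin/sh
  # Untested sketch: same idea as the one-liner above, but read the hrefs
  # line by line so embedded spaces survive, then percent-encode them.
  base=http://ph-public-data.com
  curl -s "$base/" | grep href | grep document | cut -d\' -f2 |
  while IFS= read -r d ; do
      curl -s "$base$d" | grep href | grep download | cut -d\" -f2 |
      while IFS= read -r f ; do
          printf '%s\n' "$base/$f" | sed 's/ /%20/g'
      done
  done

The output could then be saved to a file (or piped straight to wget -i -) to
do the actual downloading.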
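
P.P.S. On the wget side, one difference from Michael's working example jumps
out: his --domains got a bare domain (website.com), while the retry passed a
full URL (http://ph-public-data.com), and --domains expects domain names only.
I can't say whether that alone explains stopping at index.html, but a retry
folding in Bill's depth suggestion might look like:

  # Untested: bare domain for --domains, recursion depth capped per Bill
  wget --recursive --level=1 --page-requisites --html-extension \
       --convert-links --restrict-file-names=windows \
       --domains ph-public-data.com --no-parent \
       http://ph-public-data.com/

If the document files all share a few extensions, --accept could narrow it
further, but I don't know what's actually on the site.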
