Fwiw, I played a little with a few approaches, unsuccessfully, though the
problem might yield under a little more pressure. What I eventually ran
into and gave up on was that: a) the structure of their site isn't
consistent; and b) some links have embedded spaces or something like that.
This *might* be 75% of a solution, or it might be a dead end:

  for d in $(curl http://ph-public-data.com/ | grep href | grep document | cut -d\' -f2) ; do
    for f in $(curl http://ph-public-data.com$d | grep href | grep download | cut -d\" -f2) ; do
      echo http://ph-public-data.com/$f
    done
  done
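
If that one-liner is worth pushing further, the embedded-space problem can
be sidestepped by reading links line-by-line instead of relying on the
shell's word splitting in $(...). A rough sketch, untested against their
site (the href quoting and the 'document'/'download' patterns are guesses
carried over from the command above):

```shell
# extract_links PATTERN: pull href values (single- or double-quoted) that
# contain PATTERN from HTML on stdin, one per line, spaces percent-encoded.
extract_links() {
  grep -o "href=[\"'][^\"']*[\"']" |
    sed "s/^href=[\"']//; s/[\"']\$//" |
    grep -- "$1" |
    sed 's/ /%20/g'
}

# demo on a canned snippet with an embedded space and mixed quoting
printf '%s\n' '<a href="/document/a b.csv">a</a> <a href='\''/index.html'\''>i</a>' |
  extract_links document
# prints /document/a%20b.csv
```

Wired up to the site, the first level would be
`curl -s http://ph-public-data.com/ | extract_links document`, with a
`while read -r d` loop over those results feeding the second-level fetch.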

I kind of agree with Keith. Either just ask the site administrator for all
the data in one blob or start clicking. I don't have any better solutions.

-- 
Russell Senior
[email protected]


On Fri, Nov 17, 2023 at 1:44 PM Bill Barry <[email protected]> wrote:

> On Fri, Nov 17, 2023 at 3:17 PM Rich Shepard <[email protected]>
> wrote:
> >
> > On Fri, 17 Nov 2023, Michael Barnes wrote:
> >
> > > I have used this command string successfully in the past to download
> > > complete websites.
> > >
> > > $ wget --recursive --no-clobber --page-requisites --html-extension \
> > >     --convert-links --restrict-file-names=windows \
> > >     --domains website.com --no-parent website.com
> >
> > Michael,
> >
> > This returned only the site's index page:
> > $ wget -r --no-clobber --page-requisites --html-extension --convert-links \
> >     --restrict-file-names=windows --domains http://ph-public-data.com \
> >     --no-parent ph-public-data.com
> > Both --no-clobber and --convert-links were specified, only --convert-links will be used.
> > --2023-11-17 13:16:41--  http://ph-public-data.com/
> > Resolving ph-public-data.com... 138.68.58.192
> > Connecting to ph-public-data.com|138.68.58.192|:80... connected.
> > HTTP request sent, awaiting response... 200 OK
> > Length: 19486 (19K) [text/html]
> > Saving to: ‘ph-public-data.com/index.html’
> >
> > ph-public-data.com/index 100%[==================================>]  19.03K  --.-KB/s    in 0.02s
> >
> > 2023-11-17 13:16:41 (904 KB/s) - ‘ph-public-data.com/index.html’ saved [19486/19486]
> >
> > FINISHED --2023-11-17 13:16:41--
> > Total wall clock time: 0.2s
> > Downloaded: 1 files, 19K in 0.02s (904 KB/s)
> > Converting links in ph-public-data.com/index.html... 0-52
> > Converted links in 1 files in 0.001 seconds.
> >
> > Regards,
> >
> > Rich
>
> Limiting how deep to recurse is helpful. You may want just the page
> you start with and one level down from that.
> --level=depth
> --level=1 would be a good place to start.
>
> Bill
>
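
Combining Bill's depth cap with Michael's flags, something like the sketch
below might be a starting point (untested here). One hedge worth noting:
GNU wget's --domains takes a bare domain name, so the http:// prefix in the
run quoted above may itself be why recursion stopped at the index page.

```shell
# Depth-limited fetch: the start page plus one level of links, same host only.
# Built as a string and printed for review first; run it with: eval "$cmd"
cmd="wget --recursive --level=1 --no-parent --page-requisites \
  --convert-links --domains ph-public-data.com http://ph-public-data.com/"
printf '%s\n' "$cmd"
```

Once a --level=1 pass looks right, bumping the depth (or rerunning with
--no-clobber) is cheaper than re-mirroring from scratch.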
