On Sat, Feb 4, 2017 at 11:28 AM, Nelson H. F. Beebe <[email protected]> wrote:
> For several years, I have used lynx (and also wget, and rarely, curl)
> to access publisher Web pages for new journal issues.  Recently, I
> noticed that a lynx pull of a page from Elsevier ScienceDirect would
> never complete:
>
>         % lynx -source -accept_all_cookies -cookies --trace http://www.sciencedirect.com/science/journal/00978493/62 > foo.62
>
>         parse_arg(arg_name=http://www.sciencedirect.com/science/journal/00978493/62, mask=1, count=5)
>         parse_arg startfile:http://www.sciencedirect.com/science/journal/00978493/62
>         ... no further output, and no job completion ...
>
> Similarly, I find that wget and curl also fail to complete.
>
> This new behavior suggests that the publisher site has put up
> User-Agent-specific, rather than IP-address-specific, blocks, because
> accessing the same URL in a GUI browser on the SAME machine returns
> the expected journal issue contents immediately.
>
> If I add the --debug option to wget, I find that it reports
>
>         ---request begin---
>         GET /science/journal/00978493/62 HTTP/1.1
>         User-Agent: Wget/1.14 (linux-gnu)
>         Accept: */*
>         Host: www.sciencedirect.com
>         Connection: Keep-Alive
>
>         ---request end---
>
> Thus, it identifies itself as Wget, and I assume that lynx identifies
> itself similarly.
>
> Does anyone on this list have an idea how to circumvent these apparent
> blocks?
>

Put -useragent="Googlebot" or -useragent="Mozilla" on your command line:

lynx -useragent="Mozilla" -accept_all_cookies -dump http://www.sciencedirect.com/science/journal/00978493/62

That gets me a long list of links in the HTML result.
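
The wget and curl cases mentioned above should respond to the same
trick; their equivalent options are --user-agent and -A.  A minimal
sketch, assuming the site accepts a generic "Mozilla" string (exactly
which User-Agent strings it honors is a guess on my part):

    # wget: override the default "Wget/1.14" User-Agent header
    wget --user-agent="Mozilla" -O foo.62 \
         http://www.sciencedirect.com/science/journal/00978493/62

    # curl: -A (or --user-agent) sets the same header
    curl -A "Mozilla" -o foo.62 \
         http://www.sciencedirect.com/science/journal/00978493/62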
