On Sat, Feb 4, 2017 at 11:28 AM, Nelson H. F. Beebe <[email protected]> wrote:
> For several years, I have used lynx (and also wget, and rarely, curl)
> to access publisher Web pages for new journal issues. Recently, I
> noticed that a lynx pull of a page from Elsevier ScienceDirect would
> never complete:
>
>     % lynx -source -accept_all_cookies -cookies --trace http://www.sciencedirect.com/science/journal/00978493/62 > foo.62
>
>     parse_arg(arg_name=http://www.sciencedirect.com/science/journal/00978493/62, mask=1, count=5)
>     parse_arg
>     startfile:http://www.sciencedirect.com/science/journal/00978493/62
>     ... no further output, and no job completion ...
>
> Similarly, I find that wget and curl also fail to complete.
>
> This new behavior suggests that the publisher site has thrown up
> http-agent-specific, rather than IP-address-specific, blocks, because
> accessing the same URL in a GUI browser on the SAME machine gets an
> immediate return of the expected journal issue contents.
>
> If I add the --debug option to wget, I find that it reports
>
>     ---request begin---
>     GET /science/journal/00978493/62 HTTP/1.1
>     User-Agent: Wget/1.14 (linux-gnu)
>     Accept: */*
>     Host: www.sciencedirect.com
>     Connection: Keep-Alive
>
>     ---request end---
>
> Thus, it identifies itself as wget, and I assume that lynx probably
> self-identifies as well.
>
> Does anyone on this list have an idea how to circumvent these apparent
> blocks?
Put -useragent="Googlebot" or -useragent="Mozilla" on your command line:

    lynx -useragent="Mozilla" -accept_all_cookies -dump http://www.sciencedirect.com/science/journal/00978493/62

gets me a long list of links in the HTML result.
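For wget and curl, the equivalent override is their user-agent option. A minimal sketch, reusing the ScienceDirect URL and the foo.62 output name from the original post; the "Mozilla/5.0" string here is only an example value, and whether the site accepts any particular agent string is not guaranteed:

    # wget: replace the default "Wget/1.14 (linux-gnu)" User-Agent header
    wget --user-agent="Mozilla/5.0" -O foo.62 \
        http://www.sciencedirect.com/science/journal/00978493/62

    # curl: -A (or --user-agent) sets the User-Agent header; -L follows redirects
    curl -A "Mozilla/5.0" -L -o foo.62 \
        http://www.sciencedirect.com/science/journal/00978493/62

As with the lynx flag, this only changes the request header; a site that also blocks by IP address or requires JavaScript will still not respond to these tools.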
_______________________________________________
Lynx-dev mailing list
[email protected]
https://lists.nongnu.org/mailman/listinfo/lynx-dev