On Friday, August 26, 2011 at 11:04:23 AM UTC+8, Oana Goga wrote:
>
> Hi,
>
> I am trying to use scrapy to access https web pages over a proxy and I
> have some problems getting it to work.
> When I try to fetch/view https://www.paypal.com with scrapy I get a
> 501 error (Not Implemented), but when I fetch the page with wget
> everything works. Here are the steps I am taking:
>
> $ export http_proxy="http://us.proxymesh.com:31280"
> $ export https_proxy="http://us.proxymesh.com:31280"
> $ scrapy view https://www.paypal.com
> 2011-08-25 19:41:43-0700 [scrapy] INFO: Scrapy 0.12.0.2545 started (bot: nice_bot)
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlCanonicalizerMiddleware, UrlLengthMiddleware, DepthMiddleware
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled item pipelines:
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
> 2011-08-25 19:41:43-0700 [default] INFO: Spider opened
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Cookie: None for https://www.paypal.com
> 2011-08-25 19:41:44-0700 [scrapy] INFO: Set-Cookie: [] from https://www.paypal.com
> 2011-08-25 19:41:44-0700 [default] *DEBUG: Crawled (501) <GET https://www.paypal.com>* (referer: None)
> 2011-08-25 19:41:44-0700 [default] INFO: Closing spider (finished)
> 2011-08-25 19:41:48-0700 [default] INFO: Spider closed (finished)
>
> $ wget https://www.paypal.com
> --2011-08-25 19:44:08--  https://www.paypal.com/
> Resolving us.proxymesh.com... 184.106.76.204
> Connecting to us.proxymesh.com|184.106.76.204|:31280... connected.
> Proxy request sent, awaiting response... *200 OK*
> Length: unspecified [text/html]
> Saving to: `index.html'
>
> I have scrapy 0.12.0.2545, twisted 11.0.0 and python 2.7.
>
> After some investigation, it appears that scrapy, instead of issuing a
> CONNECT and then a GET, issues only a GET request, which causes the
> fetch to fail.
>
> Do you have any idea why this happens and how it can be fixed?
>
> Thanks,
> Oana

Does Scrapy work with HTTP proxies?
<http://doc.scrapy.org/en/latest/faq.html?highlight=proxy#does-scrapy-work-with-http-proxies>

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware
<http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware>.
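That FAQ entry only covers plain HTTP, though. The 501 is consistent with Oana's diagnosis: fetching an https:// URL through a proxy requires the client to send a CONNECT request first to open a tunnel, then do the TLS handshake and the real GET inside that tunnel, and the Scrapy 0.12 downloader apparently skips the CONNECT step. Here is a rough standalone sketch (plain sockets, not Scrapy code; the proxy host/port and target are simply the ones from the log above) of the sequence wget performs:

# Rough sketch, not Scrapy code: connect to the proxy, ask for a tunnel
# with CONNECT, then run TLS and the actual GET inside that tunnel.
# Proxy and target values are taken from the log above.
import socket
import ssl

PROXY_HOST, PROXY_PORT = 'us.proxymesh.com', 31280
TARGET_HOST, TARGET_PORT = 'www.paypal.com', 443

sock = socket.create_connection((PROXY_HOST, PROXY_PORT))

# Step 1: ask the proxy to open a tunnel to the target (this is what wget
# does and what the 0.12 downloader appears to skip).
sock.sendall('CONNECT {0}:{1} HTTP/1.1\r\nHost: {0}:{1}\r\n\r\n'
             .format(TARGET_HOST, TARGET_PORT).encode('ascii'))
reply = sock.recv(4096)
# Expect something like "HTTP/1.0 200 Connection established".
assert reply.split()[1] == b'200', reply

# Step 2: TLS handshake *inside* the tunnel. ssl.wrap_socket matches the
# Python 2.7 in the log; newer Pythons would use ssl.create_default_context().
tls = ssl.wrap_socket(sock)

# Step 3: the actual request, now encrypted end to end to the target.
tls.sendall(b'GET / HTTP/1.1\r\nHost: www.paypal.com\r\nConnection: close\r\n\r\n')
print(tls.recv(4096))

As far as I know, later Scrapy releases added CONNECT tunnelling to the HTTPS download path, so upgrading is probably the practical fix rather than trying to work around it in spider code.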