On Friday, August 26, 2011 at 11:04:23 AM UTC+8, Oana Goga wrote:
>
> Hi,
>
> I am trying to use scrapy to access https web pages over a proxy and I
> have some problems getting it to work.
> When I try to fetch/view https://www.paypal.com with scrapy I get a
> 501 error (Not Implemented), but when I fetch the page with wget
> everything works. Here are the steps I am taking:
>
> $ export http_proxy="http://us.proxymesh.com:31280"
> $ export https_proxy="http://us.proxymesh.com:31280"
> $ scrapy view https://www.paypal.com
> 2011-08-25 19:41:43-0700 [scrapy] INFO: Scrapy 0.12.0.2545 started (bot: nice_bot)
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled extensions: FeedExporter, TelnetConsole, SpiderContext, WebService, CoreStats, MemoryUsage, CloseSpider
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled scheduler middlewares: DuplicatesFilterMiddleware
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled downloader middlewares: HttpProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, RedirectMiddleware, CookiesMiddleware, HttpCompressionMiddleware, DownloaderStats
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlCanonicalizerMiddleware, UrlLengthMiddleware, DepthMiddleware
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Enabled item pipelines:
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Telnet console listening on 0.0.0.0:6023
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Web service listening on 0.0.0.0:6080
> 2011-08-25 19:41:43-0700 [default] INFO: Spider opened
> 2011-08-25 19:41:43-0700 [scrapy] DEBUG: Cookie: None for https://www.paypal.com
> 2011-08-25 19:41:44-0700 [scrapy] INFO: Set-Cookie: [] from https://www.paypal.com
> 2011-08-25 19:41:44-0700 [default] *DEBUG: Crawled (501) <GET https://www.paypal.com>* (referer: None)
> 2011-08-25 19:41:44-0700 [default] INFO: Closing spider (finished)
> 2011-08-25 19:41:48-0700 [default] INFO: Spider closed (finished)
>
> $ wget https://www.paypal.com
> --2011-08-25 19:44:08--  https://www.paypal.com/
> Resolving us.proxymesh.com... 184.106.76.204
> Connecting to us.proxymesh.com|184.106.76.204|:31280... connected.
> Proxy request sent, awaiting response... *200 OK*
> Length: unspecified [text/html]
> Saving to: `index.html'
>
> I have scrapy 0.12.0.2545, twisted 11.0.0 and python 2.7.
>
> After some investigation, it appears that scrapy, instead of issuing a
> CONNECT and then a GET, issues only a GET request, which causes the
> fetch to fail.
>
> Do you have any idea why this happens and how it can be fixed?
>
> Thanks,
> Oana

Does Scrapy work with HTTP proxies?
<http://doc.scrapy.org/en/latest/faq.html?highlight=proxy#does-scrapy-work-with-http-proxies>

Yes. Support for HTTP proxies is provided (since Scrapy 0.8) through the HTTP Proxy downloader middleware. See HttpProxyMiddleware
<http://doc.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware>.
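That FAQ entry only covers plain HTTP, though. The 501 is consistent with Oana's diagnosis: fetching an https:// URL through a proxy requires the client to send a CONNECT request first to open a tunnel, then do the TLS handshake and the real GET inside that tunnel, and the Scrapy 0.12 downloader apparently skips the CONNECT step. Here is a rough standalone sketch (plain sockets, not Scrapy code; the proxy host/port and target are simply the ones from the log above) of the sequence wget performs:

# Rough sketch, not Scrapy code: connect to the proxy, ask for a tunnel
# with CONNECT, then run TLS and the actual GET inside that tunnel.
# Proxy and target values are taken from the log above.
import socket
import ssl

PROXY_HOST, PROXY_PORT = 'us.proxymesh.com', 31280
TARGET_HOST, TARGET_PORT = 'www.paypal.com', 443

sock = socket.create_connection((PROXY_HOST, PROXY_PORT))

# Step 1: ask the proxy to open a tunnel to the target (this is what wget
# does and what the 0.12 downloader appears to skip).
sock.sendall('CONNECT {0}:{1} HTTP/1.1\r\nHost: {0}:{1}\r\n\r\n'
             .format(TARGET_HOST, TARGET_PORT).encode('ascii'))
reply = sock.recv(4096)
# Expect something like "HTTP/1.0 200 Connection established".
assert reply.split()[1] == b'200', reply

# Step 2: TLS handshake *inside* the tunnel. ssl.wrap_socket matches the
# Python 2.7 in the log; newer Pythons would use ssl.create_default_context().
tls = ssl.wrap_socket(sock)

# Step 3: the actual request, now encrypted end to end to the target.
tls.sendall(b'GET / HTTP/1.1\r\nHost: www.paypal.com\r\nConnection: close\r\n\r\n')
print(tls.recv(4096))

As far as I know, later Scrapy releases added CONNECT tunnelling to the HTTPS download path, so upgrading is probably the practical fix rather than trying to work around it in spider code.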