Hi Anuj,

On Tuesday, 17 March 2015 at 16:04:37 UTC+5, Anuj Bansal wrote:
>
> Hi,
>
> I'm working towards adding Python 3 support to scrapy. I went through a
> lot of blogs and projects related to adding Python 3 support and found that
> currently twisted is also working towards creating a version of twisted
> that is source-compatible with Python 2.6, Python 2.7, and Python 3.3 [1].
> There are various tools like "2to3" that read Python 2.x source code and
> apply a series of fixers to transform it into valid Python 3.x code.
> Although it is more helpful for those who are porting to Python 3 rather
> than adding support for it.
>
> Currently, I'm working towards a plan on how all this should be carried
> out and how much time each part of scrapy would take. Also I'm reading
> through [2] to see what changes are required.
>
> I also had some questions:
>
> 1. Why don't we completely port scrapy to Python 3 rather than adding
> support for it? Would it be too much for a GSoC Project?
> It would likely result in cleaner code as compared to adding support.
>

Making Scrapy Python3-only is easier than adding Python 3 support while keeping Python 2.7 support. But there are large codebases written in Python 2.x; it is not the time to drop Python 2.x support yet. Maybe we'll be able to drop 2.x support ~5 years later, if all goes well :)

> 2. Is it recommended to use tools like 2to3 to convert the code?
> On twisted page [1] they mention not to use the tool whereas various
> projects and also the website [2] recommend its use.
>

The recommended way is to use the "six" Python module. Some parts of Scrapy are already ported to Python 3 - see e.g. https://travis-ci.org/scrapy/scrapy/jobs/54761340 - 235 tests pass in Python 3.3. To get started, try cloning Scrapy and running some tests using tox (as described in the docs).
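
To give an idea of the single-source style that six enables, here is a minimal sketch. It hand-rolls a few aliases so that it runs without six installed; six itself provides them as six.PY2, six.text_type, six.binary_type and six.moves.urllib.parse. The helper names to_unicode and get_host are hypothetical and exist only for this example.

```python
import sys

# six exposes this flag as six.PY2
PY2 = sys.version_info[0] == 2

if PY2:
    text_type = unicode  # noqa: F821 (six.text_type)
    binary_type = str    # six.binary_type
    from urlparse import urlparse  # six.moves.urllib.parse.urlparse
else:
    text_type = str
    binary_type = bytes
    from urllib.parse import urlparse


def to_unicode(value, encoding="utf-8"):
    """Hypothetical helper: return `value` as text on Python 2 and 3."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, binary_type):
        return value.decode(encoding)
    raise TypeError("expected bytes or text, got %s" % type(value).__name__)


def get_host(url):
    """Hypothetical helper: host part of a URL given as bytes or text."""
    return urlparse(to_unicode(url)).netloc
```

The same file imports and behaves the same way under both interpreters, which is the point of this approach as opposed to a one-way 2to3 conversion.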
You can also check the https://github.com/scrapy/scrapy/blob/master/tests/py3-ignores.txt file - try uncommenting something and running the tests again to see what's not ported. We can't rely only on tests when porting, but they are a good start.

By the way, the project description may be a bit misleading. It can make you feel that the main issue is Twisted, but this is not where the existing porting efforts stopped. Currently we are stopped at porting scrapy.Request, and specifically at deciding how to represent URLs.

There is an existing PR (https://github.com/scrapy/scrapy/pull/837), but I think it took a wrong path (and it seems Daniel agrees). In the PR URLs are considered bytes. It is not entirely unreasonable (in the end, you get bytes from the internet, and you send the URL as bytes when doing HTTP requests, and often they must be the same bytes). The problem is that such URLs are hard to work with in Python 3.x (unwanted unicode promotion from urllib, no .format method on bytes, etc.), and that you get unicode URLs if they are extracted from HTML using scrapy selectors.

Scrapy only sends ASCII-clean URLs (they are escaped using w3lib) because this is what browsers do. There is some value in allowing binary non-escaped URLs though (see e.g. https://github.com/scrapy/scrapy/issues/833) - maybe the "new" URL handling could have a solution for that as well.

So we're thinking of using unicode URLs in Python 3.x. This could require changes to https://github.com/scrapy/w3lib because we made it work on byte URLs (but maybe not). Also, the method w3lib uses to encode URLs to ASCII is incorrect, i.e. it doesn't match what browsers do. Browsers are crazy here - it seems I lost the demo source code, but browsers can use different encodings for different parts of a URL, something like "encode GET argument values using UTF-8, but encode /path using the web page encoding".

This URL encoding thing is where we stopped.
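
A rough Python 3 sketch of what "different encodings for different parts of a URL" could look like is below. It is not how w3lib does it and it is not verified against real browser behaviour; escape_url and its parameters are hypothetical, and which encoding applies to which component is left as an argument precisely because that is the open question. The host is assumed to be ASCII already.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url(url, path_encoding="utf-8", query_encoding="utf-8"):
    """Hypothetical helper: make a text URL ASCII-only.

    The path and the query string are encoded to bytes separately,
    each with its own encoding, and then percent-escaped.
    """
    parts = urlsplit(url)
    # "%" is kept safe so that existing escapes are not escaped twice.
    path = quote(parts.path.encode(path_encoding), safe="/%")
    query = quote(parts.query.encode(query_encoding), safe="=&%")
    # Assumption: netloc and fragment are already ASCII and left untouched.
    return urlunsplit((parts.scheme, parts.netloc, path, query, parts.fragment))


# Path escaped as cp1251 (standing in for the page encoding), query as UTF-8.
escaped = escape_url(
    "http://example.com/\u043f?q=\u043f",
    path_encoding="cp1251",
    query_encoding="utf-8",
)
# The result is text, but it can be turned into wire bytes without loss.
wire_bytes = escaped.encode("ascii")
```

With a text URL in and ASCII-only text out, the same character ends up as different bytes in the path and in the query, which is why a single "encode the whole URL" step cannot reproduce this behaviour.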
Without having a solid solution we can't port scrapy.Request, and without scrapy.Request most other Scrapy components don't work.

>
> It would be really helpful if you could guide me where to start and
> provide some useful links as well.
>
> [1] - http://twistedmatrix.com/trac/wiki/Plan/Python3
> [2] - http://python3porting.com/
>
> Regards,
> Anuj Bansal
> Github - ahhda
>
