Re: Queries regarding adding Python 3 support for scrapy.

Mikhail Korobov Tue, 17 Mar 2015 12:25:08 -0700

Hi Anuj,


On Tuesday, March 17, 2015 at 16:04:37 UTC+5, Anuj Bansal wrote:
>
> Hi ,
>
> I'm working towards adding Python 3 support to scrapy. I went through a 
> lot of blogs and projects related to adding Python 3 support and found that 
> currently twisted is also working towards creating a version of twisted 
> that is source-compatible with Python 2.6, Python 2.7, and Python 3.3 [1]. 
> There are various tools like "2to3" that read Python 2.x source code and 
> apply a series of fixers to transform it into valid Python 3.x code, 
> although they are more helpful for those who are porting to Python 3 than 
> for those adding support for it.
>
> Currently, I'm working towards a plan on how all this should be carried 
> out and how much time each part of scrapy would take. Also I'm reading 
> through [2] to see what all changes are required.
>
> I also had some questions:
>
> 1. Why don't we completely port scrapy to Python 3 rather than adding 
> support for it? Would it be too much for a GSoC project?
> It would likely result in cleaner code compared to adding support.
>
>
Making Scrapy Python 3-only is easier than adding Python 3 support while 
keeping Python 2.7 support. But there are large codebases written in Python 
2.x; it is not yet time to drop Python 2.x support. Maybe we'll be able 
to drop 2.x support in ~5 years, if all goes well :) 


> 2. Is it recommended to use tools like 2to3 to convert the code?
> On twisted page [1] they mention not to use the tool whereas various 
> projects and also the website [2] recommend its use.
>

The recommended way is to use the "six" Python module. Some parts of Scrapy 
are already ported to Python 3 - see e.g. 
https://travis-ci.org/scrapy/scrapy/jobs/54761340 - 235 tests pass in 
Python 3.3. To get started, try cloning Scrapy and running some tests using 
tox (as described in the docs). You can also check the 
https://github.com/scrapy/scrapy/blob/master/tests/py3-ignores.txt file - 
try uncommenting something and running the tests again to see what's not 
ported. We can't rely only on tests when porting, but they are a good start.
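
As a rough illustration of the "six" approach, here is a minimal sketch of 
text/bytes helpers that behave the same on Python 2 and 3. The helper names 
to_unicode and to_bytes are hypothetical, chosen for this example only, and 
the fallback branch is an assumption added so the snippet also runs on 
Python 3 when six is not installed.

```python
try:
    import six
    text_type, binary_type = six.text_type, six.binary_type
except ImportError:
    # Assumption: without six installed, fall back to the Python 3 builtins.
    text_type, binary_type = str, bytes


def to_unicode(value, encoding="utf-8"):
    """Hypothetical helper: return text (unicode) for text or bytes input."""
    if isinstance(value, text_type):
        return value
    if isinstance(value, binary_type):
        return value.decode(encoding)
    raise TypeError("expected text or bytes, got %s" % type(value).__name__)


def to_bytes(value, encoding="utf-8"):
    """Hypothetical helper: return bytes for text or bytes input."""
    if isinstance(value, binary_type):
        return value
    if isinstance(value, text_type):
        return value.encode(encoding)
    raise TypeError("expected text or bytes, got %s" % type(value).__name__)
```

Code that goes through helpers like these at its boundaries does not need to 
branch on the Python version anywhere else.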

By the way, the project description may be a bit misleading. It can make you 
feel that the main issue is Twisted. But this is not where the existing 
porting efforts stopped. Currently we stopped at porting scrapy.Request, 
and specifically at deciding how to represent URLs. There is an existing PR 
(https://github.com/scrapy/scrapy/pull/837), but I think it took the wrong 
path (and it seems Daniel agrees). In that PR, URLs are treated as bytes. 

Treating URLs as bytes is not entirely unreasonable (in the end, you get 
bytes from the internet, and you send the URL as bytes when doing HTTP 
requests, and often they must be the same bytes). The problem is that such 
URLs are hard to work with in Python 3.x (unwanted unicode promotion from 
urllib, no .format method, etc.), and that you get unicode URLs if they are 
extracted from HTML using scrapy selectors. Scrapy only sends ASCII-clean 
URLs (they are escaped using w3lib) because this is what browsers do. There 
is some value in allowing binary non-escaped URLs though (see e.g. 
https://github.com/scrapy/scrapy/issues/833) - maybe "new" URL handling 
could have a solution for that as well.
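
The friction with byte URLs in Python 3 can be seen in a few lines. This is 
a small sketch using only the standard library, not Scrapy or w3lib code; 
the example URL is made up.

```python
from urllib.parse import quote

# A URL as it might arrive off the wire: UTF-8 bytes, not yet escaped.
url_bytes = b"http://example.com/caf\xc3\xa9"

# bytes objects have no .format method in Python 3.
has_format = hasattr(url_bytes, "format")

# quote() accepts bytes but hands back str: the "unicode promotion".
escaped = quote(url_bytes, safe="/:")
promoted_to_str = isinstance(escaped, str)

# The escaped form is plain ASCII, so it converts back to bytes losslessly.
wire_bytes = escaped.encode("ascii")
```

Here has_format is False and escaped is a str, so code that wants to keep 
URLs as bytes has to convert back after every such call.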

So we're thinking of using unicode URLs in Python 3.x. This could require 
changes to https://github.com/scrapy/w3lib because we made it work on byte 
URLs (but maybe not). Also, the method w3lib uses to encode URLs to ASCII 
is incorrect, i.e. it doesn't match what browsers do. Browsers are crazy 
here - it seems I lost the demo source code, but browsers can use different 
encodings for different parts of a URL, something like "encode GET argument 
values using UTF-8, but encode /path using web page encoding". 
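
To make the "different encodings for different parts" idea concrete, here is 
a hedged sketch that escapes a unicode URL to ASCII with a separate encoding 
for the path and for the query. It is not how w3lib works, and the function 
name escape_url and its parameters are hypothetical. Which part uses which 
encoding in real browsers is left to the caller here, since that is exactly 
the open question; the host is assumed to be ASCII already.

```python
from urllib.parse import quote, urlsplit, urlunsplit


def escape_url(url, path_encoding="utf-8", query_encoding="utf-8"):
    """Hypothetical helper: percent-escape a unicode URL into ASCII.

    The path and the query string are encoded independently, so each
    can use a different character encoding before percent-escaping.
    """
    parts = urlsplit(url)
    # Keep "/" and existing "%" escapes intact in the path.
    path = quote(parts.path, safe="/%", encoding=path_encoding)
    # Keep the query delimiters and existing escapes intact.
    query = quote(parts.query, safe="=&%+", encoding=query_encoding)
    return urlunsplit(
        (parts.scheme, parts.netloc, path, query, parts.fragment)
    )


# Example: path encoded as Latin-1 (standing in for a page encoding),
# query values encoded as UTF-8.
example = escape_url(
    "http://example.com/caf\u00e9?q=\u00e9",
    path_encoding="latin-1",
    query_encoding="utf-8",
)
```

With these arguments the same character ends up as %E9 in the path and as 
%C3%A9 in the query, which is why a single "encode the whole URL" step 
cannot reproduce this kind of behaviour.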

This URL encoding thing is where we stopped. Without having a solid 
solution we can't port scrapy.Request, and without scrapy.Request most 
other Scrapy components don't work.
 

>
> It would be really helpful if you could guide me where to start and 
> provide some useful links as well.
>
> [1] - http://twistedmatrix.com/trac/wiki/Plan/Python3
> [2] - http://python3porting.com/
>
> Regards,
> Anuj Bansal
> Github - ahhda
>
