Hi Edwin,

It's great that you're considering us for GSoC, and nice to meet you!
Familiarizing yourself with the API is a good start.
I'm listed as a mentor for the Python 3 porting project, so let me try to
answer your question.

Scrapy itself is written in Python, but using extensions that depend on C 
libraries is OK. We use them already - lxml and pyOpenSSL are not 
pure-python, and twisted also has C modules. 

The scraping bottleneck is rarely in the event loop (maybe even never). I
don't think moving away from Twisted can provide big performance benefits,
if any. Also, Twisted is much more than a basic async networking library. It
seems that you can already make Twisted use libuv for its reactor - check
https://github.com/saghul/twisted-pyuv (haven't tried it myself).
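
To put the event-loop point in perspective, here is a rough sketch that
times the pure scheduling cost of an event loop per task. It uses the
standard library's asyncio as a stand-in for the Twisted reactor (an
assumption, chosen only to keep the example self-contained), and
ASSUMED_NETWORK_LATENCY is an illustrative figure, not a measurement of any
real site.

```python
import asyncio
import time

# Illustrative assumption: one HTTP round trip takes about 50 ms.
ASSUMED_NETWORK_LATENCY = 0.05  # seconds


async def noop():
    # Yield control to the event loop once, doing no real work.
    await asyncio.sleep(0)


async def measure_loop_overhead(n_tasks=10_000):
    """Return the average wall-clock cost (seconds) of scheduling one task."""
    start = time.perf_counter()
    await asyncio.gather(*(noop() for _ in range(n_tasks)))
    elapsed = time.perf_counter() - start
    return elapsed / n_tasks


if __name__ == "__main__":
    per_task = asyncio.run(measure_loop_overhead())
    print(f"event-loop overhead per task: {per_task * 1e6:.1f} us")
    print(f"assumed network latency:      {ASSUMED_NETWORK_LATENCY * 1e6:.1f} us")
```

If the per-task overhead is far below the network latency, a faster event
loop alone cannot speed up a crawl by much; parsing and network waits are
the more likely places to look.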

You're right that replacing Twisted with something else (tornado?) could
make porting easier. But it is a very ambitious project because a lot of
Scrapy's internal details depend on Twisted. There is a semi-documented
feature that a Twisted Deferred can be returned from spider and middleware
methods, so Twisted almost made its way into Scrapy's public interface. This
means replacing Twisted with something else is a big change, and it will
inevitably break some user code. It is doable, and it may have some
benefits, but the barrier to entry is high. If you want us to consider this
as a project, you need to provide a detailed plan of how you're going to
implement it (creating such a plan could easily take more than a day of
full-time work), and a good description of the benefits of this project.

I'm 80% sure it is a bad idea to replace Twisted - it looks hard, it will
break code and there are not a lot of visible benefits. But I haven't
checked how hard it is exactly or what exactly it will break (e.g. can
http://www.tornadoweb.org/en/stable/twisted.html help?), and I may be
missing some benefits, so there is the remaining 20% :)

Collaborating with Twisted people looks like a more straightforward way to 
make Scrapy work in Python 3: Scrapy doesn't use all of Twisted, many parts 
of Twisted are already ported, and a subset of Twisted is already usable in 
Python 3.

On Tuesday, March 4, 2014 at 4:35:35 UTC+6, Edwin Marshall wrote:
>
> *Quick Intro*
> My name is Edwin Marshall and I currently only write code in my free time.
> GSOC seems like a good way to both improve an open source project and add 
> applicable experience to my resume so that I can start coding 
> professionally. I don't have any experience with scrapy, but have done some 
> web scraping in the past. As such, I intend on taking the next few days in
> between work and school to familiarize myself with the API before 
> application submissions are open on the 10th. 
>
> *The Real Question*
> I saw that scrapy was interested in porting to python 3 (which I have 
> experience doing on multiple projects), but that might also require porting 
> of some twisted code. I was wondering if you guys (and gals) have 
> considered porting to something pyuv based, such as evergreen? It would be 
> interesting to see what performance improvements (if any) could be gained 
> by using lightweight threads. An alternative might be gruvi or raw pyuv. I 
> was originally thinking about gevent, but last I checked it neither worked 
> on python3 nor reliably worked on windows. If the landscape has changed, 
> that could also be an option.
>
> Sorry if any of this sounds presumptuous. I'm horrible at introductions 
> and in a bit of a rush (have to go to my unfulfilling day job shortly)
>
> [1] Evergreen: https://pypi.python.org/pypi/evergreen
> [2] Gruvi: https://pypi.python.org/pypi/gruvi/0.9.2
> [3] pyuv: https://pypi.python.org/pypi/pyuv
>
