Hi Edwin, it's great that you're considering us for GSoC, nice to meet you! Familiarizing yourself with the API is a good start. I'm listed as a mentor for the Python 3 porting project, so let me try to answer your question.
Scrapy itself is written in Python, but using extensions that depend on C libraries is OK. We use them already: lxml and pyOpenSSL are not pure Python, and Twisted also has C modules.

The scraping bottleneck is rarely in the event loop (maybe even never), so I don't think moving away from Twisted would bring big performance benefits, if any. Also, Twisted is much more than a basic async networking library. It seems you can already make Twisted use libuv for its reactor: check https://github.com/saghul/twisted-pyuv (I haven't tried it myself).

You're right that replacing Twisted with something else (Tornado?) could make porting easier. But it is a very ambitious project, because a lot of Scrapy's internals depend on Twisted. There is a semi-documented feature that a Twisted Deferred can be returned from spider and middleware methods, so Twisted has almost made its way into Scrapy's public interface. This means replacing Twisted with something else is a big change, and it will inevitably break some user code. It is doable, and it may have some benefits, but the barrier to entry is high.

If you want us to consider this as a project, you need to provide a detailed plan of how you're going to implement it (creating such a plan could easily take more than a day of full-time work) and a good description of the project's benefits.

I'm 80% sure it is a bad idea to replace Twisted: it looks hard, it will break code, and there are not many visible benefits. But I haven't checked exactly how hard it is or exactly what it will break (e.g. can http://www.tornadoweb.org/en/stable/twisted.html help?), and I may be missing some benefits, so there is the remaining 20% :)

Collaborating with the Twisted people looks like a more straightforward way to make Scrapy work in Python 3: Scrapy doesn't use all of Twisted, many parts of Twisted are already ported, and a subset of Twisted is already usable in Python 3.
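To make the point about Deferreds leaking into the public interface a bit more concrete, here is a rough sketch. It does not use Twisted or Scrapy at all: `ToyDeferred` is a made-up stand-in that only mimics the `addCallback` / `callback` shape of a Twisted Deferred, and `process_request` and `run` are hypothetical names, not Scrapy's real hooks. The only thing it tries to show is that once user code returns such an object, the framework has to keep understanding it.

```python
# Toy stand-in for the callback-chaining shape of a Twisted Deferred.
# NOT Twisted's implementation; it exists only so this sketch runs
# without Twisted installed.
class ToyDeferred:
    def __init__(self):
        self._callbacks = []
        self._fired = False
        self._result = None

    def addCallback(self, fn):
        # Each callback receives the previous callback's return value.
        if self._fired:
            self._result = fn(self._result)
        else:
            self._callbacks.append(fn)
        return self

    def callback(self, result):
        if self._fired:
            raise RuntimeError("already fired")
        self._fired = True
        self._result = result
        for fn in self._callbacks:
            self._result = fn(self._result)
        self._callbacks = []


# Hypothetical user code: a middleware-style hook that returns a
# deferred instead of a finished value.
def process_request(url, pending):
    d = ToyDeferred()
    d.addCallback(lambda body: {"url": url, "length": len(body)})
    pending.append(d)  # something else will fire it later
    return d


# Hypothetical framework side: it must know the returned object is a
# deferred and wait for it through addCallback.
def run(url, body):
    pending = []
    results = []
    returned = process_request(url, pending)
    returned.addCallback(results.append)
    pending[0].callback(body)  # simulate the network response arriving
    return results
```

If the framework switched to a different async library, `run` would have to either keep accepting this kind of object or break every `process_request` written this way, which is the compatibility problem described above.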
On Tuesday, March 4, 2014, 4:35:35 UTC+6, Edwin Marshall wrote:
>
> *Quick Intro*
> My name is Edwin Marshall and I currently only write code in my free time.
> GSOC seems like a good way to both improve an open source project and add
> applicable experience to my resume so that I can start coding
> professionally. I don't have any experience with scrapy, but have done some
> web scraping in the past. As such, I intend on taking the next few days in
> between work and school to familiarize myself with the API before
> application submissions are open on the 10th.
>
> *The Real Question*
> I saw that scrapy was interested in porting to python 3 (which I have
> experience doing on multiple projects), but that might also require porting
> of some twisted code. I was wondering if you guys (and gals) have
> considered porting to something pyuv based, such as evergreen? It would be
> interesting to see what performance improvements (if any) could be gained
> by using lightweight threads. An alternative might be gruvi or raw pyuv. I
> was originally thinking about gevent, but last I checked it neither worked
> on python3 nor reliably worked on windows. If the landscape has changed,
> that could also be an option.
>
> Sorry if any of this sounds presumptuous. I'm horrible at introductions
> and in a bit of a rush (have to go to my unfulfilling day job shortly)
>
> [1] Evergreen: https://pypi.python.org/pypi/evergreen
> [2] Gruvi: https://pypi.python.org/pypi/gruvi/0.9.2
> [3] pyuv: https://pypi.python.org/pypi/pyuv
