Hello Aditya,

Sorry about the late feedback.

Our GSoC page has guidelines on applications: http://gsoc2017.scrapinghub.com/guidelines/
The PSF page is also worth a read: http://python-gsoc.org/
We do appreciate pull requests to Scrapy itself, even if they do not get merged right away. They help us get an idea of how students communicate with the community and the maintainers (who are sometimes mentors for GSoC too).

I replied on this mailing list about the benchmarking project. You can refer to that for my point of view (and it's only mine, I am not a mentor for this project).

For asyncio: Scrapy has been working very well with Twisted in a single-threaded way. I believe that's one of the key things about asynchronous networking code: you can achieve much higher throughput with a single thread. Web scraping is very often IO-bound, even for asynchronous code: you wait for web pages to get downloaded. Using multiple threads may not bring much to the game, unless you're doing a broad crawl with lots of domains.

I don't think the aim of the GSoC project is to increase performance using asyncio. Twisted is already quite fast, works on Python 2 and Python 3, and has tons of protocols. A Scrapy-like framework using asyncio would mostly be for those who want to use the latest syntactic sugar for async code in Python 3. It may also help simplify some of the core engine, I'm not sure. I don't know much about asyncio to be honest.

Hope this helps,
Paul.

On Tuesday, March 7, 2017 at 1:43:05 AM UTC+1, Aditya Agarwal wrote:
>
> Dear mentors,
> I am a 3rd year Computer Science student. I am a versatile programmer, but
> most proficient in Python (2.7/3.X) and JavaScript.
>
> Related to GSoC Scrapy:
> I have worked extensively on web scraping (only in Python). I have mainly
> used BeautifulSoup for parsing with various page-request libraries
> (urllib, urllib2, urllib3, requests).
> I have used Scrapy in my later works, but in a limited fashion.
> I have also worked with Selenium for complete automation and crawling.
> Some of my scripts can be found on GitHub: vintageplayer
> <http://www.github.com/vintageplayer>
>
> I am really interested in working for this project. This is my first
> attempt at GSoC and I request you to provide some guidance to get through
> the selection process. Please let me know if there is some extra topic I
> need to be aware of, or if I need to submit a patch.
>
> I like to confirm things from the very basics without leaving room for any
> form of assumption. Thus I might sometimes ask very trivial questions.
> Also, I am enquiring about two of the ideas. Hope that isn't a problem.
>
> For Benchmarking:
> I have been going through Scrapy's features and understand that Scrapy
> currently has a simple benchmarking tool which needs to be enhanced. Can
> you tell me if there is a particular part I should focus on first? Or some
> standards which are a must, or some existing libraries' implementations
> which provide similar information?
>
> For Asyncio Prototype:
> I have been reading about coroutines. Also, I was wondering why the idea
> is restricted to single-threaded code and not multi-threaded? I know
> thread support in Python is limited, but it's still present. Wouldn't it
> improve performance, if that's the actual aim of the project?
>
>
> Thanks,
> Aditya
>
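
To make the single-threaded point from the reply concrete, here is a minimal sketch using only the Python standard library. It is not Scrapy or Twisted code: `fake_download`, `crawl` and `timed_crawl` are made-up names, the URLs are placeholders, and `asyncio.sleep` stands in for the time spent waiting on the network (an assumption, no real requests are made). `asyncio.run` needs Python 3.7 or later.

```python
import asyncio
import time


async def fake_download(url, delay):
    # Stand-in for a network request: while "waiting", the coroutine
    # yields control to the event loop, as real socket I/O would.
    await asyncio.sleep(delay)
    return url


async def crawl(urls, delay):
    # All downloads are scheduled on the same event loop, in one thread.
    # gather() returns the results in the order the coroutines were passed.
    return await asyncio.gather(*(fake_download(u, delay) for u in urls))


def timed_crawl(n_pages=20, delay=0.1):
    # Hypothetical helper: runs the crawl and reports wall-clock time.
    urls = ["http://example.com/page/%d" % i for i in range(n_pages)]
    start = time.perf_counter()
    pages = asyncio.run(crawl(urls, delay))
    elapsed = time.perf_counter() - start
    return pages, elapsed


if __name__ == "__main__":
    pages, elapsed = timed_crawl()
    print("%d pages in %.2fs on a single thread" % (len(pages), elapsed))
```

Because the waits overlap, the total time should stay close to one `delay` rather than `n_pages * delay`, which is why extra threads add little when the work is mostly waiting on I/O.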