Re: GSOC Discussion for Benchmarking and/or Asyncio Prototype

Paul Tremberth Mon, 13 Mar 2017 07:58:32 -0700

Hello Aditya,

Sorry about the late feedback.
Our GSoC page has guidelines on 
applications: http://gsoc2017.scrapinghub.com/guidelines/
The PSF page is also worth a read: http://python-gsoc.org/

We do appreciate pull requests to Scrapy itself, even if they do not get 
merged right away.
It helps us get an idea of how students communicate with the community and 
the maintainers (which are sometimes mentors for GSoC too).

I replied in this mailing list about the benchmarking project. You can 
refer to that for my point of view (and it's only mine, I am not a mentor 
for this project).

For AsyncIO, Scrapy has been working very well with Twisted in a 
single-threaded way.
I believe that's one of the key things about asynchronous networking code, 
that you can achieve much higher throughput with a single thread.
Web scraping is very often IO-bound, even for asynchronous code. You wait 
for web pages to get downloaded.
Using multiple threads may not bring much to the table, unless you're doing a 
broad crawl with lots of domains.
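
To illustrate the single-thread point, here is a minimal sketch using only the 
standard library. It is not Scrapy code: fake_download, crawl and timed_crawl 
are hypothetical names, and asyncio.sleep stands in for waiting on the network. 
Ten simulated downloads of 0.2s each should finish in roughly 0.2s rather than 
2s, all on one thread.

```python
import asyncio
import threading
import time


async def fake_download(url, delay):
    # Hypothetical stand-in for a real HTTP request: the coroutine just
    # waits, which yields control to the event loop like network I/O would.
    await asyncio.sleep(delay)
    # Report which OS thread ran this coroutine.
    return url, threading.get_ident()


async def crawl(urls, delay):
    # Start all downloads concurrently and wait for all of them.
    return await asyncio.gather(*(fake_download(u, delay) for u in urls))


def timed_crawl(n=10, delay=0.2):
    # The URLs are placeholders; nothing is actually fetched.
    urls = ["http://example.com/page/%d" % i for i in range(n)]
    start = time.perf_counter()
    results = asyncio.run(crawl(urls, delay))
    elapsed = time.perf_counter() - start
    thread_ids = {tid for _, tid in results}
    return elapsed, thread_ids


if __name__ == "__main__":
    elapsed, thread_ids = timed_crawl()
    print("elapsed: %.2fs, threads used: %d" % (elapsed, len(thread_ids)))
```

The waits overlap because each coroutine gives control back to the event loop 
while it is idle, so adding threads would not shorten the waiting itself.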

I don't think the aim of the GSoC project is to increase performance using 
asyncio.
Twisted is already quite fast, works on Python 2 and Python 3, and has tons 
of protocols.
A Scrapy-like framework using asyncio would mostly be for those who want to 
use the latest syntactic sugar for async code in Python 3.
It may also help simplify some of the core engine, I'm not sure. I don't 
know much about asyncio to be honest.
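
As a rough illustration of the "syntactic sugar" point, the sketch below 
writes the same download-then-parse step twice: once with a callback attached 
to a future, and once with async/await. It uses only asyncio (Twisted's 
Deferreds are not shown), and fake_download, parse and both crawl_* functions 
are hypothetical names, not part of Scrapy.

```python
import asyncio


async def fake_download(url):
    # Hypothetical stand-in for a real HTTP request.
    await asyncio.sleep(0.01)
    return "<html>%s</html>" % url


def parse(body):
    # Hypothetical "spider callback": just return the body length.
    return len(body)


def crawl_with_callbacks(url, results):
    # Callback style: the follow-up step is registered on the task and
    # runs once the download has finished.
    task = asyncio.ensure_future(fake_download(url))
    task.add_done_callback(lambda fut: results.append(parse(fut.result())))
    return task


async def crawl_with_await(url, results):
    # async/await style: the same two steps read top to bottom.
    body = await fake_download(url)
    results.append(parse(body))


async def main(url="http://example.com/"):
    callback_results, await_results = [], []
    await crawl_with_callbacks(url, callback_results)
    await crawl_with_await(url, await_results)
    return callback_results, await_results


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Both versions do the same work; the difference is only in how the control 
flow is written down.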

Hope this helps,
Paul.

On Tuesday, March 7, 2017 at 1:43:05 AM UTC+1, Aditya Agarwal wrote:
>
> Dear mentors,
> I am a 3rd year Computer Science student. I am a versatile programmer, but 
> most proficient in Python (2.7/3.x) and JavaScript.
>
> Related to GSOC Scrapy:
> I have worked extensively on web-scraping (only in python). I have mainly 
> used BeautifulSoup for parsing with various page-request libraries 
> (urllib, urllib2, urllib3, requests).
> I have used Scrapy in my later works, but in a limited fashion.
> I have also worked with Selenium for complete automation and crawling. 
> Some of my scripts can be found on GitHub: vintageplayer 
> <http://www.github.com/vintageplayer>
>
> I am really interested in working on this project. This is my first 
> attempt at GSoC, and I request you to provide some guidance to get through 
> the selection process. Please let me know if there is some extra topic I 
> need to be aware of, or if I need to submit a patch.
>
> I like to confirm things from the very basics without leaving any room for 
> any form of assumption. Thus I might sometimes ask very trivial questions. 
> Also, I am enquiring about two of the ideas. Hope that isn't a problem.
>
> For Benchmarking:
> I have been going through Scrapy's features and understand that Scrapy 
> currently has a simple benchmarking tool which needs to be enhanced. Can 
> you tell me if there is a particular part I should focus on first? Or are 
> there some standards which are a must, or some existing libraries' 
> implementations which provide similar information?
>
> For Asyncio Prototype:
> I have been reading about coroutines. Also, I was wondering why the idea 
> is restricted to single-threaded code and not multi-threaded? I know 
> thread support in Python is limited, but it's still present. Wouldn't it 
> improve performance, if that's the actual aim of the project?
>
>
> Thanks,
> Aditya
>

