Re: Ideas for GSOC '17

Paul Tremberth Fri, 03 Mar 2017 08:47:08 -0800

Hello Parth,

Sorry we did not reply to your first message in February.
It's great that you're interested in participating in GSoC with a Scrapy
project!


For the "Scrapy benchmarking suite" idea, you may want to get in touch with
Daniel and Mikhail, who are listed as potential mentors for the project.

A few pointers in the meantime:
Scrapy currently has a `scrapy bench` command that tries to fetch pages at
maximum speed:
https://docs.scrapy.org/en/latest/topics/benchmarking.html#benchmarking
You can check how that is implemented, and what it does and does not do.
It's quite naive and may not represent a realistic use case with large or
broken HTML files, or broad crawls with lots of domains visited.

Scrapy commands also have an (undocumented?) --profile option to write
cProfile stats.
You can try it out to see what you can get out of it.
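
As a rough sketch of what you could do with such a stats file: the standard
library's pstats module can load cProfile output and list the most expensive
calls. The file path, the `busy_work` stand-in workload and the helper names
below are made up for illustration; with Scrapy you would point
`top_functions` at whatever file --profile wrote instead.

```python
import cProfile
import io
import os
import pstats
import tempfile


def busy_work(n=20000):
    """Hypothetical stand-in workload; a real run would be a Scrapy crawl."""
    return sum(i * i for i in range(n))


def make_sample_stats(path):
    """Profile busy_work() and write the cProfile stats to `path`."""
    profiler = cProfile.Profile()
    profiler.enable()
    busy_work()
    profiler.disable()
    profiler.dump_stats(path)


def top_functions(stats_path, limit=10):
    """Return a text report of the `limit` costliest calls by cumulative time."""
    out = io.StringIO()
    stats = pstats.Stats(stats_path, stream=out)
    stats.strip_dirs().sort_stats("cumulative").print_stats(limit)
    return out.getvalue()


if __name__ == "__main__":
    # Assumption: a throwaway file in the temp directory stands in for
    # the stats file written by a Scrapy command.
    path = os.path.join(tempfile.gettempdir(), "sample.prof")
    make_sample_stats(path)
    print(top_functions(path))
```

Sorting by cumulative time is just one choice; "tottime" is another sort key
that pstats accepts and that highlights time spent inside each function
itself.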

There are (at least) a couple of issues about potential memory leaks:
- https://github.com/scrapy/scrapy/issues/482
- https://github.com/scrapy/scrapy/issues/482

Another question: maybe Python 2 and Python 3 show differences in terms of
CPU and memory usage?
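
One possible way to start comparing the two, as a minimal sketch: the
standard library's resource module (Unix only) reports CPU time and peak
memory for the current process, and it exists in both Python 2 and Python 3.
The `measure` helper and `sample_workload` are hypothetical names; a real
comparison would run an actual crawl under each interpreter.

```python
import resource
import sys


def measure(func, *args, **kwargs):
    """Run func and report the CPU time it used and the process's peak RSS."""
    before = resource.getrusage(resource.RUSAGE_SELF)
    result = func(*args, **kwargs)
    after = resource.getrusage(resource.RUSAGE_SELF)
    # User + system CPU time consumed while func was running.
    cpu = (after.ru_utime + after.ru_stime) - (before.ru_utime + before.ru_stime)
    return {
        "python": "%d.%d" % sys.version_info[:2],
        "cpu_seconds": cpu,
        # Peak resident set size of the whole process so far
        # (kilobytes on Linux, bytes on macOS).
        "max_rss": after.ru_maxrss,
        "result": result,
    }


def sample_workload(n=200000):
    """Hypothetical stand-in for a crawl: build many small strings."""
    return len([str(i) for i in range(n)])


if __name__ == "__main__":
    print(measure(sample_workload))
```

Running the same script under each interpreter gives numbers that can be
put side by side. Note that max_rss is a process-wide peak, so it is best
measured in a fresh process per run.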

I would assume a successful GSoC project would allow investigating such
issues and finding the root causes (if not fixing them).

Hope this helps,
Paul.


On Fri, Mar 3, 2017 at 11:51 AM, Parth Verma <vermapart...@gmail.com> wrote:

> Hi,
>
> I'm interested in "Scrapy benchmarking suite" idea in the ideas list for
> GSoC '17.
> Please let me know what the prerequisites are.
>
> Thanks.
>
>
> On Saturday, 11 February 2017 21:20:19 UTC+5:30, Parth Verma wrote:
>>
>> Hi,
>>
>> I am Parth Verma, a second year undergraduate pursuing MSc. in
>> Mathematics and Computing at IIT Kharagpur, India.
>> I have been doing open-source programming for a year. My github profile
>> is https://github.com/Parth-Vader.
>> My programming knowledge includes Python (Intermediate), C
>> (Intermediate), C++ (Intermediate), HTML/CSS (basic) and Bash. I use Ubuntu
>> 16.04 as my main operating system and Windows 8 for gaming.
>> I have been doing Data Analytics, and for that, I need to collect data
>> from various online sources and that's why I used Scrapy.
>>
>> I am interested in Scrapy benchmarking suite, since I have prior
>> knowledge of various algorithms and I want to learn memory management in
>> CPUs. What should be my next steps?
>>
>> Furthermore, I would like to suggest an idea.
>>
>> A new section in the official documentation could be added where people
>> could share the configuration files they used to successfully scrape
>> data from a specific website (by successful, I mean not getting banned and
>> getting a good speed). This way, I believe, it would be easier for people
>> without any prior knowledge of HTML, Python or Shell to use
>> Scrapy to get data from those specific sites.
>> In addition, we could create benchmarks for those sites as well.
>>
>> Thanks.
>>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to scrapy-users+unsubscr...@googlegroups.com.
> To post to this group, send email to scrapy-users@googlegroups.com.
> Visit this group at https://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
>
