Hi Adi,

I believe Scrapy would meet your needs, especially since you already have a decentralized queue to feed URLs into it.
1. If you use the start_requests() method (see more: http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests), you can consume from the queue to feed URLs into Scrapy: pop the queue, modify the URL as needed, and yield a request to the Scrapy core engine.

2. scrapyd is a convenient way to send jobs to different systems without having to copy your codebase. It's essentially a deployment tool.

Scrapy is quite efficient at web scraping. Scraping is I/O bound, and Scrapy is built on Twisted, an asynchronous networking framework. So Scrapy fires off a request, then forgets about it until the response comes back through Twisted; in the interim, it can process or fire off other requests. Processing requirements vary, but I would expect you could sustain hundreds, if not thousands, of concurrent scraping requests on a medium-sized EC2 instance.

In my experience, the only shortcomings of Scrapy are its architectural complexity (it takes some time to master) and its lack of JavaScript support. Many sites are single-page apps that load their content via JS, and Scrapy (to my knowledge) can't do anything with that.

Hope this helps,
Travis

On Wed, Dec 3, 2014 at 4:01 AM, <[email protected]> wrote:
> Hi,
> I am building a back-end in which one module needs to do web scraping
> of various sites. The URL originates from an end user, so the domain is
> known beforehand, but the full URL is dynamic.
>
> The back-end is planned to support thousands of requests per second.
> I like what I see of Scrapy regarding feature coverage, extensibility,
> ease of use, and more, but I am concerned about these two points:
>
> 1. Passing the URL in real time as an argument to Scrapy, where only
> the domain (and therefore the specific spider) is known
> 2. I've read that in order to invoke Scrapy via an API one should use
> scrapyd with its JSON API, which invokes a process per scraping job.
> It means that a process per request runs, and this is not scalable
> (imagine each request takes 1.5 seconds).
>
> Please advise,
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To post to this group, send email to [email protected].
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.

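On the concurrency claim above: Scrapy's defaults are deliberately conservative, so reaching hundreds of concurrent requests means raising a few settings. The values below are illustrative starting points under the assumption of a medium-sized EC2 instance, not recommendations; benchmark on your own hardware.

```python
# settings.py sketch -- illustrative values only, tune for your workload.
CONCURRENT_REQUESTS = 500            # global cap (Scrapy's default is 16)
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # politeness limit per target domain
DOWNLOAD_TIMEOUT = 30                # seconds before a request is abandoned
REACTOR_THREADPOOL_MAXSIZE = 20      # Twisted thread pool, used e.g. for DNS
```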