What about the speed?

2015-05-10 4:59 GMT-05:00 Erik van de Ven <[email protected]>:
> Use this for JavaScript, it works perfectly:
> https://github.com/brandicted/scrapy-webdriver
>
> On Wednesday, December 3, 2014 at 16:24:36 UTC+1, Travis Leleu wrote:
>>
>> Hi Adi,
>>
>> I believe scrapy would meet your needs, especially since you have a
>> decentralized queue to feed the URLs into it.
>>
>> 1. If you use the start_requests() method (see more:
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests),
>> you can just consume from the queue to feed URLs into scrapy. You can
>> pop the queue, modify the URL as needed, and yield a request to the
>> scrapy core engine.
>>
>> 2. scrapyd is a convenient way to send jobs around to different
>> systems without having to copy your codebase. It's essentially a
>> deployment tool.
>>
>> Scrapy is quite efficient at web scraping. Scraping is I/O bound, and
>> scrapy uses Twisted, an async HTTP framework. So scrapy fires off a
>> request, then forgets about it until the response comes back through
>> Twisted. In the interim, it can process or fire off other requests.
>>
>> Processing requirements vary, but I would expect you could run
>> hundreds, if not thousands, of concurrent scraping requests on a
>> medium-sized EC2 server.
>>
>> In my experience, the only shortcomings of scrapy are its
>> architectural complexity (it takes some time to master) and its lack
>> of JavaScript support. Many sites are single-page apps that load
>> their content via JS, and scrapy (to my knowledge) can't do anything
>> with that.
>>
>> Hope this helps,
>> Travis
>>
>> On Wed, Dec 3, 2014 at 4:01 AM, <[email protected]> wrote:
>>
>>> Hi,
>>> I am building a back-end, one of whose modules needs to do web
>>> scraping of various sites. The URL originates from an end user, so
>>> the domain is known beforehand, but the full URL is dynamic.
>>>
>>> The back-end is planned to support thousands of requests per second.
>>> I like what I see of scrapy regarding feature coverage,
>>> extensibility, ease of use, and more, but I am concerned about these
>>> two points:
>>>
>>> 1. Passing the URL in real time as an argument to scrapy, where only
>>> the domain (and therefore the specific spider) is known.
>>> 2. I've read that to invoke scrapy via an API one should use scrapyd
>>> with its JSON API, which spawns a process per scraping job. That
>>> means a process runs per request, and this is not scalable (imagine
>>> each request taking 1.5 seconds).
>>>
>>> Please advise,

--
Andres Vargas
www.zodman.com.mx
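Travis's point 1, consuming URLs from a queue inside start_requests(),
can be sketched in code. This is a minimal sketch assuming a Redis list
named "urls" stands in for the decentralized queue; the queue
technology, key name, and connection details are assumptions, not
something the thread specifies:

import redis
import scrapy


class QueueSpider(scrapy.Spider):
    name = "queue_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Hypothetical connection details; adjust for your deployment.
        self.queue = redis.Redis(host="localhost", port=6379)

    def start_requests(self):
        # Scrapy consumes this generator lazily, so URLs are popped only
        # as the engine has capacity for more in-flight requests.
        while True:
            url = self.queue.lpop("urls")
            if url is None:
                break  # queue drained; end this run
            # Modify the URL here as needed before yielding it.
            yield scrapy.Request(url.decode("utf-8"), callback=self.parse)

    def parse(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").get()}

Because start_requests() is a generator, the engine pulls from it on
demand; the spider simply stops when the queue comes up empty.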
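On the scrapyd concern in point 2 of the original question: scrapyd
starts one process per scheduled spider run, not one per URL, so a
single run of a queue-driven spider like the one above can service many
URLs. A run is scheduled through scrapyd's schedule.json endpoint; a
minimal sketch, where the host, project name, and spider name are
placeholders:

import requests

# Ask scrapyd to start one run of the queue-driven spider.
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "queue_spider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}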

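Travis's estimate of hundreds or thousands of concurrent requests is
governed by scrapy's concurrency settings. A sketch of the relevant
knobs in settings.py; the values are illustrative, not recommendations:

# settings.py
CONCURRENT_REQUESTS = 500            # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain politeness cap
REACTOR_THREADPOOL_MAXSIZE = 20      # Twisted thread pool (DNS lookups, etc.)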