What about the speed?

2015-05-10 4:59 GMT-05:00 Erik van de Ven <[email protected]>:
> Use this for JavaScript, it works perfectly:
> https://github.com/brandicted/scrapy-webdriver
>
> On Wednesday, December 3, 2014 at 16:24:36 UTC+1, Travis Leleu wrote:
>>
>> Hi Adi,
>>
>> I believe scrapy would meet your needs, especially since you have a
>> decentralized queue to feed the URLs into it.
>>
>> 1. If you use the start_requests() method (see more:
>> http://doc.scrapy.org/en/latest/topics/spiders.html#scrapy.spider.Spider.start_requests),
>> you can just consume from the queue to feed URLs into scrapy. You can
>> pop the queue, modify the URL as needed, and yield a request to the
>> scrapy core engine.
>>
>> 2. scrapyd is a convenient way to send jobs around to different
>> systems without having to copy your codebase. It's essentially a
>> deployment tool.
>>
>> Scrapy is quite efficient at web scraping. Scraping is I/O bound, and
>> scrapy uses Twisted, an async HTTP framework. So scrapy fires off a
>> request, then forgets about it until the response comes back through
>> Twisted. In the interim, it can process or fire off other requests.
>>
>> Processing requirements vary, but I would expect you could run
>> hundreds, if not thousands, of concurrent scraping requests on a
>> medium-sized EC2 server.
>>
>> In my experience, the only shortcomings of scrapy are its
>> architectural complexity (it takes some time to master) and its lack
>> of JavaScript support. Many sites are single-page apps that load
>> their content via JS, and scrapy (to my knowledge) can't do anything
>> with that.
>>
>> Hope this helps,
>> Travis
>>
>> On Wed, Dec 3, 2014 at 4:01 AM, <[email protected]> wrote:
>>
>>> Hi,
>>> I am building a back-end, one of whose modules needs to do web
>>> scraping of various sites. The URL originates from an end user, so
>>> the domain is known beforehand, but the full URL is dynamic.
>>>
>>> The back-end is planned to support thousands of requests per second.
>>> I like what I see of scrapy regarding feature coverage,
>>> extensibility, ease of use, and more, but I am concerned about these
>>> two points:
>>>
>>> 1. Passing the URL in real time as an argument to scrapy, where only
>>> the domain (and therefore the specific spider) is known.
>>> 2. I've read that to invoke scrapy via an API one should use scrapyd
>>> with its JSON API, which spawns a process per scraping job. That
>>> means a process runs per request, and this is not scalable (imagine
>>> each request taking 1.5 seconds).
>>>
>>> Please advise,

--
Andres Vargas
www.zodman.com.mx
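Travis's point 1, consuming URLs from a queue inside start_requests(),
can be sketched in code. This is a minimal sketch assuming a Redis list
named "urls" stands in for the decentralized queue; the queue
technology, key name, and connection details are assumptions, not
something the thread specifies:

import redis
import scrapy


class QueueSpider(scrapy.Spider):
    name = "queue_spider"

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Hypothetical connection details; adjust for your deployment.
        self.queue = redis.Redis(host="localhost", port=6379)

    def start_requests(self):
        # Scrapy consumes this generator lazily, so URLs are popped only
        # as the engine has capacity for more in-flight requests.
        while True:
            url = self.queue.lpop("urls")
            if url is None:
                break  # queue drained; end this run
            # Modify the URL here as needed before yielding it.
            yield scrapy.Request(url.decode("utf-8"), callback=self.parse)

    def parse(self, response):
        yield {"url": response.url,
               "title": response.css("title::text").get()}

Because start_requests() is a generator, the engine pulls from it on
demand; the spider simply stops when the queue comes up empty.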
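On the scrapyd concern in point 2 of the original question: scrapyd
starts one process per scheduled spider run, not one per URL, so a
single run of a queue-driven spider like the one above can service many
URLs. A run is scheduled through scrapyd's schedule.json endpoint; a
minimal sketch, where the host, project name, and spider name are
placeholders:

import requests

# Ask scrapyd to start one run of the queue-driven spider.
resp = requests.post(
    "http://localhost:6800/schedule.json",
    data={"project": "myproject", "spider": "queue_spider"},
)
print(resp.json())  # e.g. {"status": "ok", "jobid": "..."}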

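Travis's estimate of hundreds or thousands of concurrent requests is
governed by scrapy's concurrency settings. A sketch of the relevant
knobs in settings.py; the values are illustrative, not recommendations:

# settings.py
CONCURRENT_REQUESTS = 500            # global cap on in-flight requests
CONCURRENT_REQUESTS_PER_DOMAIN = 16  # per-domain politeness cap
REACTOR_THREADPOOL_MAXSIZE = 20      # Twisted thread pool (DNS lookups, etc.)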