Hi, I am building a back-end in which one of the modules needs to scrape various websites. The URL comes from an end user, so the domain is known beforehand, but the full URL is dynamic.
The back-end is planned to support thousands of requests per second. I like what I see in Scrapy regarding feature coverage, extensibility, ease of use and more, but I am concerned about these two points:

1. Passing the URL to Scrapy in real time as an argument, where only the domain (and therefore the specific spider) is known in advance.
2. I've read that in order to invoke Scrapy via an API one should use scrapyd with its JSON API, which starts a process per scraping job. This means one process runs per request, which is not scalable (imagine each request takes 1.5 seconds).

Please advise.
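
P.S. To put rough numbers on the second concern, here is a back-of-the-envelope sketch using Little's law (average number in flight = arrival rate x time per request). The 1000 requests per second is only my assumption for the low end of "thousands", the 1.5 seconds is the imagined per-request duration, and the function name is made up for illustration:

```python
# Rough steady-state estimate via Little's law: L = arrival_rate * time_in_system.
# Both inputs are assumptions, not measurements:
#   1000 req/s is the low end of "thousands of requests per second",
#   1.5 s is the imagined duration of a single scraping request.

def concurrent_processes(requests_per_second: float, seconds_per_request: float) -> float:
    """Average number of scraping processes alive at once in steady state."""
    return requests_per_second * seconds_per_request


if __name__ == "__main__":
    # With one OS process per request, this many processes run concurrently.
    print(concurrent_processes(1000, 1.5))
```

Under those assumptions, about 1500 processes would be alive at any moment, which is why a process per request worries me.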
