OK, so I found the problem (at least the major one for now): I was using return instead of yield in some of the parse callbacks. Once I switched to yield I saw a 10-20x improvement.
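For anyone hitting the same wall: with return, the callback hands back one item per response instead of all of them. Here's a minimal sketch of the fix (the spider name and selectors are made up; only the return-vs-yield contrast is the point):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider, for illustration only
    start_urls = ["http://example.com/listing"]

    def parse(self, response):
        for row in response.css("div.item"):
            # Buggy version: "return {...}" here would exit parse() on the
            # first row, emitting a single item per page no matter how many
            # the page actually contains.
            yield {"title": row.css("h2::text").extract_first()}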
On Monday, November 9, 2015 at 9:18:58 PM UTC+2, Travis Leleu wrote:
>
> What kind of speed are you seeing if you just output to a CSV file? That
> may be useful to know in debugging where the problem lies.
>
> On Mon, Nov 9, 2015 at 12:46 PM, Daniel Dubovski <dan...@hyprbrands.com> wrote:
>
>> I have a simple pipeline that writes to S3 asynchronously (using
>> twisted's threads.deferToThread).
>>
>> I also check against S3 whether a URL was already scraped (to save the
>> re-write), also asynchronously.
>>
>> Besides that, I tried commenting out any 'outer world' code like S3; it
>> seems to have some effect on the number of pages scraped but not on the
>> items.
>>
>> On Sunday, November 8, 2015 at 9:55:57 PM UTC+2, Travis Leleu wrote:
>>>
>>> What pipelines are you using? If it's something like MySQL, I think
>>> it's both synchronous and single-threaded, so if you're maxing out that
>>> pipeline, that could be your issue.
>>>
>>> On Sun, Nov 8, 2015 at 12:23 PM, Daniel Dubovski <dan...@hyprbrands.com> wrote:
>>>
>>>> I have a working scrapy spider deployed on an Amazon EC2 instance
>>>> (c4.xlarge) and running under scrapyd.
>>>>
>>>> No matter what I do, I can't seem to top ~200 processed items per
>>>> minute (according to the scrapy logs).
>>>>
>>>> I tried playing around with scrapyd's concurrency settings, but
>>>> nothing helped. I tried lowering scrapyd's max_proc_per_cpu to 1 to
>>>> avoid context switching, and I tried running separate scrapy crawlers
>>>> from the command line; together they still produce the same aggregate
>>>> of around 200 items per minute.
>>>>
>>>> I can see from the scrapy logs that the aggregate number of web pages
>>>> hit increases almost linearly, but scraped items per minute seems
>>>> stuck at 200.
>>>>
>>>> Any tips? Has anybody come across this before? Have I missed a setting
>>>> somewhere?
>>>>
>>>> Much appreciated, Daniel.
>>>>
>>>> *Also asked on stackoverflow.com:
>>>> http://stackoverflow.com/questions/33595986/scrapy-scrpyd-cant-process-more-than-200-items-per-minute
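For reference, the async S3 pipeline pattern Daniel describes in the quoted thread would look roughly like the sketch below. This is written under assumptions, not his actual code: boto3, the bucket name, the key scheme, and the _upload helper are all stand-ins. The real parts are twisted's deferToThread and the fact that Scrapy accepts a Deferred returned from process_item, so the crawl keeps going while the upload runs:

import json

import boto3
from twisted.internet.threads import deferToThread


class S3WriterPipeline(object):
    def __init__(self):
        self.s3 = boto3.client("s3")
        self.bucket = "my-scraped-items"  # hypothetical bucket name

    def _upload(self, item):
        # Runs in a reactor thread-pool thread, so the blocking boto3 call
        # does not stall Scrapy's event loop.
        key = "items/%s.json" % item["url"].replace("/", "_")  # hypothetical key scheme
        self.s3.put_object(Bucket=self.bucket, Key=key,
                           Body=json.dumps(dict(item)))
        return item

    def process_item(self, item, spider):
        # Returning the Deferred lets the crawl continue while the upload
        # finishes in the background.
        return deferToThread(self._upload, item)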