OK, so I found the problem (at least the major one for now): I was using return instead of yield in some of the parse callbacks. Once I switched to yield I saw a 10-20x improvement.
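For anyone hitting the same wall: with return, the callback hands back one item per response instead of all of them. Here's a minimal sketch of the fix (the spider name and selectors are made up; only the return-vs-yield contrast is the point):

import scrapy


class ExampleSpider(scrapy.Spider):
    name = "example"  # hypothetical spider, for illustration only
    start_urls = ["http://example.com/listing"]

    def parse(self, response):
        for row in response.css("div.item"):
            # Buggy version: "return {...}" here would exit parse() on the
            # first row, emitting a single item per page no matter how many
            # the page actually contains.
            yield {"title": row.css("h2::text").extract_first()}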
On Monday, November 9, 2015 at 9:18:58 PM UTC+2, Travis Leleu wrote:
>
> What kind of speed are you seeing if you just output to a CSV file? That
> may be useful to know in debugging where the problem lies.
>
> On Mon, Nov 9, 2015 at 12:46 PM, Daniel Dubovski <dan...@hyprbrands.com> wrote:
>
>> I have a simple pipeline that writes to S3 asynchronously (using
>> twisted's threads.deferToThread).
>>
>> I also check against S3 whether a URL was already scraped (to save the
>> re-write), also asynchronously.
>>
>> Besides that, I tried commenting out any 'outer world' code like S3; it
>> seems to have some effect on the number of pages scraped but not on the
>> items.
>>
>> On Sunday, November 8, 2015 at 9:55:57 PM UTC+2, Travis Leleu wrote:
>>>
>>> What pipelines are you using? If it's something like MySQL, I think
>>> it's both synchronous and single-threaded, so if you're maxing out that
>>> pipeline, that could be your issue.
>>>
>>> On Sun, Nov 8, 2015 at 12:23 PM, Daniel Dubovski <dan...@hyprbrands.com> wrote:
>>>
>>>> I have a working scrapy spider deployed on an Amazon EC2 instance
>>>> (c4.xlarge) and running under scrapyd.
>>>>
>>>> No matter what I do, I can't seem to top ~200 processed items per
>>>> minute (according to the scrapy logs).
>>>>
>>>> I tried playing around with scrapyd's concurrency settings, but
>>>> nothing helped. I tried lowering scrapyd's max_proc_per_cpu to 1 to
>>>> avoid context switching, and I tried running separate scrapy crawlers
>>>> from the command line; together they still produce the same aggregate
>>>> of around 200 items per minute.
>>>>
>>>> I can see from the scrapy logs that the aggregate number of web pages
>>>> hit increases almost linearly, but scraped items per minute seems
>>>> stuck at 200.
>>>>
>>>> Any tips? Has anybody come across this before? Have I missed a setting
>>>> somewhere?
>>>>
>>>> Much appreciated, Daniel.
>>>>
>>>> *Also asked on stackoverflow.com:
>>>> http://stackoverflow.com/questions/33595986/scrapy-scrpyd-cant-process-more-than-200-items-per-minute
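For reference, the async S3 pipeline pattern Daniel describes in the quoted thread would look roughly like the sketch below. This is written under assumptions, not his actual code: boto3, the bucket name, the key scheme, and the _upload helper are all stand-ins. The real parts are twisted's deferToThread and the fact that Scrapy accepts a Deferred returned from process_item, so the crawl keeps going while the upload runs:

import json

import boto3
from twisted.internet.threads import deferToThread


class S3WriterPipeline(object):
    def __init__(self):
        self.s3 = boto3.client("s3")
        self.bucket = "my-scraped-items"  # hypothetical bucket name

    def _upload(self, item):
        # Runs in a reactor thread-pool thread, so the blocking boto3 call
        # does not stall Scrapy's event loop.
        key = "items/%s.json" % item["url"].replace("/", "_")  # hypothetical key scheme
        self.s3.put_object(Bucket=self.bucket, Key=key,
                           Body=json.dumps(dict(item)))
        return item

    def process_item(self, item, spider):
        # Returning the Deferred lets the crawl continue while the upload
        # finishes in the background.
        return deferToThread(self._upload, item)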