Hello,

I'm running a long web crawl with scrapyd and Scrapy 1.0.3 on an Amazon 
EC2 instance. I'm exporting jsonlines files to S3 using these settings in 
my project's settings.py file:

FEED_FORMAT = 'jsonlines'
FEED_URI = 's3://my-bucket-name'

I've done this a number of times without issue. But I'm running into a 
problem on one particularly large crawl: the local disk (which isn't 
particularly big) fills up with the in-progress crawl's data before the 
crawl can complete, and thus before the results can be uploaded to S3.

I'm wondering: is there any way to configure where the "intermediate" 
results of the crawl are written prior to being uploaded to S3? I'm 
assuming that the in-progress feed data is not held entirely in RAM but 
spooled to disk somewhere, and if that's the case, I'd like to point that 
location at an external mount with enough space to hold the results 
before the completed .jl file is shipped to S3.
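
In case a concrete sketch helps show what I'm after: my (unverified) 
assumption is that the S3 feed storage buffers the whole feed to a 
standard tempfile.TemporaryFile and only uploads it once the crawl 
finishes. If that's right, pointing Python's tempfile module at the 
external mount, say at the top of settings.py, might be enough. The 
/mnt/bigdisk path below is just a placeholder for wherever the big 
volume is mounted:

import os
import tempfile

# Placeholder path: an external volume with enough room for the full feed.
SPOOL_DIR = '/mnt/bigdisk/scrapy-tmp'
if not os.path.isdir(SPOOL_DIR):
    os.makedirs(SPOOL_DIR)

# tempfile honors this module-level override for all TemporaryFile calls,
# so (if my assumption about the S3 backend holds) the in-progress feed
# would spool here instead of filling the small root disk.
tempfile.tempdir = SPOOL_DIR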

Thanks, and apologies if this is covered in the documentation and I 
simply missed it.

Brian
