Ok, I will tell you the whole, true story. I need to scrape a page that contains some zips. I want those zips, but they are potentially huge and could change over time. Inside each one there is a single, large XML file, and the changes that interest me are at the top of it.
I would like to build a beautiful streaming process: stream the request into an unzipper, then into a streaming XML parser, and interrupt the request as soon as I get what I want. Maybe I'm pre-optimizing bandwidth... I can't seem to make Scrapy stream requests; I've gone deep into the code and it does not look nice to me.

Current options:

1 - Use Scrapy to get the zip file URLs, and use an external program to do all of the above. (Least desired solution, as I would lose the item pipeline and all of Scrapy's flexibility.)
2 - Create a downloader middleware that actually makes the request using Python's requests library (it supports streaming) and returns a response built from the requests response. Is this viable? Is this a Pandora's box (mixing requests with Twisted)?
3 - Create a download handler for zip files only and use the method above. How does Scrapy forward a specific request to a download handler? Would this be a good solution (even if I need to subclass HTTPDownloadHandler)?
4 - Fuck it, buy RAM, buy bandwidth, and drink less beer!

Anyone have any ideas?! Thanks!

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
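For what it's worth, here is a minimal sketch of the streaming pipeline itself, independent of Scrapy: feed raw ZIP bytes through an incremental inflater into `xml.etree`'s `XMLPullParser` and stop at the first interesting element. Everything here is illustrative and makes assumptions — the element name `record` is hypothetical, only the first ZIP member is handled, only deflate compression is supported, and no ZIP integrity checking is done. In a real middleware the chunks would come from `requests.get(url, stream=True).iter_content(...)`, and you would close the connection once the function returns.

```python
import io
import struct
import zipfile
import zlib
import xml.etree.ElementTree as ET


def first_record(chunks):
    """Scan a stream of ZIP bytes, incrementally inflate the first member,
    and return the text of the first <record> element, consuming only as
    many chunks as necessary.  <record> is a hypothetical element name."""
    it = iter(chunks)
    buf = b""
    # Accumulate until we have the 30-byte fixed part of the local file header.
    while len(buf) < 30:
        buf += next(it)
    assert buf[:4] == b"PK\x03\x04", "not a ZIP local file header"
    # Local file header: compression method at offset 8,
    # filename/extra-field lengths at offsets 26 and 28 (little-endian).
    (method,) = struct.unpack("<H", buf[8:10])
    name_len, extra_len = struct.unpack("<HH", buf[26:30])
    data_start = 30 + name_len + extra_len
    while len(buf) < data_start:
        buf += next(it)

    inflater = zlib.decompressobj(-zlib.MAX_WBITS)  # raw deflate, no zlib header
    parser = ET.XMLPullParser(events=("end",))

    def feed(raw):
        # method 8 = deflate; method 0 = stored (passed through unchanged,
        # which is only safe here because we stop before trailing ZIP records).
        xml_bytes = inflater.decompress(raw) if method == 8 else raw
        parser.feed(xml_bytes)
        for _event, elem in parser.read_events():
            if elem.tag == "record":
                return elem.text
        return None

    found = feed(buf[data_start:])
    while found is None:
        found = feed(next(it))  # StopIteration here means no <record> existed
    return found


# In-memory demo: a big zipped XML document whose interesting bit is at the top.
payload = b"<feed><record>newest</record>" + b"<junk/>" * 100000 + b"</feed>"
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("big.xml", payload)
data = raw.getvalue()
chunks = (data[i:i + 512] for i in range(0, len(data), 512))
print(first_record(chunks))
```

The same generator-of-chunks shape is exactly what `iter_content` gives you, so the parsing core stays the same whichever of options 2 or 3 ends up driving the download.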
