Ok, I will tell you the whole, true story. I need to scrape a page that contains some zips. I want those zips, but they are potentially huge and could change over time. Inside each one there is a single, large XML file, and the changes that interest me are at the top of it.
I would like to build a beautiful streaming process: stream the request into an unzipper, then into a streaming XML parser, and interrupt the request as soon as I get what I want. Maybe I'm pre-optimizing bandwidth... I can't seem to make Scrapy stream requests; I've gone deep into the code and it does not look nice to me.

Current options:

1 - Use Scrapy to get the zip file URLs, and use an external program to do all of the above. (Least desired solution, as I would lose the item pipeline and all of Scrapy's flexibility.)
2 - Create a downloader middleware that actually makes the request using Python's requests library (it supports streaming) and returns a response built from the requests response. Is this viable? Is this a Pandora's box (mixing requests with Twisted)?
3 - Create a download handler for zip files only and use the method above. How does Scrapy forward a specific request to a download handler? Would this be a good solution (even if I need to subclass HTTPDownloadHandler)?
4 - Fuck it, buy RAM, buy bandwidth, and drink less beer!

Anyone have any ideas?! Thanks!

--
You received this message because you are subscribed to the Google Groups "scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.
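For what it's worth, here is a minimal sketch of the streaming pipeline itself, independent of Scrapy: feed raw ZIP bytes through an incremental inflater into `xml.etree`'s `XMLPullParser` and stop at the first interesting element. Everything here is illustrative and makes assumptions — the element name `record` is hypothetical, only the first ZIP member is handled, only deflate compression is supported, and no ZIP integrity checking is done. In a real middleware the chunks would come from `requests.get(url, stream=True).iter_content(...)`, and you would close the connection once the function returns.

```python
import io
import struct
import zipfile
import zlib
import xml.etree.ElementTree as ET


def first_record(chunks):
    """Scan a stream of ZIP bytes, incrementally inflate the first member,
    and return the text of the first <record> element, consuming only as
    many chunks as necessary.  <record> is a hypothetical element name."""
    it = iter(chunks)
    buf = b""
    # Accumulate until we have the 30-byte fixed part of the local file header.
    while len(buf) < 30:
        buf += next(it)
    assert buf[:4] == b"PK\x03\x04", "not a ZIP local file header"
    # Local file header: compression method at offset 8,
    # filename/extra-field lengths at offsets 26 and 28 (little-endian).
    (method,) = struct.unpack("<H", buf[8:10])
    name_len, extra_len = struct.unpack("<HH", buf[26:30])
    data_start = 30 + name_len + extra_len
    while len(buf) < data_start:
        buf += next(it)

    inflater = zlib.decompressobj(-zlib.MAX_WBITS)  # raw deflate, no zlib header
    parser = ET.XMLPullParser(events=("end",))

    def feed(raw):
        # method 8 = deflate; method 0 = stored (passed through unchanged,
        # which is only safe here because we stop before trailing ZIP records).
        xml_bytes = inflater.decompress(raw) if method == 8 else raw
        parser.feed(xml_bytes)
        for _event, elem in parser.read_events():
            if elem.tag == "record":
                return elem.text
        return None

    found = feed(buf[data_start:])
    while found is None:
        found = feed(next(it))  # StopIteration here means no <record> existed
    return found


# In-memory demo: a big zipped XML document whose interesting bit is at the top.
payload = b"<feed><record>newest</record>" + b"<junk/>" * 100000 + b"</feed>"
raw = io.BytesIO()
with zipfile.ZipFile(raw, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("big.xml", payload)
data = raw.getvalue()
chunks = (data[i:i + 512] for i in range(0, len(data), 512))
print(first_record(chunks))
```

The same generator-of-chunks shape is exactly what `iter_content` gives you, so the parsing core stays the same whichever of options 2 or 3 ends up driving the download.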
