Hello,

I have a generic spider that basically conducts deep crawls, with additional 
functionality to download certain documents and convert them to text.

I know we have a files pipeline for this in scrapy.contrib, but it does not 
fit my needs: I want each file to be treated as a single item, i.e. I do not 
want to wrap it in a parent item as "downloaded content", and I don't want 
to store the files to disk.

What I want to achieve is to parse responses with binary bodies (PDFs), 
convert them to text, and then store the text in my item class. Right now 
I'm sniffing the MIME type of response.body with python-magic and, based on 
the content type, dispatching to a "binary response parser" inside my spider.
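Roughly, the current approach looks like the sketch below. This is only an 
illustration of the idea: instead of python-magic it checks the PDF magic 
bytes directly, and pdf_to_text / parse_binary_response are made-up names 
standing in for whatever converter (e.g. pdfminer) is actually used.

```python
def looks_like_pdf(body: bytes) -> bool:
    """Cheap MIME sniff: PDF files start with the %PDF- signature."""
    return body[:5] == b"%PDF-"


def parse_binary_response(body: bytes, pdf_to_text=lambda b: b.decode("latin-1")):
    """Return extracted text if the body is a PDF, otherwise None.

    pdf_to_text is a placeholder converter; a real spider would call a
    PDF library here and build the item from the returned text.
    """
    if looks_like_pdf(body):
        return pdf_to_text(body)
    return None
```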

This is probably not the best way to do it, though, and I'm looking for 
advice on how to do it in a better fashion, maybe by utilizing some of the 
Scrapy architecture such as a downloader middleware or an item pipeline.
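For example, I imagine a downloader middleware could replace PDF bodies with 
extracted text before the response ever reaches a spider callback. A minimal 
sketch of what I have in mind (PdfTextMiddleware and extract_text are 
hypothetical names; extract_text would wrap a real PDF library, and the 
middleware relies only on Scrapy's Response.replace() returning a copy with 
a new body):

```python
def extract_text(body: bytes) -> bytes:
    # Placeholder converter; swap in a real PDF-to-text library here.
    return body.decode("latin-1").encode("utf-8")


class PdfTextMiddleware:
    """Downloader middleware sketch: swap PDF bodies for extracted text."""

    def process_response(self, request, response, spider):
        if response.body[:5] == b"%PDF-":
            # Response.replace() returns a copy with the new body, so the
            # spider callback only ever sees text, never raw binary.
            return response.replace(body=extract_text(response.body))
        return response
```

Is something along these lines the intended use of downloader middleware, or 
does this belong in a pipeline instead?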

Another issue is that my generic ItemLoader does not accept binary input. 
Is there some way to get around this?
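One workaround I have considered is an input processor that decodes bytes 
before the loader's other processors run. decode_bytes below is a 
hypothetical processor I would wire into a loader subclass with something 
like MapCompose(decode_bytes); I am not sure whether this is the idiomatic 
solution:

```python
def decode_bytes(value, encoding="utf-8"):
    """ItemLoader input processor: pass text through, decode bytes,
    and drop anything else (returning None filters the value out)."""
    if isinstance(value, str):
        return value
    if isinstance(value, bytes):
        return value.decode(encoding, errors="replace")
    return None
```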

Thanks,

 

-- 
You received this message because you are subscribed to the Google Groups 
"scrapy-users" group.