Hello, I have a generic spider that conducts deep crawls, with additional functionality to download certain documents and convert them to text.
I know there is a files pipeline for this in scrapy.contrib, but it does not fit my needs: I want each file to be treated as a single item rather than wrapped in a parent item as "downloaded content", and I don't want to store the files to disk. What I want is to parse responses with binary bodies (PDFs), convert them to text, and then store the text in my item class.

Right now I sniff the MIME type of response.body with python-magic and, based on the content, dispatch to a "binary response parser" in my spider. This is probably not the best way to do it, though, and I'm looking for advice on how to do it better, perhaps by utilizing some of the Scrapy architecture such as a downloader middleware, or doing it in my pipeline.

Another issue is that my generic ItemLoader does not like binary input. Is there some way to get around this?

Thanks,
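For reference, here is a minimal sketch of the kind of "binary response parser" dispatch described above. To keep it self-contained it uses simple magic-byte signatures instead of python-magic (which needs libmagic installed); with python-magic available you would use `magic.from_buffer(body, mime=True)` instead. The handler logic and the `extract` placeholder are hypothetical, not Scrapy API:

```python
# Sketch of a MIME-sniffing dispatch for binary response bodies.
# Signature-based sniffing stands in for python-magic here so the
# example runs with the standard library only.

# Leading-byte signatures for a few binary formats worth converting.
SIGNATURES = {
    b"%PDF-": "application/pdf",
    b"PK\x03\x04": "application/zip",          # also docx/xlsx containers
    b"\xd0\xcf\x11\xe0": "application/msword",  # legacy OLE2 documents
}

def sniff_mime(body: bytes) -> str:
    """Guess the MIME type of a response body from its leading bytes."""
    for sig, mime in SIGNATURES.items():
        if body.startswith(sig):
            return mime
    return "text/html"  # fall back to treating it as a normal page

def parse_any(body: bytes) -> dict:
    """Dispatch a body to the right extractor and build an item dict.

    In a real spider this logic would live in parse(); the extracted
    text would then be set on the item directly, bypassing the
    ItemLoader for binary input.
    """
    mime = sniff_mime(body)
    if mime == "application/pdf":
        # Placeholder: a real converter (e.g. pdfminer.six) would run here.
        text = "<pdf text placeholder>"
    else:
        text = body.decode("utf-8", errors="replace")
    return {"mime": mime, "text": text}
```

A cleaner Scrapy-native variant of the same idea is to do the sniffing in a downloader middleware's `process_response`, replacing the binary response with a `TextResponse` whose body is the converted text, so the spider callbacks only ever see text.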
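On the ItemLoader question, one common workaround is to give the affected field an input processor that converts bytes to text before the loader handles the value. The sketch below shows such a processor as a plain callable so it runs without Scrapy installed; in a real loader you would attach it on the loader class, e.g. as `body_in = MapCompose(to_text)`. The decode-as-UTF-8 step is a placeholder assumption; for PDFs the actual conversion would happen here instead:

```python
# Sketch of an ItemLoader input processor that tolerates binary input.

def to_text(value):
    """Input processor: turn a bytes payload into text.

    For PDFs this is where a converter (e.g. pdfminer.six) would run;
    here we just decode, replacing undecodable bytes, as a placeholder.
    Non-bytes values pass through unchanged.
    """
    if isinstance(value, bytes):
        return value.decode("utf-8", errors="replace")
    return value
```

Because ItemLoader input processors run on every value added to the field, this keeps the rest of the generic loader untouched while making binary bodies safe to load.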