Hi Casey,
I ended up using Paul's suggestion and expanded on it to fit my needs.
Basically, my spider creates a single FileDownloadItem instance of the form
{'file_urls': [{'file_url': url, 'file_name': name}, ...]}, where each dict's
file_url is the web URL to download and file_name is the title to save it
under. The spider yields that item to the FilesPipeline, in which I edited a
few functions to better match the item structure I pass in.
    def _get_filesystem_path(self, path):
        return self.basedir + path[0]

    def file_path(self, request, response=None, info=None):
        def _warn():
            from scrapy.exceptions import ScrapyDeprecationWarning
            import warnings
            warnings.warn('FilesPipeline.file_key(url) method is deprecated, please use '
                          'file_path(request, response=None, info=None) instead',
                          category=ScrapyDeprecationWarning, stacklevel=1)

        # check if called from file_key() with a URL as first argument
        if not isinstance(request, Request):
            _warn()
            url = request
        else:
            url = request.url

        # detect if the file_key() method has been overridden
        if not hasattr(self.file_key, '_base'):
            _warn()
            return self.file_key(url)

        media_ext = os.path.splitext(url)[1]  # change to request.url after deprecation
        # file_name comes from the file_spec dict set in get_media_requests()
        ret = request.meta["file_spec"]["file_name"]
        return ret[0] + media_ext
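
For reference, the request.meta["file_spec"] used in file_path() above has to
be set when the download requests are created. A minimal sketch of the
corresponding get_media_requests() override, along the lines of Paul's
suggestion below (the exact item structure is assumed from my description
above):

    def get_media_requests(self, item, info):
        # each entry in item['file_urls'] is assumed to be a dict of the
        # form {'file_url': <web url>, 'file_name': <save title>}
        for file_spec in item['file_urls']:
            # pass the whole dict along so file_path() can read the name
            yield Request(url=file_spec["file_url"],
                          meta={"file_spec": file_spec})
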
On Sun, Mar 23, 2014 at 10:00 PM, Casey Klimkowsky <[email protected]> wrote:
> Hi Matt,
>
> I was wondering if you ever figured out your problem? I am also looking to
> use the FilesPipeline with custom file names. I was able to edit
> FilesPipeline itself to achieve this result, but obviously it would be a
> better practice to extend FilesPipeline and override the necessary methods
> instead. When I use a solution similar to Paul's, my files are not
> downloaded to my hard drive.
>
> Thank you!
>
>
> On Tuesday, February 25, 2014 9:03:20 AM UTC-6, Matt Cialini wrote:
>
>> Hi Paul,
>>
>> Thanks for the suggestion. I'm trying to implement it now but the files
>> aren't being written to disk correctly. What function in files.py handles
>> the actual saving of the file?
>>
>> Every item that reaches files.py is ultimately a FileDownloadItem of the
>> form {'file_urls': [{'file_url': url, 'file_name': name}, ...]}.
>>
>> I'll attach my code to this if you have time to look it over. Basically I
>> think something is not being passed in correctly in files.py, but it's hard
>> to search through and determine where.
>>
>> Thanks so much Paul!
>>
>> - Matt C
>>
>>
>> On Tue, Feb 25, 2014 at 4:28 AM, Paul Tremberth <[email protected]> wrote:
>>
>>> Hi Matt,
>>>
>>> one way to do that is to override the FilesPipeline's
>>> *get_media_requests()*,
>>> passing additional data through the meta dict,
>>> and then use a custom *file_path()* method.
>>>
>>> Below, each entry in *file_urls* is a dict rather than a plain URL string,
>>> so that I can pass both a URL and a custom *file_name*.
>>>
>>> Using the same IETF example I used above in the thread:
>>>
>>> A simple spider downloading some files from IETF.org
>>>
>>> from scrapy.spider import Spider
>>> from scrapy.http import Request
>>> from scrapy.item import Item, Field
>>>
>>>
>>> class IetfItem(Item):
>>>     files = Field()
>>>     file_urls = Field()
>>>
>>>
>>> class IETFSpider(Spider):
>>>     name = 'ietfpipe'
>>>     allowed_domains = ['ietf.org']
>>>     start_urls = ['http://www.ietf.org']
>>>     file_urls = [
>>>         'http://www.ietf.org/images/ietflogotrans.gif',
>>>         'http://www.ietf.org/rfc/rfc2616.txt',
>>>         'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>         'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>         'http://tools.ietf.org/html/rfc2616.html',
>>>     ]
>>>
>>>     def parse(self, response):
>>>         for cnt, furl in enumerate(self.file_urls, start=1):
>>>             yield IetfItem(file_urls=[{"file_url": furl, "file_name": "file_%03d" % cnt}])
>>>
>>>
>>>
>>> Custom FilesPipeline
>>>
>>> from scrapy.contrib.pipeline.files import FilesPipeline
>>> from scrapy.http import Request
>>>
>>> class MyFilesPipeline(FilesPipeline):
>>>
>>>     def get_media_requests(self, item, info):
>>>         for file_spec in item['file_urls']:
>>>             yield Request(url=file_spec["file_url"], meta={"file_spec": file_spec})
>>>
>>>     def file_path(self, request, response=None, info=None):
>>>         return request.meta["file_spec"]["file_name"]
>>>
>>>
>>>
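>>> To actually pick this up, the custom pipeline class also needs to be
>>> enabled in settings.py in place of the stock one. A minimal sketch,
>>> assuming the class above lives in yourproject/pipelines.py:
>>>
>>> ITEM_PIPELINES = [
>>>     'yourproject.pipelines.MyFilesPipeline',
>>> ]
>>> FILES_STORE = '/path/to/yourproject/downloads'
>>>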
>>> Hope this helps
>>>
>>> /Paul.
>>>
>>> On Friday, February 21, 2014 6:44:20 AM UTC+1, Matt Cialini wrote:
>>>>
>>>> Hello Paul!
>>>>
>>>> I'm Matt. I know this thread is somewhat old now, but I found your advice
>>>> about FilesPipeline and it works great. I had one question, though.
>>>> Do you know of an easy way to pass in a file_name field for each url so
>>>> that the FilesPipeline will save each url with the correct name?
>>>>
>>>> Thanks!
>>>>
>>>> On Saturday, September 21, 2013 1:03:09 PM UTC-4, Paul Tremberth wrote:
>>>>>
>>>>> Hi Ana,
>>>>>
>>>>> if you want to use the FilesPipeline, before it's in an official
>>>>> Scrapy release,
>>>>> here's one way to do it:
>>>>>
>>>>> 1) download
>>>>> https://raw.github.com/scrapy/scrapy/master/scrapy/contrib/pipeline/files.py
>>>>> and save it somewhere in your Scrapy project,
>>>>> let's say at the root of your project (but that's not the best
>>>>> location...)
>>>>> yourproject/files.py
>>>>>
>>>>> 2) then, enable this pipeline by adding this to your settings.py
>>>>>
>>>>> ITEM_PIPELINES = [
>>>>>     'yourproject.files.FilesPipeline',
>>>>> ]
>>>>> FILES_STORE = '/path/to/yourproject/downloads'
>>>>>
>>>>> FILES_STORE needs to point to a location where Scrapy can write
>>>>> (create it beforehand)
>>>>>
>>>>> 3) add 2 special fields to your item definition:
>>>>>     file_urls = Field()
>>>>>     files = Field()
>>>>>
>>>>> 4) in your spider, when you have a URL for a file to download,
>>>>> add it to your Item instance before returning it
>>>>>
>>>>> ...
>>>>> myitem = YourProjectItem()
>>>>> ...
>>>>> myitem["file_urls"] = ["http://www.example.com/somefileiwant.csv"]
>>>>> yield myitem
>>>>>
>>>>> 5) run your spider and you should see files in the FILES_STORE folder
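>>>>>
>>>>> For example, with the spider from the example below (named "ietf"), that would be:
>>>>>
>>>>>     scrapy crawl ietf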
>>>>>
>>>>> Here's an example that downloads a few files from the IETF website.
>>>>>
>>>>> the scrapy project is called "filedownload"
>>>>>
>>>>> items.py looks like this:
>>>>>
>>>>> from scrapy.item import Item, Field
>>>>>
>>>>> class FiledownloadItem(Item):
>>>>>     file_urls = Field()
>>>>>     files = Field()
>>>>>
>>>>>
>>>>> this is the code for the spider:
>>>>>
>>>>> from scrapy.spider import BaseSpider
>>>>> from filedownload.items import FiledownloadItem
>>>>>
>>>>> class IetfSpider(BaseSpider):
>>>>>     name = "ietf"
>>>>>     allowed_domains = ["ietf.org"]
>>>>>     start_urls = (
>>>>>         'http://www.ietf.org/',
>>>>>     )
>>>>>
>>>>>     def parse(self, response):
>>>>>         yield FiledownloadItem(
>>>>>             file_urls=[
>>>>>                 'http://www.ietf.org/images/ietflogotrans.gif',
>>>>>                 'http://www.ietf.org/rfc/rfc2616.txt',
>>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>>                 'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>>                 'http://tools.ietf.org/html/rfc2616.html',
>>>>>             ]
>>>>>         )
>>>>>
>>>>> When you run the spider, at the end, you should see in the console
>>>>> something like this:
>>>>>
>>>>> 2013-09-21 18:30:42+0200 [ietf] DEBUG: Scraped from <200 http://www.ietf.org/>
>>>>> {'file_urls': ['http://www.ietf.org/images/ietflogotrans.gif',
>>>>>                'http://www.ietf.org/rfc/rfc2616.txt',
>>>>>                'http://www.rfc-editor.org/rfc/rfc2616.ps',
>>>>>                'http://www.rfc-editor.org/rfc/rfc2616.pdf',
>>>>>                'http://tools.ietf.org/html/rfc2616.html'],
>>>>>  'files': [{'checksum': 'e4b6ca0dd271ce887e70a1a2a5d681df',
>>>>>             'path': 'full/4f7f3e96b2dda337913105cd751a2d05d7e64b64.gif',
>>>>>             'url': 'http://www.ietf.org/images/ietflogotrans.gif'},
>>>>>            {'checksum': '9fa63f5083e4d2112d2e71b008e387e8',
>>>>>             'path': 'full/454ea89fbeaf00219fbcae49960d8bd1016994b0.txt',
>>>>>             'url': 'http://www.ietf.org/rfc/rfc2616.txt'},
>>>>>            {'checksum': '5f0dc88aced3b0678d702fb26454e851',
>>>>>             'path': 'full/f76736e9f1f22d7d5563208d97d13e7cc7a3a633.ps',
>>>>>             'url': 'http://www.rfc-editor.org/rfc/rfc2616.ps'},
>>>>>            {'checksum': '2d555310626966c3521cda04ae2fe76f',
>>>>>             'path': 'full/6ff52709da9514feb13211b6eb050458f353b49a.pdf',
>>>>>             'url': 'http://www.rfc-editor.org/rfc/rfc2616.pdf'},
>>>>>            {'checksum': '735820b4f0f4df7048b288ba36612295',
>>>>>             'path': 'full/7192dd9a00a8567bf3dc4c21ababdcec6c69ce7f.html',
>>>>>             'url': 'http://tools.ietf.org/html/rfc2616.html'}]}
>>>>> 2013-09-21 18:30:42+0200 [ietf] INFO: Closing spider (finished)
>>>>>
>>>>> which tells you what files were downloaded, and where they were stored.
>>>>>
>>>>> Hope this helps.
>>>>>
>>>>> On Tuesday, September 17, 2013 1:46:15 PM UTC+2, Ana Carolina Assis
>>>>> Jesus wrote:
>>>>>>
>>>>>> Hi Paul,
>>>>>>
>>>>>> Could you give me an example on how to use the pipeline, please?
>>>>>>
>>>>>> Thanks,
>>>>>> Ana
>>>>>>
>>>>>> On Tue, Sep 17, 2013 at 12:19 PM, Ana Carolina Assis Jesus
>>>>>> <[email protected]> wrote:
>>>>>> > Well, I installed it about two weeks ago, but a tagged version... so
>>>>>> > maybe I don't have it...
>>>>>> > But I really need the pipeline, even though clicking the button
>>>>>> > should, in principle at least, just download a file! I mean, that is
>>>>>> > what it does manually... ???
>>>>>> >
>>>>>> > Thanks!
>>>>>> >
>>>>>> > On Tue, Sep 17, 2013 at 12:14 PM, Paul Tremberth
>>>>>> > <[email protected]> wrote:
>>>>>> >> Well, the FilesPipeline is a module inside scrapy.contrib.pipeline.
>>>>>> >> It was committed less than 2 weeks ago. (Scrapy is being improved
>>>>>> >> all the time by the community.)
>>>>>> >>
>>>>>> >> It depends when and how you installed Scrapy:
>>>>>> >> - if you installed a tagged version using pip or easy_install (as
>>>>>> >> recommended:
>>>>>> >> http://doc.scrapy.org/en/latest/intro/install.html#installing-scrapy)
>>>>>> >> you won't have the pipeline and you'll have to add it yourself
>>>>>> >>
>>>>>> >> - if you installed from source less than 2 weeks ago (git clone
>>>>>> >> [email protected]:scrapy/scrapy.git; cd scrapy; sudo python setup.py install)
>>>>>> >> you should be good (but Scrapy from the latest source code might be
>>>>>> >> unstable and not fully tested)
>>>>>> >>
>>>>>> >>
>>>>>> >> On Tuesday, September 17, 2013 12:04:31 PM UTC+2, Ana Carolina
>>>>>> Assis Jesus
>>>>>> >> wrote:
>>>>>> >>>
>>>>>> >>> Hi Paul.
>>>>>> >>>
>>>>>> >>> What do you mean by installing Scrapy from source?
>>>>>> >>> Do I need a new version of it?
>>>>>> >>>
>>>>>> >>> On Tue, Sep 17, 2013 at 12:01 PM, Paul Tremberth
>>>>>> >>> <[email protected]> wrote:
>>>>>> >>> > Hi Ana,
>>>>>> >>> > to download files, you should have a look at the new
>>>>>> FilesPipeline
>>>>>> >>> > https://github.com/scrapy/scrapy/pull/370
>>>>>> >>> >
>>>>>> >>> > It's in the master branch though, not in a tagged version of
>>>>>> Scrapy, so
>>>>>> >>> > you'll have to install scrapy from source.
>>>>>> >>> >
>>>>>> >>> > Paul.
>>>>>> >>> >
>>>>>> >>> >
>>>>>> >>> > On Tuesday, September 17, 2013 11:50:05 AM UTC+2, Ana Carolina
>>>>>> Assis
>>>>>> >>> > Jesus
>>>>>> >>> > wrote:
>>>>>> >>> >>
>>>>>> >>> >> Hi!
>>>>>> >>> >>
>>>>>> >>> >> I am trying to download a csv file with scrapy.
>>>>>> >>> >> I was able to crawl into the site and get to the form I need,
>>>>>> >>> >> where I find two buttons to click.
>>>>>> >>> >> One will list the transactions, while the second one will
>>>>>> >>> >> download an XXX.csv file.
>>>>>> >>> >>
>>>>>> >>> >> How do I save this file within scrapy?
>>>>>> >>> >>
>>>>>> >>> >> I mean, if I choose to list the transactions, I get another
>>>>>> >>> >> webpage, and that I can see.
>>>>>> >>> >> But what if I choose the download action? I guess I should not
>>>>>> >>> >> use return self.parse_dosomething, but something else to save
>>>>>> >>> >> the file it should give me (???)
>>>>>> >>> >>
>>>>>> >>> >> Or should the download start by itself?
>>>>>> >>> >>
>>>>>> >>> >> Thanks,
>>>>>> >>> >> Ana
>>>>>> >>> >
>>>>>> >>
>>>>>>
>>>
>>
>
--
You received this message because you are subscribed to the Google Groups
"scrapy-users" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to [email protected].
To post to this group, send email to [email protected].
Visit this group at http://groups.google.com/group/scrapy-users.
For more options, visit https://groups.google.com/d/optout.