Re: How to identify Cache location and delete it ?

Paul Tremberth Tue, 28 Feb 2017 02:40:15 -0800

Hi again cristimocean,

You have copy-pasted outputs from my comment 
in https://github.com/scrapy/scrapy/issues/2601#issuecomment-282993991
Can you share your real logs? (with LOG_LEVEL='DEBUG', scrapy startup logs 
with middleware and settings, example of crawled page with "cached" flag...)


Can you also share your crawl stats (that appear at the end)? They provide 
useful information on what happened.
Maybe you have duplicate requests, redirections to a single page, some 
non-200 responses, etc. Any of these could also explain a 15s crawl.
If the HTTP cache is enabled, its counters appear in the stats as 
httpcache/hit and httpcache/miss.
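
As a rough illustration of how to read those two counters, here is a small sketch in plain Python. It assumes you have copied the stats dict from the end of your crawl log; cache_hit_ratio is a hypothetical helper name and the numbers in the example are made up, not taken from any real crawl.

```python
def cache_hit_ratio(stats):
    """Return the share of requests served from the HTTP cache, or None.

    `stats` is assumed to be the crawl stats dict dumped at the end of a run.
    Only the two keys mentioned above are read; missing keys count as 0.
    """
    hits = stats.get("httpcache/hit", 0)
    misses = stats.get("httpcache/miss", 0)
    total = hits + misses
    if total == 0:
        # No cache counters at all: the cache middleware probably did not run.
        return None
    return hits / total


# Made-up example values, for illustration only.
example_stats = {"httpcache/hit": 3, "httpcache/miss": 1}
print(cache_hit_ratio(example_stats))
```

A ratio close to 1 would mean nearly everything came from the cache; None would suggest the cache was not involved at all.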

The default HTTP cache storage is on the filesystem:
HTTPCACHE_STORAGE = 'scrapy.extensions.httpcache.FilesystemCacheStorage'

There is also a storage backend using LevelDB, LeveldbCacheStorage,
but it's not on by default. You would have had to set it yourself.
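
To make the location concrete, here is a minimal sketch (plain Python, no Scrapy import) of where the filesystem cache ends up and how it could be wiped. It assumes the default HTTPCACHE_DIR = 'httpcache' and that a relative path resolves under the project's .scrapy data directory, which matches the storage.cachedir output quoted below. resolve_cache_dir and clear_cache are hypothetical helper names, not Scrapy APIs.

```python
import os
import shutil
import tempfile


def resolve_cache_dir(project_root, httpcache_dir="httpcache"):
    """Guess the filesystem cache directory for a project.

    Assumption: an absolute HTTPCACHE_DIR is used as-is, a relative one
    lives under <project_root>/.scrapy/.
    """
    if os.path.isabs(httpcache_dir):
        return httpcache_dir
    return os.path.join(project_root, ".scrapy", httpcache_dir)


def clear_cache(cache_dir):
    """Delete the cache directory; return True if something was removed."""
    if os.path.isdir(cache_dir):
        shutil.rmtree(cache_dir)
        return True
    return False


# Demo on a throwaway directory standing in for a project root.
demo_root = tempfile.mkdtemp()
demo_cache = resolve_cache_dir(demo_root)
os.makedirs(os.path.join(demo_cache, "myspider"))  # hypothetical spider name
print(demo_cache)
print(clear_cache(demo_cache))  # the directory existed and is removed
print(clear_cache(demo_cache))  # nothing left to remove
shutil.rmtree(demo_root)
```

If the directory printed for your real project is empty or missing and you still see ['cached'] in the logs, then the setting in effect is probably not the default one, which is why the startup logs matter.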


On Tuesday, February 28, 2017 at 11:24:32 AM UTC+1, cristimocean wrote:
>
> I am trying to delete the cache and I can't find its location.
> It's not in .scrapy directory because:
> 1. I deleted the .scrapy folder
> 2. I created a blank scrapy project
> so there is absolutely no way the cache is coming from the .scrapy folder
>
> This method:
>
> $ scrapy shell -s HTTPCACHE_ENABLED=True
> 2017-02-28 10:41:41 [scrapy.utils.log] INFO: Scrapy 1.3.2 started (bot: 
> httpbin)
> (...)
> >>> from scrapy.utils.misc import load_object
> >>> storage = load_object(settings['HTTPCACHE_STORAGE'])(settings)
> >>> storage.cachedir
> '/home/paul/scrapy/httpbin/.scrapy/httpcache'
>
> only gives me the same location inside .scrapy folder.
>
> I am 100% positive data is taken from a cache somewhere because:
> 1. it takes 15 seconds to run the spider (as opposed to 10-15 hours)
> 2. I stopped the internet and the spider continues to get data
> 3. requests have the cached flag
>
> 2017-02-28 10:47:04 [scrapy.core.engine] DEBUG: Crawled (200) <GET 
> http://www.example.com> (referer: None) ['cached']
>
>
> I tried to use a windows tool (process monitor) to see which files are 
> accessed.
> I see a lot of files created in the blank .scrapy folder that I just 
> created but no other large amounts of files being read from anywhere else.
> So the only sane explanation would be that Scrapy has a database which is 
> a single file and just reads from it (so this is why I don't see lots of 
> cache files being read because scrapy's source is a single database file)
>
> So my question is: is there such a thing as a default Scrapy database 
> where Scrapy keeps its cache?
> If not, then where is my cache magically reappearing from?
>
