A Bloom filter is a memory-efficient data structure commonly used to track URLs in exactly this situation.
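A minimal sketch of the Bloom-filter idea, assuming plain Python 3 and illustrative sizing constants (none of this code is from the thread): with a 1 MB bit array and 7 hash functions, 300,000 URLs give a false-positive rate well below 0.1%, versus the hundreds of megabytes a plain set of raw URLs might need. Bloom filters never produce false negatives, only occasional false positives.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter for approximate URL membership tests."""

        def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)  # 1 MB at the defaults

        def _positions(self, item):
            # Double hashing: derive k bit indices from two halves of one digest.
            digest = hashlib.sha256(item.encode("utf-8")).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    seen = BloomFilter()
    seen.add("http://example.com/a")
    print("http://example.com/a" in seen)  # True
    print("http://example.com/b" in seen)  # False, with high probability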
Did you try with 300k URLs? Even if they were 1 kB each, you're only at 300 MB of memory.

On Mar 3, 2015 1:14 AM, "Italo Maia" <[email protected]> wrote:
> Hello Morad, thanks for your answer.
>
> My deny list would be a little too big to handle if I did that: something
> around 300,000 records to add. Memory would probably go down on its knees
> too.
>
> Looking at this group's history, there are some suggestions regarding the
> duplicate filter. I'll try that first. Maybe preloading fingerprints from
> the database.
>
> On Thursday, February 26, 2015 at 16:27:45 UTC-3, Italo Maia wrote:
>>
>> I have a few spiders here that scrape quite a lot of links. I know that
>> Scrapy uses by default a "fingerprint" approach to avoid visiting the same
>> URL more than once. Is there a way for me to supply a previously harvested
>> list of fingerprints/URLs to it in order to speed up scraping?
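For the fingerprint-preloading idea, a hedged sketch: subclass Scrapy's RFPDupeFilter and seed its in-memory fingerprint set before the crawl starts. The import path is scrapy.dupefilters in Scrapy 1.0+ (scrapy.dupefilter in the 0.24-era releases this thread dates from), and the seed file name and one-fingerprint-per-line format are assumptions, not Scrapy conventions.

    from scrapy.dupefilters import RFPDupeFilter  # scrapy.dupefilter before 1.0

    class PreloadedDupeFilter(RFPDupeFilter):
        """RFPDupeFilter whose fingerprint set is seeded from disk at start-up."""

        def __init__(self, path=None, debug=False):
            super(PreloadedDupeFilter, self).__init__(path, debug)
            try:
                # Hypothetical seed file: one hex fingerprint per line,
                # dumped from a previous crawl or from the database.
                with open("seen_fingerprints.txt") as f:
                    self.fingerprints.update(line.strip() for line in f)
            except IOError:
                pass  # no seed file; behave exactly like the stock filter

Enable it in settings.py (the module path here is whatever your project uses):

    DUPEFILTER_CLASS = 'myproject.dupefilters.PreloadedDupeFilter'

Worth noting: the stock RFPDupeFilter already persists fingerprints to requests.seen and reloads them on restart when JOBDIR is set, which may cover this use case with no custom code at all.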
