A Bloom filter is a memory-efficient data structure commonly used to track URLs in exactly this situation.
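A minimal sketch of the Bloom-filter idea, assuming plain Python 3 and illustrative sizing constants (none of this code is from the thread): with a 1 MB bit array and 7 hash functions, 300,000 URLs give a false-positive rate well below 0.1%, versus the hundreds of megabytes a plain set of raw URLs might need. Bloom filters never produce false negatives, only occasional false positives.

    import hashlib

    class BloomFilter:
        """Minimal Bloom filter for approximate URL membership tests."""

        def __init__(self, size_bits=8 * 1024 * 1024, num_hashes=7):
            self.size = size_bits
            self.num_hashes = num_hashes
            self.bits = bytearray(size_bits // 8)  # 1 MB at the defaults

        def _positions(self, item):
            # Double hashing: derive k bit indices from two halves of one digest.
            digest = hashlib.sha256(item.encode("utf-8")).digest()
            h1 = int.from_bytes(digest[:8], "big")
            h2 = int.from_bytes(digest[8:16], "big")
            return [(h1 + i * h2) % self.size for i in range(self.num_hashes)]

        def add(self, item):
            for pos in self._positions(item):
                self.bits[pos // 8] |= 1 << (pos % 8)

        def __contains__(self, item):
            return all(self.bits[pos // 8] & (1 << (pos % 8))
                       for pos in self._positions(item))

    seen = BloomFilter()
    seen.add("http://example.com/a")
    print("http://example.com/a" in seen)  # True
    print("http://example.com/b" in seen)  # False, with high probability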
Did you try with 300k URLs? Even if they were 1 kB each, you're only at 300 MB of memory.

On Mar 3, 2015 1:14 AM, "Italo Maia" <[email protected]> wrote:
> Hello Morad, thanks for your answer.
>
> My deny list would be a little too big to handle if I did that: something
> around 300,000 records to add. Memory would probably go down on its knees
> too.
>
> Looking at this group's history, there are some suggestions regarding the
> duplicate filter. I'll try that first. Maybe preloading fingerprints from
> the database.
>
> On Thursday, February 26, 2015 at 16:27:45 UTC-3, Italo Maia wrote:
>>
>> I have a few spiders here that scrape quite a lot of links. I know that
>> Scrapy uses by default a "fingerprint" approach to avoid visiting the same
>> URL more than once. Is there a way for me to supply a previously harvested
>> list of fingerprints/URLs to it in order to speed up scraping?
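For the fingerprint-preloading idea, a hedged sketch: subclass Scrapy's RFPDupeFilter and seed its in-memory fingerprint set before the crawl starts. The import path is scrapy.dupefilters in Scrapy 1.0+ (scrapy.dupefilter in the 0.24-era releases this thread dates from), and the seed file name and one-fingerprint-per-line format are assumptions, not Scrapy conventions.

    from scrapy.dupefilters import RFPDupeFilter  # scrapy.dupefilter before 1.0

    class PreloadedDupeFilter(RFPDupeFilter):
        """RFPDupeFilter whose fingerprint set is seeded from disk at start-up."""

        def __init__(self, path=None, debug=False):
            super(PreloadedDupeFilter, self).__init__(path, debug)
            try:
                # Hypothetical seed file: one hex fingerprint per line,
                # dumped from a previous crawl or from the database.
                with open("seen_fingerprints.txt") as f:
                    self.fingerprints.update(line.strip() for line in f)
            except IOError:
                pass  # no seed file; behave exactly like the stock filter

Enable it in settings.py (the module path here is whatever your project uses):

    DUPEFILTER_CLASS = 'myproject.dupefilters.PreloadedDupeFilter'

Worth noting: the stock RFPDupeFilter already persists fingerprints to requests.seen and reloads them on restart when JOBDIR is set, which may cover this use case with no custom code at all.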
