Someone please correct me if they feel otherwise, but I don't really think that's Scrapy's strength.
I think of it as a great framework for spidering and data extraction. I usually do any post-processing (like duplicate detection) in a separate script. That way, if you improve your dupe-detection algorithm, it isn't tied to your data acquisition.

I could see a situation where you'd want to limit spidering based on duplicate content. Is that what you want to do? Or is it more of a content survey?

> On Nov 9, 2015, at 12:38 PM, Jim Priest <pri...@thecrumb.com> wrote:
>
> I'm just getting started with Scrapy and hacking up some proof of concepts
> for a bigger project...
>
> So far I have a basic spider going and saving my data to a MySQL db. So far
> so good! :)
>
> So now I'm trying to figure out the following...
>
> Say I have two domains companyA / companyB and each site has the same page -
> with possibly the same or different content on each page.
>
> companyA.com/about-us.html
> companyB.com/about-us.html
>
> How would you go about spidering both and comparing pages?
>
> I don't really need to know WHAT is different - just that either the pages
> are the same or not.
>
> Right now while I'm spidering companyA I'm storing a hash of the page in my
> db - but I'm not sure where in the process I could check companyB?
>
> Do I do that while spidering? Do I run two spiders over each site and then
> compare afterwards?
>
> Thanks for the help!
> Jim
>
> --
> You received this message because you are subscribed to the Google Groups
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to scrapy-users+unsubscr...@googlegroups.com.
> To post to this group, send email to scrapy-users@googlegroups.com.
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.
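
For what it's worth, here is a rough sketch of the "compare afterwards in a separate script" idea from the reply. It is only an illustration under a few assumptions: the spider has already stored one (url, hash) pair per page, the two sites use the same paths, and the names `page_hash` and `compare_sites` are made up for this example rather than anything Scrapy provides. The sample rows stand in for whatever you would read back from the MySQL table.

```python
import hashlib
from urllib.parse import urlparse


def page_hash(body: bytes) -> str:
    """Return a hex digest of the raw page body (hypothetical helper)."""
    return hashlib.sha256(body).hexdigest()


def compare_sites(rows, site_a, site_b):
    """Label each path seen on either site as 'same', 'different' or 'missing'.

    rows: iterable of (url, hash) pairs, e.g. read back from the spider's db.
    """
    a_key, b_key = site_a.lower(), site_b.lower()
    hashes = {a_key: {}, b_key: {}}
    for url, digest in rows:
        parts = urlparse(url)
        host = parts.netloc.lower()
        if host in hashes:  # ignore rows from any other domain
            hashes[host][parts.path] = digest

    report = {}
    for path in sorted(set(hashes[a_key]) | set(hashes[b_key])):
        a = hashes[a_key].get(path)
        b = hashes[b_key].get(path)
        if a is None or b is None:
            report[path] = "missing"  # page exists on only one site
        elif a == b:
            report[path] = "same"
        else:
            report[path] = "different"
    return report


# Sample data standing in for rows stored by the two spider runs.
rows = [
    ("http://companyA.com/about-us.html", page_hash(b"<p>About us</p>")),
    ("http://companyB.com/about-us.html", page_hash(b"<p>About us</p>")),
    ("http://companyA.com/contact.html", page_hash(b"<p>Call A</p>")),
    ("http://companyB.com/contact.html", page_hash(b"<p>Call B</p>")),
]
print(compare_sites(rows, "companyA.com", "companyB.com"))
```

Note that hashing the raw body flags any byte-level difference, including whitespace or a changed timestamp, so you may want to hash only the extracted content instead. Keeping this in its own script means you can change that choice without re-crawling.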