Someone please correct me if they feel otherwise, but I don't really think 
that's Scrapy's strength. 

I think of it as a great framework for spidering and data extraction. I 
usually do any post-processing (like duplicate detection) in a separate script. 
That way, if you improve your dupe-detection algorithm, it isn't tied to your 
data acquisition. 
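
As a rough sketch of what that separate script could look like (standard 
library only): the function names and the sample rows below are made up for 
illustration, and in practice the (url, hash) pairs would be read back from 
whatever table your spider writes to. It assumes both sites use the same URL 
paths, and that an exact hash match is what counts as "the same".

```python
import hashlib
from urllib.parse import urlparse


def page_hash(body: bytes) -> str:
    """Return a hex digest of the raw page body (hypothetical helper)."""
    return hashlib.sha256(body).hexdigest()


def compare_sites(rows):
    """Group (url, content_hash) rows by URL path and compare across domains.

    Returns {path: True/False} for every path seen on more than one domain;
    True means all domains stored the same hash for that path.
    """
    by_path = {}
    for url, content_hash in rows:
        parsed = urlparse(url)
        by_path.setdefault(parsed.path, {})[parsed.netloc] = content_hash

    result = {}
    for path, hashes in by_path.items():
        if len(hashes) > 1:  # skip pages that exist on only one site
            result[path] = len(set(hashes.values())) == 1
    return result


# Sample rows standing in for what the spider saved to the db.
rows = [
    ("http://companyA.com/about-us.html", page_hash(b"<html>About us</html>")),
    ("http://companyB.com/about-us.html", page_hash(b"<html>About us</html>")),
    ("http://companyA.com/contact.html", page_hash(b"<html>Call A</html>")),
    ("http://companyB.com/contact.html", page_hash(b"<html>Call B</html>")),
    ("http://companyA.com/jobs.html", page_hash(b"<html>Jobs</html>")),
]
report = compare_sites(rows)
for path, same in sorted(report.items()):
    print(path, "same" if same else "different")
```

Because the comparison only reads stored hashes, you can swap in a smarter 
notion of "same" later (e.g. hashing extracted text instead of raw HTML) 
without re-crawling anything.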

I could see a situation where you'd want to limit spidering based on dupe 
content. Is that what you want to do? Or is it more of a content survey?

> On Nov 9, 2015, at 12:38 PM, Jim Priest <pri...@thecrumb.com> wrote:
> 
> I'm just getting started with Scrapy and hacking up some proof of concepts 
> for a bigger project...
> 
> So far I have a basic spider going and saving my data to a MySQL db. So far 
> so good! :)
> 
> So now I'm trying to figure out the following...
> 
> Say I have two domains companyA / companyB and each site has the same page - 
> with possibly the same or different content on each page.
> 
> companyA.com/about-us.html
> companyB.com/about-us.html
> 
> How would you go about spidering both and comparing pages?
> 
> I don't really need to know WHAT is different - just that either the pages 
> are the same or not. 
> 
> Right now, while I'm spidering companyA, I'm storing a hash of the page in my 
> db - but I'm not sure where in the process I could check companyB?
> 
> Do I do that while spidering? Do I run two spiders over each site and then 
> compare afterwards?
> 
> Thanks for the help!
> Jim
> 
> 
> 
> -- 
> You received this message because you are subscribed to the Google Groups 
> "scrapy-users" group.
> To unsubscribe from this group and stop receiving emails from it, send an 
> email to scrapy-users+unsubscr...@googlegroups.com.
> To post to this group, send email to scrapy-users@googlegroups.com.
> Visit this group at http://groups.google.com/group/scrapy-users.
> For more options, visit https://groups.google.com/d/optout.

