Thank you, Travis. This information certainly points me in the right direction and clears up the confusion I had between ElasticSearch and MongoDB.

I'll start with mongo as my primary data store and work from there. I don't need to store full versions of the content, but I do need to store hashes -- ssdeep hashes, to be exact. These hashes will be used to establish a baseline of how much change to expect each hour/day; if the change percentage goes above that threshold, a notification will be generated. At this point I'm not sure ElasticSearch is even needed, but I'll wait and see. A rough sketch of what I have in mind is below.
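This is only a minimal sketch, assuming the python-ssdeep bindings and pymongo; the database/collection/field names and the fixed threshold are placeholders I haven't settled on yet:

    # Sketch: hash each crawled page with ssdeep, store the hash in MongoDB,
    # and flag the page when it differs too much from the previous crawl.
    # Names (crawler.page_hashes, url, hash, crawled_at) and CHANGE_THRESHOLD
    # are placeholders, not final decisions.
    import datetime

    import ssdeep                      # pip install ssdeep
    from pymongo import MongoClient, DESCENDING

    CHANGE_THRESHOLD = 25              # % change considered "significant" (placeholder)

    client = MongoClient("mongodb://localhost:27017")
    hashes = client.crawler.page_hashes

    def record_and_check(url, html):
        """Hash the page, compare against the most recent stored hash,
        and return True if the change exceeds the threshold."""
        new_hash = ssdeep.hash(html)

        previous = hashes.find_one({"url": url}, sort=[("crawled_at", DESCENDING)])

        hashes.insert_one({
            "url": url,
            "hash": new_hash,
            "crawled_at": datetime.datetime.utcnow(),
        })

        if previous is None:
            return False               # first crawl, nothing to compare against

        similarity = ssdeep.compare(previous["hash"], new_hash)   # 0..100
        change_pct = 100 - similarity
        return change_pct > CHANGE_THRESHOLD

Eventually the hard-coded threshold would be replaced by the per-hour/per-day baseline computed from the stored change percentages, and only crawls that exceed that baseline would trigger a notification.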
On Tuesday, January 27, 2015 at 9:40:42 PM UTC-5, Travis Leleu wrote:
>
> Are you planning on just storing the diffs, or full versions? If you're
> just storing the diffs, I'd use something flexible and queryable. I like
> JSON, but with flat files it's sometimes hard to get the section you need.
> Therefore, I use mongo -- schemaless, easy to set up. I think it's a great
> data storage layer despite its other flaws.
>
> Elastic isn't great as a primary data store. I usually couple it (via
> streams or other connectors) with a primary store (usually MySQL, sometimes
> Mongo), and set up a "river" (I think that's what Elastic calls it) from
> the primary to ES. I query structured records on the primary, and search
> on the ES instance.
>
> If all you're trying to do is detect whether a page has changed (rather than
> computing the diff), and space is at a premium, you could just hash the
> HTML (or parts of the HTML -- I recommend identifying the areas you want
> to follow changes on, and hashing that).
>
> Finally, if you are trying to point out the sections where the page
> changed, I'd use a prebuilt Python diff library rather than rolling your
> own. I don't have any advice on which one to use.
>
> On Tue, Jan 27, 2015 at 4:59 PM, JS <[email protected]> wrote:
>
>> Hi,
>>
>> I would like to crawl a particular set of websites every hour to detect
>> content changes, but I'm not sure what storage method would be best for my
>> use case. I could potentially store crawl results in JSON or CSV files,
>> use MongoDB, or some other solution like ElasticSearch (if it supports
>> historical records). But I'm not sure which pathway is the best option.
>> Is anyone currently storing and keeping a historical record of crawled
>> content? If so, what strategy are you using?
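P.S. A quick illustration of the partial-page hashing Travis describes above, using Scrapy's Selector. The CSS selector here is purely a placeholder for whatever region of the page actually matters on a given site:

    # Hash only the selected region of the page instead of the full HTML, so
    # boilerplate churn (ads, timestamps, nav) doesn't trigger alerts.
    # "div#main-content" is a hypothetical selector.
    import ssdeep
    from scrapy.selector import Selector

    def content_hash(html, css="div#main-content"):
        region = "".join(Selector(text=html).css(css).extract())
        return ssdeep.hash(region or html)   # fall back to the full page if the selector misses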
