Thank you, Travis.  This information certainly points me in the right 
direction and clears up the confusion I had between ElasticSearch and 
MongoDB.

I'll start with mongo as my primary data store and work from there.  I 
don't need to store full versions of the content, but I do need to store 
hashes -- ssdeep hashes, to be exact.  These hashes will be used to create 
a baseline of how much change to expect each hour/day.  If the change 
percentage for a given crawl exceeds that baseline, a notification will be 
generated.
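
Roughly what I have in mind for the comparison step (a sketch using the 
ssdeep Python bindings; notify() and the threshold value are placeholders):

    import ssdeep

    CHANGE_THRESHOLD = 25  # max % change tolerated per crawl (placeholder)

    def check_page(url, html, previous_hash):
        # Hash the new content, compare against the stored hash, alert on drift.
        current_hash = ssdeep.hash(html)
        if previous_hash is not None:
            similarity = ssdeep.compare(previous_hash, current_hash)  # 0..100
            change = 100 - similarity
            if change > CHANGE_THRESHOLD:
                notify(url, change)  # placeholder alerting hook
        return current_hash  # persist this for the next run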

At this point, I'm not sure ElasticSearch is even needed, but I'll wait and 
see.

On Tuesday, January 27, 2015 at 9:40:42 PM UTC-5, Travis Leleu wrote:
>
> Are you planning on just storing the diffs, or full versions?  If you're 
> just storing the diffs, I'd use something flexible and queryable.  I like 
> JSON, but with flat files it can be hard to pull out just the section you 
> need.  That's why I use mongo -- schemaless, easy to set up.  I think it's 
> a great data storage layer despite its flaws.
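>
> For example, one record per crawl in mongo could look something like this 
> (a rough sketch assuming pymongo and a local mongod; the field names and 
> variables are illustrative):
>
>     from datetime import datetime
>     from pymongo import MongoClient
>
>     crawls = MongoClient().crawl_db.crawls  # assumes mongod on localhost
>
>     crawls.insert_one({
>         "url": page_url,                  # illustrative variables
>         "fetched_at": datetime.utcnow(),
>         "diff": diff_text,                # or just a hash, if that's all you keep
>     })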
>
> Elastic isn't great as a primary data store.  I usually couple it (via 
> streams or other connectors) with a primary store (usually MySQL, sometimes 
> Mongo), and set up a "river" (I think that's what Elastic calls it) from 
> the primary to ES.  I query structured records on the primary, and search 
> on the ES instance.
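>
> If a river is more plumbing than you want, a hand-rolled sync loop is only 
> a few lines (rough sketch assuming pymongo and the official elasticsearch-py 
> client; the index name is illustrative):
>
>     from pymongo import MongoClient
>     from elasticsearch import Elasticsearch
>
>     crawls = MongoClient().crawl_db.crawls
>     es = Elasticsearch()
>
>     for doc in crawls.find():
>         doc_id = str(doc.pop("_id"))  # ObjectId isn't JSON-serializable
>         es.index(index="crawls", id=doc_id, body=doc)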
>
> If all you're trying to do is detect whether a page has changed (rather 
> than computing the diff), and space is at a premium, you could just hash 
> the HTML (or parts of the HTML -- I recommend identifying the areas you 
> want to follow changes on, and hashing those).
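>
> Something like this, for instance (a sketch; the selector library is 
> parsel, which scrapy's own selectors are built on, and the CSS selector 
> is hypothetical):
>
>     import hashlib
>     from parsel import Selector
>
>     def content_hash(html):
>         # Hash only the region you care about, not headers/ads/timestamps.
>         section = Selector(text=html).css("div#main-content").get() or ""
>         return hashlib.sha256(section.encode("utf-8")).hexdigest()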
>
> Finally, if you are trying to point out the sections where the page 
> changed, I'd use a prebuilt Python diff library rather than rolling your 
> own.  I don't have any advice on which one to use.
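>
> For what it's worth, the stdlib difflib is enough to experiment with 
> before picking anything third-party (sketch):
>
>     import difflib
>
>     def html_diff(old, new):
>         # Unified diff of two page versions, one HTML line per diff line.
>         return "\n".join(difflib.unified_diff(
>             old.splitlines(), new.splitlines(),
>             fromfile="previous", tofile="current", lineterm=""))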
>
> On Tue, Jan 27, 2015 at 4:59 PM, JS <[email protected]> wrote:
>
>
>> Hi,
>>
>> I would like to crawl a particular set of websites every hour to detect 
>> content changes, but I'm not sure what storage method would be best for my 
>> use case.  I could potentially store crawl results in JSON or CSV files, 
>> use MongoDB, or some other solution like Elasticsearch (if it supports 
>> historical records), but I'm not sure which is the best option.  Is anyone 
>> currently storing and keeping a historical record of crawled content?  If 
>> so, what strategy are you using?
>>
