No problem. I've tried a number of different solutions, but what works best for me is to use the scrapy-mongodb pipeline to store all the extracted data in a queryable format. I then write adapters after the scrape to wrangle it into the format I need. I think one reason this works is that Mongo lets me get a feel for the data while it's being scraped, which informs design decisions. Otherwise, I find myself putting effort up front into defining the data schema, only to have to change it later.

YMMV, but that's my preferred method.
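For reference, a minimal hand-rolled pipeline along these lines might look like the sketch below. This is an illustration built directly on pymongo rather than the scrapy-mongodb package's actual API, and MONGO_URI / MONGO_DB are hypothetical setting names:

    import pymongo

    class MongoStoragePipeline(object):
        """Dump every scraped item into a MongoDB collection as-is,
        so the data can be queried while the crawl is still running."""

        collection_name = "raw_items"

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # MONGO_URI and MONGO_DB are hypothetical setting names.
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
                mongo_db=crawler.settings.get("MONGO_DB", "scrapy"),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Schemaless insert; post-scrape adapters reshape the data later.
            self.db[self.collection_name].insert_one(dict(item))
            return item

You'd enable it through the usual ITEM_PIPELINES setting in settings.py.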
On Wed, Jan 28, 2015 at 4:31 PM, JS <[email protected]> wrote:

> Thank you, Travis. This information certainly points me in the right
> direction and clears up the confusion I had between Elasticsearch and
> MongoDB.
>
> I'll start with Mongo as my primary data store and work from there. I
> don't need to store full versions of the content, but I do need to store
> hashes (ssdeep hashes, to be exact). These hashes will be used to
> establish a baseline of how much change to expect each hour/day. If the
> change percentage exceeds that threshold, a notification will be
> generated. (See the ssdeep sketch after the quoted thread.)
>
> At this point, I'm not sure Elasticsearch is even needed, but I'll wait
> and see.
>
> On Tuesday, January 27, 2015 at 9:40:42 PM UTC-5, Travis Leleu wrote:
>>
>> Are you planning on storing just the diffs, or full versions? If you're
>> just storing the diffs, I'd use something flexible and queryable. I like
>> JSON, but flat files can make it hard to get at the section you need.
>> So I use Mongo: schemaless and easy to set up. I think it's a great
>> data storage layer despite its other flaws.
>>
>> Elasticsearch isn't great as a primary data store. I usually couple it
>> (via streams or other connectors) with a primary store (usually MySQL,
>> sometimes Mongo) and set up a "river" (I think that's what Elastic
>> calls it) from the primary to ES. I query structured records on the
>> primary and search on the ES instance.
>>
>> If all you're trying to do is detect whether a page has changed (rather
>> than compute the diff), and space is at a premium, you could just hash
>> the HTML, or parts of it: I recommend identifying the areas whose
>> changes you want to follow and hashing only those. (A section-hashing
>> sketch follows the thread.)
>>
>> Finally, if you want to point out the sections where the page changed,
>> I'd use a prebuilt Python diff library rather than rolling your own. I
>> don't have any advice on which one to use. (A difflib sketch follows
>> the thread.)
>>
>> On Tue, Jan 27, 2015 at 4:59 PM, JS <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I would like to crawl a particular set of websites every hour to
>>> detect content changes, but I'm not sure what storage method would be
>>> best for my use case. I could potentially store crawl results in JSON
>>> or CSV files, use MongoDB, or some other solution like Elasticsearch
>>> (if it supports historical records). But I'm not sure which pathway is
>>> the best option. Is anyone currently storing and keeping a historical
>>> record of crawled content? If so, what strategy are you using?
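For what it's worth, the ssdeep baseline JS describes could look something like the sketch below, using the python-ssdeep bindings (ssdeep.hash / ssdeep.compare). The threshold value and the stored-hash lookup are made up for illustration:

    import ssdeep

    ALERT_THRESHOLD = 80  # hypothetical: similarity below this triggers a notification

    def check_for_change(url, new_html, previous_hashes):
        """Compare this crawl's fuzzy hash against the stored hash for url.

        Returns the similarity score (0-100) if it fell below the
        threshold, or None if there is nothing to report.
        """
        new_hash = ssdeep.hash(new_html)
        old_hash = previous_hashes.get(url)
        previous_hashes[url] = new_hash
        if old_hash is None:
            return None  # first observation; nothing to compare against
        similarity = ssdeep.compare(old_hash, new_hash)  # 0 = different, 100 = identical
        if similarity < ALERT_THRESHOLD:
            return similarity  # caller generates the notification
        return None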
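Travis's suggestion to hash only the areas you care about might be sketched like this, assuming lxml (plus the cssselect package) and a hypothetical selector for the watched region:

    import hashlib
    from lxml import html

    WATCHED_SELECTOR = "div#content"  # hypothetical: the region being monitored

    def section_fingerprint(page_html):
        """Hash only the watched section, so churn in boilerplate
        (ads, timestamps, navigation) doesn't register as a change."""
        tree = html.fromstring(page_html)
        nodes = tree.cssselect(WATCHED_SELECTOR)  # requires the cssselect package
        if not nodes:
            return None  # selector didn't match; treat as unknown
        text = nodes[0].text_content()
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

Comparing two fingerprints for equality then tells you whether that section changed between crawls.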
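And for pointing out where a page changed, the thread doesn't name a specific diff library, but the standard library's difflib is one prebuilt option:

    import difflib

    def html_diff(old_html, new_html):
        """Return a unified diff between two stored versions of a page."""
        return "".join(difflib.unified_diff(
            old_html.splitlines(keepends=True),
            new_html.splitlines(keepends=True),
            fromfile="previous",
            tofile="current",
        ))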
