No problem. I've tried a number of different solutions, but what works best for me is to use the scrapy-mongodb pipeline to store all the extracted data in a queryable format. I then write adapters after the scrape to wrangle it into the format I need. I think one reason this works is that Mongo lets me get a feel for the data while it's being scraped, which informs design decisions. Otherwise, I find myself putting effort up front into defining the data schema, only to have to change it later.

YMMV, but that's my preferred method.
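For reference, a minimal hand-rolled pipeline along these lines might look like the sketch below. This is an illustration built directly on pymongo rather than the scrapy-mongodb package's actual API, and MONGO_URI / MONGO_DB are hypothetical setting names:

    import pymongo

    class MongoStoragePipeline(object):
        """Dump every scraped item into a MongoDB collection as-is,
        so the data can be queried while the crawl is still running."""

        collection_name = "raw_items"

        def __init__(self, mongo_uri, mongo_db):
            self.mongo_uri = mongo_uri
            self.mongo_db = mongo_db

        @classmethod
        def from_crawler(cls, crawler):
            # MONGO_URI and MONGO_DB are hypothetical setting names.
            return cls(
                mongo_uri=crawler.settings.get("MONGO_URI", "mongodb://localhost:27017"),
                mongo_db=crawler.settings.get("MONGO_DB", "scrapy"),
            )

        def open_spider(self, spider):
            self.client = pymongo.MongoClient(self.mongo_uri)
            self.db = self.client[self.mongo_db]

        def close_spider(self, spider):
            self.client.close()

        def process_item(self, item, spider):
            # Schemaless insert; post-scrape adapters reshape the data later.
            self.db[self.collection_name].insert_one(dict(item))
            return item

You'd enable it through the usual ITEM_PIPELINES setting in settings.py.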
On Wed, Jan 28, 2015 at 4:31 PM, JS <[email protected]> wrote:

> Thank you, Travis. This information certainly points me in the right
> direction and clears up the confusion I had between Elasticsearch and
> MongoDB.
>
> I'll start with Mongo as my primary data store and work from there. I
> don't need to store full versions of the content, but I do need to store
> hashes (ssdeep hashes, to be exact). These hashes will be used to
> establish a baseline of how much change to expect each hour/day. If the
> change percentage exceeds that threshold, a notification will be
> generated. (See the ssdeep sketch after the quoted thread.)
>
> At this point, I'm not sure Elasticsearch is even needed, but I'll wait
> and see.
>
> On Tuesday, January 27, 2015 at 9:40:42 PM UTC-5, Travis Leleu wrote:
>>
>> Are you planning on storing just the diffs, or full versions? If you're
>> just storing the diffs, I'd use something flexible and queryable. I like
>> JSON, but flat files can make it hard to get at the section you need.
>> So I use Mongo: schemaless and easy to set up. I think it's a great
>> data storage layer despite its other flaws.
>>
>> Elasticsearch isn't great as a primary data store. I usually couple it
>> (via streams or other connectors) with a primary store (usually MySQL,
>> sometimes Mongo) and set up a "river" (I think that's what Elastic
>> calls it) from the primary to ES. I query structured records on the
>> primary and search on the ES instance.
>>
>> If all you're trying to do is detect whether a page has changed (rather
>> than compute the diff), and space is at a premium, you could just hash
>> the HTML, or parts of it: I recommend identifying the areas whose
>> changes you want to follow and hashing only those. (A section-hashing
>> sketch follows the thread.)
>>
>> Finally, if you want to point out the sections where the page changed,
>> I'd use a prebuilt Python diff library rather than rolling your own. I
>> don't have any advice on which one to use. (A difflib sketch follows
>> the thread.)
>>
>> On Tue, Jan 27, 2015 at 4:59 PM, JS <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I would like to crawl a particular set of websites every hour to
>>> detect content changes, but I'm not sure what storage method would be
>>> best for my use case. I could potentially store crawl results in JSON
>>> or CSV files, use MongoDB, or some other solution like Elasticsearch
>>> (if it supports historical records). But I'm not sure which pathway is
>>> the best option. Is anyone currently storing and keeping a historical
>>> record of crawled content? If so, what strategy are you using?
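For what it's worth, the ssdeep baseline JS describes could look something like the sketch below, using the python-ssdeep bindings (ssdeep.hash / ssdeep.compare). The threshold value and the stored-hash lookup are made up for illustration:

    import ssdeep

    ALERT_THRESHOLD = 80  # hypothetical: similarity below this triggers a notification

    def check_for_change(url, new_html, previous_hashes):
        """Compare this crawl's fuzzy hash against the stored hash for url.

        Returns the similarity score (0-100) if it fell below the
        threshold, or None if there is nothing to report.
        """
        new_hash = ssdeep.hash(new_html)
        old_hash = previous_hashes.get(url)
        previous_hashes[url] = new_hash
        if old_hash is None:
            return None  # first observation; nothing to compare against
        similarity = ssdeep.compare(old_hash, new_hash)  # 0 = different, 100 = identical
        if similarity < ALERT_THRESHOLD:
            return similarity  # caller generates the notification
        return None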
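Travis's suggestion to hash only the areas you care about might be sketched like this, assuming lxml (plus the cssselect package) and a hypothetical selector for the watched region:

    import hashlib
    from lxml import html

    WATCHED_SELECTOR = "div#content"  # hypothetical: the region being monitored

    def section_fingerprint(page_html):
        """Hash only the watched section, so churn in boilerplate
        (ads, timestamps, navigation) doesn't register as a change."""
        tree = html.fromstring(page_html)
        nodes = tree.cssselect(WATCHED_SELECTOR)  # requires the cssselect package
        if not nodes:
            return None  # selector didn't match; treat as unknown
        text = nodes[0].text_content()
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

Comparing two fingerprints for equality then tells you whether that section changed between crawls.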
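And for pointing out where a page changed, the thread doesn't name a specific diff library, but the standard library's difflib is one prebuilt option:

    import difflib

    def html_diff(old_html, new_html):
        """Return a unified diff between two stored versions of a page."""
        return "".join(difflib.unified_diff(
            old_html.splitlines(keepends=True),
            new_html.splitlines(keepends=True),
            fromfile="previous",
            tofile="current",
        ))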
