Re: Handling increasingly-intensive processes
Now that someone's said it, "just store tweets" seems like such a duh move. Thanks!

-sam

On Monday, December 15, 2014 6:35:13 AM UTC-5, Thomas Heller wrote:
Why do you not use a CouchDB view to create the hash-tag map on the server and then just append-only the tweets?

--
You received this message because you are subscribed to the Google Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your first post.
To unsubscribe from this group, send email to clojure+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/clojure?hl=en
---
You received this message because you are subscribed to the Google Groups Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.
Re: Handling increasingly-intensive processes
Hey,

Without knowing much about your application/business needs, it's hard to speculate about what might be good for you.

The root of your problem might be CouchDB, since it was never meant for Big Data, and since we are talking about tweets I generally assume there are a lot of them. I'm not sure what your map value looks like, but I think you do something like

    obj = (couch/get hash-tag)
    obj = (my-app/update obj new-tweet)
    (couch/put hash-tag obj)

which will always perform badly, since you cannot do this concurrently, except with CRDTs, which CouchDB doesn't support since it does its own MVCC. I don't remember exactly how its conflict resolution works, but I think it was last write wins. Caching will not save you for long, since writes will eventually become the bottleneck.

Why not use a CouchDB view to create the hash-tag map on the server and then just append-only the tweets? The view's map function can then just emit each tweet under the hash-tag key (once for each tag), and the reduce function can build your map. That should perform a lot better up to a certain point, and you can control how up-to-date your view index has to be.

Anyway, it might be best to choose another database. Regardless of which database you are using, updating a single place concurrently is going to be a problem. An Atom in Clojure makes this look like a no-brainer, but under high load it can still blow up, since it has no back-pressure of any kind.

Big Data and Distributed Systems are hard and cannot be described briefly. Without exact knowledge of what your app/business needs look like, it is impossible to make the correct recommendation.

HTH,
/thomas
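A minimal Clojure sketch of the view idea suggested above, run in memory rather than against CouchDB: tweets are only ever appended, a map step emits one pair per hashtag, and a reduce step folds the pairs into the hashtag map. The tweet shape (`:id`, `:hashtags`) and the names `emit-by-tag` and `build-index` are assumptions for illustration; they are not part of the original application or of any CouchDB API.

```clojure
;; Hypothetical tweet shape: {:id 1 :text "..." :hashtags ["clojure" "couchdb"]}

(defn emit-by-tag
  "Map step: emit one [hashtag tweet-id] pair per hashtag on the tweet."
  [tweet]
  (for [tag (:hashtags tweet)]
    [tag (:id tweet)]))

(defn build-index
  "Reduce step: fold the emitted pairs into {hashtag [tweet-id ...]}."
  [tweets]
  (reduce (fn [index [tag id]]
            (update index tag (fnil conj []) id))
          {}
          (mapcat emit-by-tag tweets)))

;; Append-only log of tweets. Writers only ever add a new tweet;
;; no per-hashtag document is read back and rewritten.
(def tweets
  [{:id 1 :text "a" :hashtags ["clojure" "couchdb"]}
   {:id 2 :text "b" :hashtags ["clojure"]}
   {:id 3 :text "c" :hashtags []}])

(build-index tweets)
;; => {"clojure" [1 2], "couchdb" [1]}
```

In CouchDB itself the map step would live in a design document, normally as a JavaScript function that calls emit once per hashtag. A reduce whose output grows with its input tends to perform poorly there, so querying the map output by hashtag key may be enough on its own.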
Handling increasingly-intensive processes
I'm (still) pulling tweets from Twitter, processing them, and storing them in CouchDB with hashtags as doc ids, such that if a tweet contains 3 hashtags, that tweet will be indexed under each of those 3 hashtags. My application hits CouchDB for the relevant document and uses Cheshire to convert the resulting string to a map. The map's values consist of a few string values and an array of all the tweets that contain that hashtag.

The problem is thus with common hashtags: the more tweets contain a given hashtag, the longer that hashtag's tweets array will be, and, additionally, the more often that document will be retrieved from CouchDB. The likelihood and magnitude of performance hits on my app are therefore correlated, which is Bad.

I'm reaching out to you all for suggestions about how best to deal with this situation. Some way of caching something, somehow? I'm at a loss, but I want to believe there's a solution.

Thanks,
-sam
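For illustration, the storage model described in this post might look roughly like the following Clojure sketch, with a plain map standing in for CouchDB. The field names (`:hashtag`, `:tweets`) and the function `index-tweet` are assumptions, since the post does not give the actual document fields. Each new tweet reads and rewrites the whole document of every hashtag it carries, which is why the cost grows with a hashtag's popularity.

```clojure
;; Hypothetical in-memory stand-in for the database: {doc-id doc}.

(defn index-tweet
  "Read-modify-write every hashtag document touched by the tweet.
  Returns the updated store."
  [store tweet]
  (reduce (fn [s tag]
            (let [doc (get s tag {:hashtag tag :tweets []})]   ; read
              (assoc s tag (update doc :tweets conj tweet))))  ; modify + write
          store
          (:hashtags tweet)))

(def store
  (reduce index-tweet
          {}
          [{:id 1 :hashtags ["clojure" "couchdb"]}
           {:id 2 :hashtags ["clojure"]}]))

(count (get-in store ["clojure" :tweets]))
;; => 2
```

A popular hashtag's document is both the largest and the most frequently rewritten one, which matches the correlation between likelihood and magnitude described above.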
Re: Handling increasingly-intensive processes
Apologies for the email flood; my email client decided to do the most useless of all possible actions.

On Sun, Dec 14, 2014 at 11:53 PM, Ashton Kemerling ashtonkemerl...@gmail.com wrote:
Honestly, it sounds like you'll either need to move the indexing into memor
Re: Handling increasingly-intensive processes
Sam,

It sounds like you need to either find a caching strategy that works for your application's needs, or adjust how your data is stored (model or data store). Without knowing more about your performance and business needs I can't really speculate with any confidence.

--
Ashton

On Sunday, December 14, 2014 8:54:04 PM UTC-7, Sam Raker wrote:
Some way of caching something, somehow? I'm at a loss, but I want to believe there's a solution.