Re: Handling increasingly-intensive processes

2014-12-16 Thread Sam Raker
Now that someone's said it, "just store the tweets" seems like such a duh
move.

Thanks!
-sam



-- 
You received this message because you are subscribed to the Google
Groups Clojure group.
To post to this group, send email to clojure@googlegroups.com
Note that posts from new members are moderated - please be patient with your 
first post.
To unsubscribe from this group, send email to
clojure+unsubscr...@googlegroups.com
For more options, visit this group at
http://groups.google.com/group/clojure?hl=en
--- 
You received this message because you are subscribed to the Google Groups 
Clojure group.
To unsubscribe from this group and stop receiving emails from it, send an email 
to clojure+unsubscr...@googlegroups.com.
For more options, visit https://groups.google.com/d/optout.


Re: Handling increasingly-intensive processes

2014-12-15 Thread Thomas Heller
Hey,

without knowing much about your application/business needs, it's hard to
speculate about what might be good for you. The root of your problem might
be CouchDB, since it was never meant for Big Data, and since we are talking
about tweets I assume there are a lot of them. I'm not sure what your map
value looks like, but I think you do something like

;; read-modify-write on a single document (couch/* and my-app/* are placeholders)
(let [obj  (couch/get hash-tag)
      obj' (my-app/update obj new-tweet)]
  (couch/put hash-tag obj'))

This will always perform badly, since you cannot do it concurrently except
with CRDTs, which CouchDB doesn't support since it does its own MVCC. I
don't remember exactly how its conflict resolution works, but I think it
was last write wins. Caching will not save you for long, since writes will
eventually become the bottleneck.
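
A minimal sketch of why this read-modify-write cycle struggles under concurrency. It assumes an in-memory stand-in for the database that rejects a write carrying a stale revision, which is roughly how CouchDB treats a direct update to an existing document. The names `store`, `get-doc`, `put-doc!` and `add-tweet` are made up for illustration and are not part of any CouchDB client.

```clojure
;; In-memory stand-in for a document store with revision checks.
;; Every name here is hypothetical; this is not a CouchDB client.
(def store (atom {"clojure" {:rev 1 :tweets []}}))

(defn get-doc [id]
  (get @store id))

(defn put-doc!
  "Stores doc only if its :rev matches the stored revision.
   Returns :ok on success and :conflict on a stale revision."
  [id doc]
  (locking store
    (if (= (:rev doc) (:rev (get @store id)))
      (do (swap! store assoc id (update doc :rev inc))
          :ok)
      :conflict)))

(defn add-tweet [doc tweet]
  (update doc :tweets conj tweet))

;; Two writers read the same revision of the hot document...
(def a (get-doc "clojure"))
(def b (get-doc "clojure"))

;; ...but only the first write can succeed.
(def first-write  (put-doc! "clojure" (add-tweet a "tweet-1")))  ; :ok
(def second-write (put-doc! "clojure" (add-tweet b "tweet-2")))  ; :conflict

;; The losing writer has to re-read the whole document and try again.
(def retry-write
  (put-doc! "clojure" (add-tweet (get-doc "clojure") "tweet-2"))) ; :ok
```

The more writers target the same hashtag, the more of them lose the race and repeat the full read and write, which is why a cache in front of the reads only postpones the problem.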

Why not use a CouchDB view to create the hash-tag map on the server and
store the tweets append-only? The view's map function can then just emit
each tweet under the hash-tag key (once for each tag), and the reduce
function can build your map. That should perform a lot better up to a
certain point, and you can control how up-to-date your view index has to be.
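
As a rough in-memory analogy of that view (not CouchDB code: a real view's map function lives in a design document and is written in JavaScript by default), the idea can be sketched in Clojure. The tweet shape, the `:hashtags` field and the function names are assumptions for illustration.

```clojure
;; Each tweet is stored once, as its own document, and never updated.
;; The :hashtags field is an assumed shape, not something from the thread.
(def tweets
  [{:id "t1" :text "learning #clojure"      :hashtags ["clojure"]}
   {:id "t2" :text "#clojure with #couchdb" :hashtags ["clojure" "couchdb"]}
   {:id "t3" :text "hello #couchdb"         :hashtags ["couchdb"]}])

(defn map-fn
  "Plays the role of the view's map function:
   emits one [key value] pair per hashtag in the tweet."
  [tweet]
  (for [tag (:hashtags tweet)]
    [tag (:id tweet)]))

(defn build-index
  "Plays the role of the view index: hashtag -> vector of tweet ids."
  [docs]
  (reduce (fn [index [tag id]]
            (update index tag (fnil conj []) id))
          {}
          (mapcat map-fn docs)))

(def tag-index (build-index tweets))

(get tag-index "clojure")  ; => ["t1" "t2"]
(get tag-index "couchdb")  ; => ["t2" "t3"]
```

The point of the design is that writers only ever add new tweet documents, so no two writers contend for the same document, and the per-hashtag grouping is derived on the server instead of being maintained by hand.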

Anyway, it might be best to choose another database. Regardless of what
database you are using, updating a single place concurrently is going to be
a problem. An Atom in Clojure makes this look like a no-brainer, but under
high load it can still blow up since it has no back-pressure of any kind.
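
A small sketch of what contention on a single Atom looks like, assuming eight threads that all append to the same hashtag. `swap!` re-runs its update function whenever another thread got in first, so the function is invoked at least as many times as there are writes, and nothing slows the writers down. The names `hot-index`, `attempts` and `add-tweet!` are hypothetical.

```clojure
(def hot-index (atom {}))  ; hashtag -> vector of tweet ids
(def attempts  (atom 0))   ; how many times the update fn actually ran

(defn add-tweet! [tag id]
  (swap! hot-index
         (fn [m]
           ;; Side effect on purpose: counts every attempt, including retries.
           (swap! attempts inc)
           (update m tag (fnil conj []) id))))

;; 8 threads, 1000 writes each, all hitting the same key.
(let [threads (doall
                (for [t (range 8)]
                  (Thread. (fn []
                             (dotimes [i 1000]
                               (add-tweet! "clojure" (str t "-" i)))))))]
  (doseq [^Thread th threads] (.start th))
  (doseq [^Thread th threads] (.join th)))

(def total-writes (count (get @hot-index "clojure")))

(println "writes:" total-writes "attempts:" @attempts)
```

All 8000 writes land, but the attempt count is usually higher than 8000 on a multi-core machine, and the gap grows with the number of writers and the cost of the update function.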

Big Data and Distributed Systems are hard and cannot be covered briefly.
Without exact knowledge of what your app/business needs look like, it is
impossible to make the correct recommendation.

HTH,
/thomas




Handling increasingly-intensive processes

2014-12-14 Thread Sam Raker
I'm (still) pulling tweets from Twitter, processing them, and storing them
in CouchDB with hashtags as doc ids, such that if a tweet contains 3
hashtags, that tweet will be indexed under each of those 3 hashtags. My
application hits CouchDB for the relevant document and uses Cheshire to
convert the resulting string to a map. The map's values consist of a few
string values and an array that consists of all the tweets that contain
that hashtag. The problem is thus with common hashtags: the more tweets
contain a given hashtag, the longer that hashtag's tweets array will be,
and, additionally, the more often that document will be retrieved from
CouchDB. The likelihood and magnitude of performance hits on my app are
therefore correlated, which is Bad.
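
To put rough numbers on that, here is a sketch of the document shape described above and how its serialized size grows with the number of tweets. The field names are assumptions (only "a few string values" and a tweets array are specified), and `pr-str` stands in for JSON encoding purely to get a character count.

```clojure
;; Hypothetical document shape; the field names are guesses.
(defn hashtag-doc [tag n]
  {:_id    tag
   :lang   "en"
   :tweets (vec (for [i (range n)]
                  {:id i :text (str "tweet " i " #" tag)}))})

(defn payload-size
  "Rough stand-in for the size of the serialized document, in characters."
  [doc]
  (count (pr-str doc)))

;; A rare hashtag versus a common one.
(def rare-size   (payload-size (hashtag-doc "obscure" 10)))
(def common-size (payload-size (hashtag-doc "clojure" 10000)))

(println "rare:" rare-size "common:" common-size)
```

Every read or update of the common hashtag moves and parses the whole array, so the cost per request grows with the popularity of the hashtag.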

I'm reaching out to you all for suggestions about how best to deal with 
this situation. Some way of caching something, somehow? I'm at a loss, but 
I want to believe there's a solution.


Thanks,
-sam



Re: Handling increasingly-intensive processes

2014-12-14 Thread Ashton Kemerling
Apologies for the email flood; my email client decided to do the most
useless of all possible actions.


On Sun, Dec 14, 2014 at 11:53 PM, Ashton Kemerling 
ashtonkemerl...@gmail.com wrote:
Honestly, it sounds like you'll either need to move the indexing into 
memor




Re: Handling increasingly-intensive processes

2014-12-14 Thread Ashton Kemerling
Sam,

It sounds like you need to either find a caching strategy that works for 
your application's needs, or you'll need to adjust how your data is stored 
(model or data store). Without knowing more about your performance and 
business needs I can't really speculate with any confidence.

--
Ashton

