Re: abusing cassandra's multi DC abilities

2014-02-24 Thread Jared Biel
Have you heard of https://github.com/Comcast/cmb? Maybe it's along the
lines of what you're looking for.


On 22 February 2014 22:33, Jonathan Haddad j...@jonhaddad.com wrote:

 Upfront TLDR: We want to do stuff (reindex documents, bust cache) when
 changed data from DC1 shows up in DC2.

 Full Story:
 We're planning on adding data centers throughout the US.  Our platform is
 used for business communications.  Each DC currently utilizes
 Elasticsearch and Redis.  A message can be sent from one user to another, and the
 intent is that it would be seen in near-real-time.  This means that 2
 people may be using different data centers, and the messages need to
 propagate from one to the other.

 On the plus side, we know we get this with Cassandra (fist pump) but the
 other pieces, not so much.  Even if they did work, there's all sorts of
 race conditions that could pop up from having different pieces of our
 architecture communicating over different channels.  From this, we've
 arrived at the idea that since Cassandra is the authoritative data source,
 we might be able to trigger events in DC2 based on activity coming through
 either the commit log or some other means.  One idea was to use a CF with a
 low gc time as a means of transporting messages between DCs, and watching
 the commit logs for deletes to that CF in order to know when we need to do
 things like reindex a document (or a new document), bust cache, etc.
 Facebook did something similar with their modifications to MySQL to
 include cache keys in the replication log.

 Assuming this is sane, I'd want to avoid having the same event register on
 3 servers, thus registering 3 items in the queue when only one should be
 there.  So, for any piece of data replicated from the other DC, I'd need a
 way to determine if it was supposed to actually trigger the event or not.
 (Maybe it looks at the token and determines if the current server falls in
 the token range?)  Or is there a better way?

 So, my questions to all ye Cassandra users:

 1. Is this even sane?
 2. Is anyone doing it?

 --
 Jon Haddad
 http://www.rustyrazorblade.com
 skype: rustyrazorblade



Re: abusing cassandra's multi DC abilities

2014-02-24 Thread Todd Fast
Hi Jonathan--

First, best wishes for success with your platform.

Frankly, I think the architecture you described is only going to cause
you major trouble. I'm left wondering why you don't either use something
like XMPP (of which several implementations can handle this kind of
federated scenario) or simply have internal (REST) APIs to send a message
from the backend in one DC to the backend in another DC.

There are a bunch of ways to approach this problem: You could also use
Redis pubsub (though a bit brittle), SQS, or any number of other approaches
that would be simpler and more robust than what you described. I'd urge you
to really consider another approach.

Best,
Todd




Re: abusing cassandra's multi DC abilities

2014-02-24 Thread Jonathan Haddad
Thanks for the input, Todd.  I've considered a few of the options you
listed.  I've ruled out Redis because it's not really built for multi-DC
use.  I've got nothing against XMPP or SQS.  However, they introduce race
conditions as well as all sorts of edge cases (missed messages, for
instance).  Since Cassandra is the source of truth, why not piggyback a
useful message on the true source of data itself?
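
To make the race concrete: a notification sent over a side channel (XMPP,
SQS, a REST call) can reach DC2 before the row it refers to has replicated
there. Below is a minimal sketch of a guard against that, assuming each
notification carries the timestamp of the write it announces. `Notification`,
`LocalStore`, and `handle` are hypothetical names for illustration, not part
of Cassandra or any of the tools mentioned.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Notification:
    doc_id: str
    write_ts: int  # timestamp of the write that triggered this notification


class LocalStore:
    """Hypothetical stand-in for the local DC's view of the replicated data."""

    def __init__(self) -> None:
        # doc_id -> timestamp of the newest write seen locally
        self._rows: Dict[str, int] = {}

    def apply_replicated_write(self, doc_id: str, write_ts: int) -> None:
        # Keep only the newest timestamp, mimicking last-write-wins.
        self._rows[doc_id] = max(self._rows.get(doc_id, -1), write_ts)

    def latest_ts(self, doc_id: str) -> Optional[int]:
        return self._rows.get(doc_id)


def handle(note: Notification, store: LocalStore) -> str:
    """Return "process" when local data is at least as new as the
    notification, otherwise "retry-later" because replication is behind."""
    local_ts = store.latest_ts(note.doc_id)
    if local_ts is None or local_ts < note.write_ts:
        return "retry-later"
    return "process"
```

Carrying the event in the same write as the data is the appeal of the
piggyback idea: there would be no second channel that can run ahead of
replication, so a check like this would not be needed.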





-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade


abusing cassandra's multi DC abilities

2014-02-22 Thread Jonathan Haddad
Upfront TLDR: We want to do stuff (reindex documents, bust cache) when
changed data from DC1 shows up in DC2.

Full Story:
We're planning on adding data centers throughout the US.  Our platform is
used for business communications.  Each DC currently utilizes
Elasticsearch and Redis.  A message can be sent from one user to another, and the
intent is that it would be seen in near-real-time.  This means that 2
people may be using different data centers, and the messages need to
propagate from one to the other.

On the plus side, we know we get this with Cassandra (fist pump) but the
other pieces, not so much.  Even if they did work, there's all sorts of
race conditions that could pop up from having different pieces of our
architecture communicating over different channels.  From this, we've
arrived at the idea that since Cassandra is the authoritative data source,
we might be able to trigger events in DC2 based on activity coming through
either the commit log or some other means.  One idea was to use a CF with a
low gc time as a means of transporting messages between DCs, and watching
the commit logs for deletes to that CF in order to know when we need to do
things like reindex a document (or a new document), bust cache, etc.
Facebook did something similar with their modifications to MySQL to
include cache keys in the replication log.
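
A minimal sketch of the consumer side of that idea, assuming each row in the
transport CF can be decoded into an event with a unique id, a kind, and a
document id. How rows are actually observed (commit log, polling, or
something else) is deliberately left out. `Event`, `Dispatcher`, and the
handler kinds are hypothetical names, not an existing API.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Set


@dataclass(frozen=True)
class Event:
    event_id: str  # unique per change; used to drop repeats
    kind: str      # e.g. "reindex" or "bust_cache" (assumed names)
    doc_id: str


class Dispatcher:
    """Routes decoded events to handlers, running each event_id at most once."""

    def __init__(self) -> None:
        self._handlers: Dict[str, Callable[[str], None]] = {}
        # In-memory only; a real consumer would need to expire old ids.
        self._seen: Set[str] = set()

    def register(self, kind: str, handler: Callable[[str], None]) -> None:
        self._handlers[kind] = handler

    def dispatch(self, event: Event) -> bool:
        """Return True if a handler ran, False for repeats or unknown kinds."""
        if event.event_id in self._seen:
            return False
        handler = self._handlers.get(event.kind)
        if handler is None:
            return False
        handler(event.doc_id)
        self._seen.add(event.event_id)
        return True
```

The `_seen` set lives in one process, so it only guards against repeats
observed by a single node.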

Assuming this is sane, I'd want to avoid having the same event register on
3 servers, thus registering 3 items in the queue when only one should be
there.  So, for any piece of data replicated from the other DC, I'd need a
way to determine if it was supposed to actually trigger the event or not.
(Maybe it looks at the token and determines if the current server falls in
the token range?)  Or is there a better way?
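
The token-range check could be sketched roughly as follows. This is a
simplified model, not Cassandra's actual replica placement: it assumes one
token per node (no vnodes), ignores racks, and uses an MD5-based stand-in
hash rather than a real partitioner. `token_for`, `Ring`, and `should_fire`
are hypothetical names. The rule is that only the first node in the local DC
found walking clockwise from the key's token fires the event.

```python
import bisect
import hashlib
from typing import Iterable, List, Tuple


def token_for(key: str) -> int:
    # Stand-in hash for illustration; NOT Cassandra's partitioner.
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


class Ring:
    """Simplified token ring: one (token, node, dc) entry per node."""

    def __init__(self, entries: Iterable[Tuple[int, str, str]]) -> None:
        self._entries: List[Tuple[int, str, str]] = sorted(entries)
        self._tokens: List[int] = [t for t, _, _ in self._entries]

    def first_replica_in_dc(self, key: str, dc: str) -> str:
        # A node owns the range ending at its token, so start at the
        # first token >= the key's token and walk clockwise.
        start = bisect.bisect_left(self._tokens, token_for(key))
        n = len(self._entries)
        for i in range(n):
            _, node, node_dc = self._entries[(start + i) % n]
            if node_dc == dc:
                return node
        raise LookupError(f"no nodes in data center {dc!r}")


def should_fire(ring: Ring, key: str, local_node: str, local_dc: str) -> bool:
    """True only on the one node per DC that is first in line for this key."""
    return ring.first_replica_in_dc(key, local_dc) == local_node
```

With this rule exactly one node per DC fires for a given key. If that node
is down when the write arrives, the event would be delayed until the write
reaches it by some other route, so this is a starting point rather than a
complete answer.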

So, my questions to all ye Cassandra users:

1. Is this even sane?
2. Is anyone doing it?

-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade


abusing cassandra's multi DC abilities

2014-02-21 Thread Jonathan Haddad
Upfront TLDR: We want to do stuff (reindex documents, bust cache) when
changed data from DC1 shows up in DC2.

Full Story:
We're planning on adding data centers throughout the US.  Our platform is
used for business communications.  Each DC currently utilizes
Elasticsearch and Redis.  A message can be sent from one user to another, and the
intent is that it would be seen in near-real-time.  This means that 2
people may be using different data centers, and the messages need to
propagate from one to the other.

On the plus side, we know we get this with Cassandra (fist pump) but the
other pieces, not so much.  Even if they did work, there's all sorts of
race conditions that could pop up from having different pieces of our
architecture communicating over different channels.  From this, we've
arrived at the idea that since Cassandra is the authoritative data source,
we might be able to trigger events in DC2 based on activity coming through
either the commit log or some other means.  One idea was to use a CF with a
low gc time as a means of transporting messages between DCs, and watching
the commit logs for deletes to that CF in order to know when we need to do
things like reindex a document (or a new document), bust cache, etc.
Facebook did something similar with their modifications to MySQL to
include cache keys in the replication log.

Assuming this is sane, I'd want to avoid having the same event register on
3 servers, thus registering 3 items in the queue when only one should be
there.  So, for any piece of data replicated from the other DC, I'd need a
way to determine if it was supposed to actually trigger the event or not.
(Maybe it looks at the token and determines if the current server falls in
the token range?)  Or is there a better way?

So, my questions to all ye Cassandra users:

1. Is this even sane?
2. Is anyone doing it?


-- 
Jon Haddad
http://www.rustyrazorblade.com
skype: rustyrazorblade