Re: Achieving isolation on single row modifications with batch_mutate

2010-11-30 Thread E S
I'm chunking up a larger blob.  The size of each row varies (averaging around
500 KB - 1 MB), with some outliers in the 50 MB range.  However, when I do an
update, I can usually just read/update a portion of that blob.  A lot of my
read operations can also work on a smaller chunk.  The number of columns will
depend on the size of the blob itself.  I'm also considering using
supercolumns to get finer-grained saves.
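
(To make the chunk layout concrete, here's a rough Python sketch of what I
mean; a plain dict stands in for the Cassandra row, and the chunk size and
column naming are just illustrative, not something I've settled on.)

    # Split a blob into fixed-size chunk columns so a partial update only
    # rewrites the columns it touches.  A dict stands in for the Cassandra
    # row; CHUNK_SIZE and the column naming are illustrative only.
    CHUNK_SIZE = 64 * 1024  # hypothetical 64 KB per column

    def split_into_chunks(blob):
        """Return {column_name: chunk_bytes} covering the whole blob."""
        return {'chunk_%06d' % i: blob[off:off + CHUNK_SIZE]
                for i, off in enumerate(range(0, len(blob), CHUNK_SIZE))}

    def update_range(row, blob_offset, new_bytes):
        """Overwrite part of the blob, touching only the affected chunk
        columns.  Assumes the touched chunks already exist in the row."""
        first = blob_offset // CHUNK_SIZE
        last = (blob_offset + len(new_bytes) - 1) // CHUNK_SIZE
        for i in range(first, last + 1):
            name = 'chunk_%06d' % i
            chunk = bytearray(row[name])
            start = max(blob_offset - i * CHUNK_SIZE, 0)
            src_off = max(i * CHUNK_SIZE - blob_offset, 0)
            piece = new_bytes[src_off:src_off + CHUNK_SIZE - start]
            chunk[start:start + len(piece)] = piece
            row[name] = bytes(chunk)  # one column mutation per touched chunk

    row = split_into_chunks(b'\x00' * (3 * CHUNK_SIZE))  # a 3-chunk blob
    update_range(row, CHUNK_SIZE - 10, b'\xff' * 20)     # spans chunks 0 and 1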

My biggest problem is that I will have to update these rows a lot (several
times a day) and often very quickly (processing 15,000 in 2-3 minutes).  While
I think I could probably scale up with a lot of hardware to meet that load, it
seems like I'm doing much, much more work than I need to (processing 15 GB of
data in 2-3 minutes as opposed to 100 MB).  I also worry about handling our
future data size needs.

I can split the blob up without a lot of extra complexity, but I'm worried
about how readers can see a non-corrupted version of the object, since
sometimes I'll have to update multiple chunks as one unit.





From: Tyler Hobbs ty...@riptano.com
To: user@cassandra.apache.org
Sent: Tue, November 30, 2010 12:57:07 AM
Subject: Re: Achieving isolation on single row modifications with batch_mutate

In this case, it sounds like you should combine columns A and B if you
are writing them both at the same time, reading them both at the same
time, and need them to be consistent.

Obviously, you're probably dealing with more than two columns here, but
there's generally not any value in splitting something into multiple columns
if you're always writing and reading all of them at the same time.

Or are you talking about chunking huge blobs across a row?

- Tyler


On Sat, Nov 27, 2010 at 10:12 AM, E S tr1skl...@yahoo.com wrote:

I'm trying to figure out the best way to achieve single row modification
isolation for readers.

As an example, I have 2 rows (1,2) with 2 columns (a,b).  If I modify both
rows, I don't care if the user sees the write operations completed on 1 and
not on 2 for a short time period (seconds).  I also don't care if when reading
row 1 the user gets the new value, and then on a re-read gets the old value
(within a few seconds).  Because of this, I have been planning on using a
consistency level of one.

However, if I modify both columns A,B on a single row, I need both changes on
the row to be visible/invisible atomically.  It doesn't matter if they both
become visible and then both invisible as the data propagates across nodes,
but a half-completed state on an initial read will basically be returning
corrupt data given my app's consistency requirements.  My understanding from
the FAQ is that this single row multicolumn change provides no read isolation,
so I will have this problem.  Is this correct?  If so:

Question 1:  Is there a way to get this type of isolation without using a
distributed locking mechanism like cages?

Question 2:  Are there any plans to implement this type of isolation within
Cassandra?

Question 3:  If I went with a distributed locking mechanism, what consistency
level would I need to use with Cassandra?  Could I still get away with a
consistency level of one?  It seems that if the initial write is done in a
non-isolated way, but if cross-node row synchronizations are done all or
nothing, I could still use one.

Question 4:  Does anyone know of a good c# alternative to cages/zookeeper?

Thanks for any help with this!



Re: Achieving isolation on single row modifications with batch_mutate

2010-11-30 Thread Jonathan Ellis
On Sat, Nov 27, 2010 at 10:12 AM, E S tr1skl...@yahoo.com wrote:
 I'm trying to figure out the best way to achieve single row modification
 isolation for readers.

I have a lot of No's for you. :)

 As an example, I have 2 rows (1,2) with 2 columns (a,b).  If I modify both
 rows, I don't care if the user sees the write operations completed on 1 and
 not on 2 for a short time period (seconds).  I also don't care if when reading
 row 1 the user gets the new value, and then on a re-read gets the old value
 (within a few seconds).  Because of this, I have been planning on using a
 consistency level of one.

 However, if I modify both columns A,B on a single row, I need both changes on
 the row to be visible/invisible atomically.  It doesn't matter if they both
 become visible and then both invisible as the data propagates across nodes,
 but a half-completed state on an initial read will basically be returning
 corrupt data given my app's consistency requirements.  My understanding from
 the FAQ is that this single row multicolumn change provides no read isolation,
 so I will have this problem.  Is this correct?  If so:

 Question 1:  Is there a way to get this type of isolation without using a
 distributed locking mechanism like cages?

No.

 Question 2:  Are there any plans to implement this type of isolation within
 Cassandra?

No.

 Question 3:  If I went with a distributed locking mechanism, what consistency
 level would I need to use with Cassandra?  Could I still get away with a
 consistency level of one?

Maybe.  If you want to guarantee that you see the most recent write,
then ONE will not be high enough. But if all you care about is seeing
all of the update or none of it, then ONE + locking will be fine.
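
Roughly this pattern, as a minimal sketch (threading.Lock stands in for a
distributed lock like cages/ZooKeeper, and the client object and its methods
are hypothetical placeholders for whatever driver you use):

    import threading

    # "ONE + locking" sketch: readers and writers of a given row take the
    # same lock, so a read never overlaps a half-applied batch_mutate.  The
    # lock is a stand-in for a distributed lock (cages/ZooKeeper); `client`
    # and its batch_insert/get_columns methods are hypothetical placeholders,
    # assumed to write and read at ConsistencyLevel.ONE.
    _row_locks = {}

    def lock_for(row_key):
        # In a real deployment this comes from the distributed lock service.
        return _row_locks.setdefault(row_key, threading.Lock())

    def isolated_update(client, row_key, columns):
        with lock_for(row_key):
            client.batch_insert(row_key, columns)  # all columns for one row

    def isolated_read(client, row_key, column_names):
        with lock_for(row_key):
            # Sees all of the last locked update or none of it; at ONE it is
            # not guaranteed to be the most recent write.
            return client.get_columns(row_key, column_names)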

 Question 4:  Does anyone know of a good c# alternative to cages/zookeeper?

No.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com


Re: Achieving isolation on single row modifications with batch_mutate

2010-11-30 Thread Ed Anuff
It's hard to tell without knowing the nature of the data you're writing,
but you might want to think about whether you can embed any sort of version
number and/or checksum into the column names of the chunk columns.  That
way, you could very easily determine that the data you wanted to retrieve
was not yet available for reading.  Are you able to do your partial blob
updates on an entire chunk at a time, or do you need to read the blob chunk,
modify a portion of it, and then write it back?  If it's the former, then it
might be possible for this to be accomplished without a locking solution.
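
Something along these lines, just as a sketch of one way to read that (a dict
stands in for the row, and the version-in-the-name scheme is made up for
illustration):

    # Put a version number in the chunk column names and flip a small
    # "current" pointer column last, so a reader never assembles chunks from
    # a half-written version.
    def write_version(row, version, chunks):
        """chunks: list of chunk byte strings for this version of the blob."""
        for i, data in enumerate(chunks):
            row['v%08d:chunk_%06d' % (version, i)] = data
        # Written last: readers only follow the pointer once every chunk is there.
        row['current'] = '%08d:%d' % (version, len(chunks))

    def read_current(row):
        version, count = row['current'].split(':')
        names = ['v%s:chunk_%06d' % (version, i) for i in range(int(count))]
        if not all(name in row for name in names):
            return None  # version not fully visible yet; retry or fall back
        return b''.join(row[name] for name in names)

    row = {}
    write_version(row, 1, [b'aaa', b'bbb'])
    assert read_current(row) == b'aaabbb'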

Ed

On Sat, Nov 27, 2010 at 8:12 AM, E S tr1skl...@yahoo.com wrote:

 I'm trying to figure out the best way to achieve single row modification
 isolation for readers.

 As an example, I have 2 rows (1,2) with 2 columns (a,b).  If I modify both
 rows, I don't care if the user sees the write operations completed on 1 and
 not on 2 for a short time period (seconds).  I also don't care if when reading
 row 1 the user gets the new value, and then on a re-read gets the old value
 (within a few seconds).  Because of this, I have been planning on using a
 consistency level of one.

 However, if I modify both columns A,B on a single row, I need both changes on
 the row to be visible/invisible atomically.  It doesn't matter if they both
 become visible and then both invisible as the data propagates across nodes,
 but a half-completed state on an initial read will basically be returning
 corrupt data given my app's consistency requirements.  My understanding from
 the FAQ is that this single row multicolumn change provides no read isolation,
 so I will have this problem.  Is this correct?  If so:

 Question 1:  Is there a way to get this type of isolation without using a
 distributed locking mechanism like cages?

 Question 2:  Are there any plans to implement this type of isolation within
 Cassandra?

 Question 3:  If I went with a distributed locking mechanism, what consistency
 level would I need to use with Cassandra?  Could I still get away with a
 consistency level of one?  It seems that if the initial write is done in a
 non-isolated way, but if cross-node row synchronizations are done all or
 nothing, I could still use one.

 Question 4:  Does anyone know of a good c# alternative to cages/zookeeper?

 Thanks for any help with this!







Re: Achieving isolation on single row modifications with batch_mutate

2010-11-30 Thread E S
I'm a little confused about #3.  Hopefully this clarifying question won't turn
the one maybe into a no. :)

I'm fine not reading the latest data, as long as on each individual read I see
all or none of the operations that occurred for a single one-row batch_mutate.

My concern is whether I have to lock the reads until the writes have propagated
to all nodes.  If I do a batch_mutate with a consistency of ONE on one row, a
reader can see partial changes during the write to that one node, and once the
batch_mutate completes, the change still hasn't propagated to the other nodes.
On a per-row basis, are the changes pushed to the other nodes in an isolated
manner?  If not, it seems like I would have to write with a consistency of ALL
and lock around that.




----- Original Message -----
From: Jonathan Ellis jbel...@gmail.com
To: user user@cassandra.apache.org
Sent: Tue, November 30, 2010 9:50:51 AM
Subject: Re: Achieving isolation on single row modifications with batch_mutate

On Sat, Nov 27, 2010 at 10:12 AM, E S tr1skl...@yahoo.com wrote:
 I'm trying to figure out the best way to achieve single row modification
 isolation for readers.

I have a lot of No's for you. :)

 As an example, I have 2 rows (1,2) with 2 columns (a,b).  If I modify both
 rows, I don't care if the user sees the write operations completed on 1 and
 not on 2 for a short time period (seconds).  I also don't care if when reading
 row 1 the user gets the new value, and then on a re-read gets the old value
 (within a few seconds).  Because of this, I have been planning on using a
 consistency level of one.

 However, if I modify both columns A,B on a single row, I need both changes on
 the row to be visible/invisible atomically.  It doesn't matter if they both
 become visible and then both invisible as the data propagates across nodes,
 but a half-completed state on an initial read will basically be returning
 corrupt data given my app's consistency requirements.  My understanding from
 the FAQ is that this single row multicolumn change provides no read isolation,
 so I will have this problem.  Is this correct?  If so:

 Question 1:  Is there a way to get this type of isolation without using a
 distributed locking mechanism like cages?

No.

 Question 2:  Are there any plans to implement this type of isolation within
 Cassandra?

No.

 Question 3:  If I went with a distributed locking mechanism, what consistency
 level would I need to use with Cassandra?  Could I still get away with a
 consistency level of one?

Maybe.  If you want to guarantee that you see the most recent write,
then ONE will not be high enough. But if all you care about is seeing
all of the update or none of it, then ONE + locking will be fine.

 Question 4:  Does anyone know of a good c# alternative to cages/zookeeper?

No.

-- 
Jonathan Ellis
Project Chair, Apache Cassandra
co-founder of Riptano, the source for professional Cassandra support
http://riptano.com





Re: Achieving isolation on single row modifications with batch_mutate

2010-11-29 Thread Tyler Hobbs
In this case, it sounds like you should combine columns A and B if you
are writing them both at the same time, reading them both at the same
time, and need them to be consistent.

Obviously, you're probably dealing with more than two columns here, but
there's generally not any value in splitting something into multiple columns
if you're always writing and reading all of them at the same time.

Or are you talking about chunking huge blobs across a row?

- Tyler

On Sat, Nov 27, 2010 at 10:12 AM, E S tr1skl...@yahoo.com wrote:

 I'm trying to figure out the best way to achieve single row modification
 isolation for readers.

 As an example, I have 2 rows (1,2) with 2 columns (a,b).  If I modify both
 rows, I don't care if the user sees the write operations completed on 1 and
 not on 2 for a short time period (seconds).  I also don't care if when reading
 row 1 the user gets the new value, and then on a re-read gets the old value
 (within a few seconds).  Because of this, I have been planning on using a
 consistency level of one.

 However, if I modify both columns A,B on a single row, I need both changes on
 the row to be visible/invisible atomically.  It doesn't matter if they both
 become visible and then both invisible as the data propagates across nodes,
 but a half-completed state on an initial read will basically be returning
 corrupt data given my app's consistency requirements.  My understanding from
 the FAQ is that this single row multicolumn change provides no read isolation,
 so I will have this problem.  Is this correct?  If so:

 Question 1:  Is there a way to get this type of isolation without using a
 distributed locking mechanism like cages?

 Question 2:  Are there any plans to implement this type of isolation within
 Cassandra?

 Question 3:  If I went with a distributed locking mechanism, what consistency
 level would I need to use with Cassandra?  Could I still get away with a
 consistency level of one?  It seems that if the initial write is done in a
 non-isolated way, but if cross-node row synchronizations are done all or
 nothing, I could still use one.

 Question 4:  Does anyone know of a good c# alternative to cages/zookeeper?

 Thanks for any help with this!