Re: why set replica placement strategy at keyspace level ?
Many of my mental models bother people :)

This particular one came from my understanding of Bigtable and the code. For me this works: I think of (internal) rows as roughly containing the CFs. In the CQL world it works for me as well: the partition key (the first part of the primary key) is what matters, and it identifies the storage container that holds the columns.

Your mileage may vary.

Aaron Morton
Freelance Cassandra Developer
New Zealand
@aaronmorton
http://www.thelastpickle.com

On 31/01/2013, at 4:43 PM, Edward Capriolo edlinuxg...@gmail.com wrote:
> That should not bother you. For example, if you're doing an HBase scan that crosses two column families, that could end up being two (disk) seeks. Having an API that hides the seeks from you does not give you better performance; it only helps you when you're debating with people who do not understand the fundamentals.
Re: why set replica placement strategy at keyspace level ?
On 30/01/2013, at 12:33 PM, Manu Zhang owenzhang1...@gmail.com wrote:
> I think a row mutation is isolated now, but is it across column families?

Correct, they are isolated, but only for an individual CF.

> By the way, the wiki page really needs updating.

You can update it if you would like to.

Cheers
Aaron Morton
Re: why set replica placement strategy at keyspace level ?
On Thu 31 Jan 2013 08:55:40 AM CST, aaron morton wrote:
> > I think a row mutation is isolated now, but is it across column families?
> Correct they are isolated, but only for an individual CF.
> > By the way, the wiki page really needs updating.
> You can update if you would like to.
Re: why set replica placement strategy at keyspace level ?
That should not bother you. For example, if you're doing an HBase scan that crosses two column families, that could end up being two (disk) seeks. Having an API that hides the seeks from you does not give you better performance; it only helps you when you're debating with people who do not understand the fundamentals.
Re: why set replica placement strategy at keyspace level ?
On Tue 29 Jan 2013 03:39:17 PM CST, aaron morton wrote:
> > So If I write to CF Users with rowkey=dean and to CF Schedules with rowkey=dean, it is actually one row?
> In my mental model that's correct. A RowMutation is a row key and a collection of (internal) ColumnFamilies which contain the columns to write for a single CF. This is the thing that is committed to the log, and then the changes in the ColumnFamilies are applied to each CF in an isolated way.
> > (must have missed that several times in the documentation).
> http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

From that wiki page, mutations against a single key are atomic but not isolated. I think a row mutation is isolated now, but is it isolated across column families?

By the way, the wiki page really needs updating.
Re: why set replica placement strategy at keyspace level ?
The row is the unit of replication: all values with the same storage engine row key in a KS are on the same nodes. If the replication settings were per CF, this would not hold. Not that it would be the end of the world, but that is the first thing that comes to mind.

Cheers
Aaron Morton
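A minimal sketch of the point above, assuming a toy hash ring rather than Cassandra's real partitioner or replication strategy classes. The `Keyspace` class, `token` function, and node names are all hypothetical. It only illustrates that when placement is decided from the row key and keyspace-level settings, the column family plays no part, so the same key lands on the same nodes for every CF.

```python
import bisect
import hashlib


def token(row_key: str) -> int:
    """Hash a row key to a ring position (toy stand-in for a partitioner)."""
    return int.from_bytes(hashlib.md5(row_key.encode("utf-8")).digest(), "big")


class Keyspace:
    """Toy keyspace: one ring and one replication factor shared by all its CFs."""

    def __init__(self, nodes, replication_factor):
        # Each node owns one token, derived here from its name for simplicity.
        self.ring = sorted((token(n), n) for n in nodes)
        self.rf = replication_factor

    def replicas(self, row_key, column_family=None):
        # column_family is accepted but deliberately ignored: placement
        # depends only on the row key and the keyspace-level settings.
        tokens = [t for t, _ in self.ring]
        start = bisect.bisect_right(tokens, token(row_key)) % len(self.ring)
        return [self.ring[(start + i) % len(self.ring)][1] for i in range(self.rf)]


ks = Keyspace(["node1", "node2", "node3", "node4", "node5"], replication_factor=3)
users_replicas = ks.replicas("dean", "Users")
schedules_replicas = ks.replicas("dean", "Schedules")
print(users_replicas == schedules_replicas)  # True
```

If each CF carried its own strategy or replication factor, `replicas` would have to take the CF into account, and the two lists could differ.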
Re: why set replica placement strategy at keyspace level ?
On Mon 28 Jan 2013 04:42:49 PM CST, aaron morton wrote:
> The row is the unit of replication, all values with the same storage engine row key in a KS are on the same nodes. if they were per CF this would not hold. Not that it would be the end of the world, but that is the first thing that comes to mind.

Is it important to store rows of different column families that share the same row key on the same node? AFAIK, Cassandra doesn't support getting all of them in a single call. Meanwhile, what's the drawback of setting RPS and RF at the column family level?

Another thing that's been confusing me: when we talk about the data model, should the row key be inside or outside a column family?

Thanks
Re: why set replica placement strategy at keyspace level ?
On 28/01/2013, at 11:22 PM, Manu Zhang owenzhang1...@gmail.com wrote:
> Another thing that's been confusing me is that when we talk about the data model should the row key be inside or outside a column family?

My mental model is:

cluster == database
keyspace == table
row == a row in a table
CF == a family of columns in one row

(I think that's different to others, but it works for me.)

> Is it important to store rows of different column families that share the same row key to the same node?

It makes the failure models a little easier to understand, e.g. everything keyed for user amorton is either available or not.

> Meanwhile, what's the drawback of setting RPS and RF at column family level?

Other than that it's baked in? We process all mutations for a row at the same time. If you write to 4 CFs with the same row key, that is considered one mutation, for one row. That one RowMutation is directed to the replicas using the ReplicationStrategy and atomically applied to the commit log.

If you had an RS per CF, that one mutation would be split into 4, which would then be sent to different replicas. Even if they went to the same replicas, they would be written to the commit log as different mutations.

So if you have an RS per CF, you lose atomic commits for writes to the same row.

Cheers
Aaron Morton
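The "one mutation versus four" argument can be sketched as follows. This is a toy model, not Cassandra's actual `RowMutation` class or commit log format: the `RowMutation` dataclass, both `commit_*` functions, and the CF names `Addresses` and `Settings` are assumptions made up for illustration. The commit log is just a Python list.

```python
from dataclasses import dataclass, field


@dataclass
class RowMutation:
    """Toy model: one row key plus the columns to write, grouped per CF."""
    row_key: str
    changes: dict = field(default_factory=dict)  # CF name -> {column: value}

    def add(self, cf, column, value):
        self.changes.setdefault(cf, {})[column] = value


def commit_keyspace_level(mutation, commit_log):
    # One replication strategy for the keyspace: the whole mutation is
    # appended as a single commit log record.
    commit_log.append((mutation.row_key, dict(mutation.changes)))


def commit_per_cf(mutation, commit_log):
    # Hypothetical per-CF replication: the mutation must be split, so each
    # CF's changes become a separate record (possibly on different nodes).
    for cf, columns in mutation.changes.items():
        commit_log.append((mutation.row_key, {cf: columns}))


m = RowMutation("dean")
for cf in ["Users", "Schedules", "Addresses", "Settings"]:
    m.add(cf, "updated_by", "dean")

single_log, split_log = [], []
commit_keyspace_level(m, single_log)
commit_per_cf(m, split_log)
print(len(single_log), len(split_log))  # 1 4
```

With a single record, the write to all four CFs is either in the log or not. With four records, a failure between appends can leave only some of them durable, which is the lost atomicity described above.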
Re: why set replica placement strategy at keyspace level ?
On 1/28/13 12:41 PM, aaron morton aa...@thelastpickle.com wrote:
> If you write to 4 CF's with the same row key that is considered one mutation

Hm, I never considered this, and never knew it either (very unintuitive from a user perspective, IMHO). So if I write to CF Users with rowkey=dean and to CF Schedules with rowkey=dean, it is actually one row? (It's so unintuitive that I had to ask to make sure I am reading that correctly.)

I guess I don't really have that case, since most of my row keys are GUIDs anyway, but it is very interesting and unexpected (not sure I really mind, I was just taken aback).

P.S. Not sure I ever minded losing atomic commits to the same row across CFs, as I never expected it in the first place, having used Cassandra for more than a year (I must have missed that several times in the documentation).

Thanks,
Dean
Re: why set replica placement strategy at keyspace level ?
On 29/01/2013, at 9:28 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
> So If I write to CF Users with rowkey=dean and to CF Schedules with rowkey=dean, it is actually one row?

In my mental model that's correct. A RowMutation is a row key and a collection of (internal) ColumnFamilies, each of which contains the columns to write for a single CF. This is the thing that is committed to the log, and then the changes in the ColumnFamilies are applied to each CF in an isolated way.

> (must have missed that several times in the documentation).

http://wiki.apache.org/cassandra/FAQ#batch_mutate_atomic

Cheers
Aaron Morton
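Here is a rough sketch of the "commit once, then apply to each CF in isolation" behaviour Aaron describes. It assumes a toy in-memory store. `ColumnFamilyStore` here is a made-up class (not the Cassandra class of the same name), the per-CF lock is only a stand-in for whatever isolation mechanism the server really uses, and the column names and values are invented.

```python
import threading


class ColumnFamilyStore:
    """Toy per-CF store; its lock stands in for per-CF isolation."""

    def __init__(self, name):
        self.name = name
        self.rows = {}
        self.lock = threading.Lock()

    def apply(self, row_key, columns):
        # All columns for this row in this CF become visible together.
        with self.lock:
            self.rows.setdefault(row_key, {}).update(columns)


def write(row_key, changes, commit_log, stores):
    # Step 1: the whole mutation (all CFs) goes to the commit log as one record.
    commit_log.append((row_key, changes))
    # Step 2: each CF's part is applied separately, so a reader may see
    # the Users change before the Schedules change.
    for cf_name, columns in changes.items():
        stores[cf_name].apply(row_key, columns)


stores = {name: ColumnFamilyStore(name) for name in ("Users", "Schedules")}
commit_log = []
write(
    "dean",
    {"Users": {"email": "dean@example.com"}, "Schedules": {"monday": "on call"}},
    commit_log,
    stores,
)
print(len(commit_log), stores["Users"].rows["dean"], stores["Schedules"].rows["dean"])
```

In this model the write is atomic across CFs at the log level (one record), while isolation holds only within each CF, since nothing spans the two `apply` calls.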
why set replica placement strategy at keyspace level ?
Although I've known Cassandra for quite a while, this question has only occurred to me recently: why are the replica placement strategy and replication factor set at the keyspace level? Would setting them at the column family level offer more flexibility? Is this because it's easier for users to manage an application? Or is it related to the internal implementation? Or have I just overlooked something?