Re: Wide rows splitting
You might find this interesting: https://medium.com/@foundev/synthetic-sharding-in-cassandra-to-deal-with-large-partitions-2124b2fd788b

Cheers,
Stefano

On Mon, Sep 18, 2017 at 5:07 AM, Adam Smith wrote:
> Dear community,
>
> I have a table with inlinks to URLs, i.e. many URLs point to
> http://google.com, and fewer URLs point to http://somesmallweb.page.
>
> It has very wide and very skinny rows - the distribution follows a power
> law. I do not know a priori how many columns a row has, and I can't
> identify a schema that would give a good partitioning.
>
> Currently I am thinking about introducing splits: the pk would be (URL,
> splitnumber), where splitnumber is initially 1, and hash(URL) mod
> splitnumber would determine the split on insert. I would need a separate
> table to maintain the splitnumber, and a spark-cassandra-connector job
> would count the columns and increase/double the number of splits on
> demand. This means I would then have to move e.g. (URL1, 0) -> (URL1, 1)
> when splitnumber becomes 2.
>
> Would you do the same? Is there a better way?
>
> Thanks!
> Adam
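A minimal sketch of the routing step Adam describes (names and the choice of md5 are illustrative, not from the thread): the split for a given URL must come from a hash that is stable across processes, otherwise inserts would land in different sub-partitions after a restart.

```python
import hashlib

def shard_for(url: str, split_count: int) -> int:
    """Deterministically route a URL to one of `split_count` sub-partitions.

    Uses a stable hash (md5) rather than Python's built-in hash(), which is
    randomized per process and would break routing across restarts.
    """
    digest = hashlib.md5(url.encode("utf-8")).hexdigest()
    return int(digest, 16) % split_count

# With split_count = 1 everything lands in split 0; after doubling to 2,
# roughly half of the entries hash to split 1 and must be moved, as the
# original mail points out.
url = "http://google.com/some/inlink"
assert shard_for(url, 1) == 0
assert shard_for(url, 2) in (0, 1)
```

Note that reads then have to fan out over all current splits for a URL, which is the usual cost of synthetic sharding.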
Re: wide rows
With CQL data modeling, everything is called a "row". But in CQL a row is just a logical concept. So if you think of "wide partition" instead of "wide row" (the partition is what is determined by the hash of the partition key), it will help the understanding a bit: one wide partition may contain multiple logical CQL rows - each CQL row just represents one actual storage column of the partition. Time-series data is usually a good fit for "wide-partition" data modeling, but please remember not to go too crazy with it.

Cheers,
Yabin

On Tue, Oct 18, 2016 at 11:23 AM, DuyHai Doan wrote:
> // user table: skinny partition
> CREATE TABLE user (
>     user_id uuid,
>     firstname text,
>     lastname text,
>     PRIMARY KEY ((user_id))
> );
>
> // sensor_data table: wide partition
> CREATE TABLE sensor_data (
>     sensor_id uuid,
>     date timestamp,
>     value double,
>     PRIMARY KEY ((sensor_id), date)
> );
Re: wide rows
// user table: skinny partition
CREATE TABLE user (
    user_id uuid,
    firstname text,
    lastname text,
    PRIMARY KEY ((user_id))
);

// sensor_data table: wide partition
CREATE TABLE sensor_data (
    sensor_id uuid,
    date timestamp,
    value double,
    PRIMARY KEY ((sensor_id), date)
);

On Tue, Oct 18, 2016 at 5:07 PM, S Ahmed wrote:
> Can someone clarify how you would model a "wide" row cassandra table?
> From what I understand, a wide row table is where you keep appending
> columns to a given row.
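A toy illustration of the contrast between the two schemas above (data values are made up): the skinny `user` partition holds one row with a fixed set of columns, while a `sensor_data` partition keeps growing as new clustered rows are appended under the same `sensor_id`.

```python
from collections import defaultdict

# Skinny partition: one logical row per partition key (user_id).
users = {"u1": {"firstname": "Ada", "lastname": "Lovelace"}}

# Wide partition: many clustered rows per partition key (sensor_id),
# keyed here by the `date` clustering column.
sensor_data = defaultdict(dict)
sensor_data["s1"]["2016-10-18T00:00"] = 1.5
sensor_data["s1"]["2016-10-18T01:00"] = 1.7

assert len(users["u1"]) == 2        # fixed set of columns, single row
assert len(sensor_data["s1"]) == 2  # grows with every new reading
```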
RE: wide rows
Hi,

Can someone clarify how you would model a "wide" row Cassandra table? From what I understand, a wide row table is one where you keep appending columns to a given row.

The other way to model a table would be the "regular" style where each row contains the data, so during a SELECT you would get multiple rows, as opposed to a wide row where you would get a single row but a subset of its columns.

Can someone show a simple data model that compares both styles?

Thanks.
Re: Wide rows best practices and GC impact
Hello,

I saw this earlier yesterday but didn't want to reply because I didn't know what the cause was. Basically I was using wide rows with Cassandra 1.x and was inserting data constantly. After about 18 hours the JVM would crash with a dump file. For some reason, when I removed the compaction throttling the problem disappeared. I've never really found out what the root cause was.

On Thu Dec 04 2014 at 2:49:57 AM Gianluca Borello gianl...@draios.com wrote:
> Thanks Robert, I really appreciate your help!
Re: Wide rows best practices and GC impact
On Tue, Dec 2, 2014 at 5:01 PM, Gianluca Borello gianl...@draios.com wrote:
> We mainly store time series-like data, where each data point is a binary
> blob of 5-20KB. We use wide rows, and try to put in the same row all the
> data that we usually need in a single query (but not more than that). As
> a result, our application logic is very simple (since we have to do just
> one query to read the data on average) and read/write response times are
> very satisfactory. This is a cfhistograms and a cfstats of our heaviest CF:

100MB is not HYOOOGE but is around the size where large rows can cause heap pressure.

You seem to be unclear on the implications of pending compactions, however. Briefly, pending compactions indicate that you have more SSTables than you should. As compaction both merges row versions and reduces the number of SSTables, a high number of pending compactions causes problems associated with both having too many row versions (fragmentation) and a large number of SSTables (per-SSTable heap/memory overhead, depending on version, such as bloom filters and index samples). In your case, it seems the problem is probably just the compaction throttle being too low.

My conjecture is that, given your normal data size and read/write workload, you are relatively close to GC pre-fail when compaction is working. When it stops working, you relatively quickly get into a state where you exhaust the heap because you have too many SSTables.

=Rob
http://twitter.com/rcolidba

PS - Given 30GB of RAM on the machine, you could consider investigating large-heap configurations; rbranson from Instagram has some slides out there on the topic. What you pay is longer stop-the-world GCs, IOW latency if you happen to be talking to a replica node when it pauses.
Re: Wide rows best practices and GC impact
Thanks Robert, I really appreciate your help!

I'm still unsure why Cassandra 2.1 seems to perform much better in that same scenario (even setting the same values of compaction threshold and number of compactors), but I guess we'll revisit that when we decide to upgrade to 2.1 in production.

On Dec 3, 2014 6:33 PM, Robert Coli rc...@eventbrite.com wrote:
> In your case, it seems the problem is probably just the compaction
> throttle being too low.
Re: Wide Rows - Data Model Design
Hello,

Yes, this is a wide row table design. The first column is your partition key; the remaining two columns are clustering columns. You will receive ordered result sets, based on (client_name, record_data), when running that query.

Jonathan

Jonathan Lacefield
Solution Architect | (404) 822 3487 | jlacefi...@datastax.com

On Fri, Sep 19, 2014 at 10:41 AM, Check Peck comptechge...@gmail.com wrote:
> I am trying to use the wide rows concept in my data modelling design for
> Cassandra. We are using Cassandra 2.0.6.
>
> CREATE TABLE test_data (
>     test_id int,
>     client_name text,
>     record_data text,
>     creation_date timestamp,
>     last_modified_date timestamp,
>     PRIMARY KEY (test_id, client_name, record_data)
> );
>
> So I came up with the above table design. Does my table fall under the
> category of wide rows in Cassandra or not? And is there any problem if I
> have three columns in my PRIMARY KEY? I guess the PARTITION KEY will be
> test_id, right? And what about the other two?
>
> In this table, we can have multiple record_data for the same client_name.
>
> Query pattern will be:
>
> select client_name, record_data from test_data where test_id = 1;
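A small sketch of the ordering Jonathan describes, using a toy in-memory model (the data values are made up): all rows sharing a `test_id` live in one partition, and within that partition they are kept sorted by the clustering columns `(client_name, record_data)`.

```python
from collections import defaultdict

# Rows of test_data with PRIMARY KEY (test_id, client_name, record_data),
# inserted in arbitrary order.
rows = [
    (1, "acme", "r2"),
    (1, "zenith", "r1"),
    (1, "acme", "r1"),
    (2, "acme", "r1"),
]

# Group by partition key, then sort by the clustering columns - this is
# the order "select ... where test_id = 1" returns rows in.
partitions = defaultdict(list)
for test_id, client_name, record_data in rows:
    partitions[test_id].append((client_name, record_data))
for clustered in partitions.values():
    clustered.sort()

assert partitions[1] == [("acme", "r1"), ("acme", "r2"), ("zenith", "r1")]
```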
Re: Wide Rows - Data Model Design
> Does my above table fall under the category of wide rows in Cassandra or
> not?

It depends on the cardinality. For each distinct test_id, how many combinations of client_name/record_data do you have? By the way, why do you put record_data as part of the primary key?

In your table: partition key = test_id, client_name = first clustering column, record_data = second clustering column.

On Fri, Sep 19, 2014 at 5:41 PM, Check Peck comptechge...@gmail.com wrote:
> I am trying to use the wide rows concept in my data modelling design for
> Cassandra. We are using Cassandra 2.0.6.
Re: Wide Rows - Data Model Design
@DuyHai - I have put that there because of this condition - in this table, we can have multiple record_data for the same client_name. It can be multiple combinations of client_name and record_data for each distinct test_id.

On Fri, Sep 19, 2014 at 8:48 AM, DuyHai Doan doanduy...@gmail.com wrote:
> It depends on the cardinality. For each distinct test_id, how many
> combinations of client_name/record_data do you have? By the way, why do
> you put record_data as part of the primary key?
Re: Wide Rows - Data Model Design
Ahh yes, sorry, I read too fast and missed it.

On Fri, Sep 19, 2014 at 5:54 PM, Check Peck comptechge...@gmail.com wrote:
> @DuyHai - I have put that there because of this condition - in this
> table, we can have multiple record_data for the same client_name.
Re: Wide rows (time series data) and ORM
> Can Kundera work with wide rows in an ORM manner?

What specifically are you looking for? A composite column based implementation can be built using Kundera. With the recent CQL3 developments, Kundera supports most of these. I think the POJO needs to be aware of the number of fields to be persisted (same as in CQL3).

-Vivek

On Wed, Oct 23, 2013 at 12:48 AM, Les Hartzman lhartz...@gmail.com wrote:
> As I'm becoming more familiar with Cassandra I'm still trying to shift my
> thinking from relational to NoSQL.
>
> Can Kundera work with wide rows in an ORM manner? In other words, can you
> actually design a POJO that fits the standard recipe for JPA usage? Would
> the queries return collections of the POJO to handle wide row data?
>
> I had considered using Spring and JPA for Cassandra, but it appears that
> other than basic configuration issues for Cassandra, to use Spring and
> JPA on a Cassandra database seems like an effort in futility if Cassandra
> is used as a NoSQL database instead of mimicking an RDBMS solution.
>
> If anyone can shed any light on this, I'd appreciate it.
>
> Thanks.
> Les
Re: Wide rows (time series data) and ORM
PlayOrm supports different types of wide rows, like embedded lists in the object, etc. There is a list of NoSQL patterns mixed with PlayOrm patterns on this page: http://buffalosw.com/wiki/patterns-page/

From: Les Hartzman lhartz...@gmail.com
Date: Tuesday, October 22, 2013 1:18 PM
To: user@cassandra.apache.org
Subject: Wide rows (time series data) and ORM

As I'm becoming more familiar with Cassandra I'm still trying to shift my thinking from relational to NoSQL. Can Kundera work with wide rows in an ORM manner?
Re: Wide rows (time series data) and ORM
Hi Vivek,

What I'm looking for are a couple of things as I'm gaining an understanding of Cassandra. With wide rows and time series data, how do you (or can you) handle this data in an ORM manner? Now I understand that with CQL3, doing a "select * from time_series_data" will return the data as multiple rows. So does handling this data equal the way you would deal with any mapping of objects to results in a relational manner? Would you still use a JPA approach, or is there a Cassandra/CQL3-specific way of interacting with the database?

I expect to use a compound key for partitioning/clustering. For example, I'm planning on creating a table as follows:

CREATE TABLE sensor_data (
    sensor_id text,
    date text,
    data_time_stamp timestamp,
    reading int,
    PRIMARY KEY ( (sensor_id, date), data_time_stamp )
);

The 'date' field will be day-specific, so that for each day there will be a new row created. So will I be able to define a POJO, SensorData, with the fields shown above and basically process each 'row' returned by CQL as another SensorData object?

Thanks.

Les

On Wed, Oct 23, 2013 at 1:22 AM, Vivek Mishra mishra.v...@gmail.com wrote:
> What specifically are you looking for? A composite column based
> implementation can be built using Kundera.
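The row-to-object mapping Les asks about can be sketched without any ORM at all (this is an illustration, not Kundera's API; class and field names simply mirror the schema above): each CQL row that comes back maps onto one plain object.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class SensorData:
    """One CQL row of sensor_data, as a plain entity object."""
    sensor_id: str
    date: str              # day bucket, e.g. "2013-10-23"
    data_time_stamp: datetime
    reading: int

def rows_to_objects(rows):
    """Map raw result rows (tuples in column order) to SensorData objects."""
    return [SensorData(*row) for row in rows]

# Two readings from the same (sensor_id, date) partition, i.e. two CQL rows.
rows = [
    ("s1", "2013-10-23", datetime(2013, 10, 23, 9, 0), 42),
    ("s1", "2013-10-23", datetime(2013, 10, 23, 9, 5), 43),
]
objs = rows_to_objects(rows)
assert objs[0].reading == 42 and objs[1].sensor_id == "s1"
```

A JPA-style mapper like Kundera essentially automates this step, plus the compound-key handling.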
Re: Wide rows (time series data) and ORM
Thanks Dean. I'll check that page out.

Les

On Wed, Oct 23, 2013 at 7:52 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
> PlayOrm supports different types of wide rows, like embedded lists in the
> object, etc. There is a list of NoSQL patterns mixed with PlayOrm
> patterns on this page: http://buffalosw.com/wiki/patterns-page/
Re: Wide rows (time series data) and ORM
Another idea is the open source Energy Databus project, which does time series data and is based on PlayORM (ORM is a bad name, since it is more NoSQL patterns and not really relational). http://www.nrel.gov/analysis/databus/

The Energy Databus project is mainly time series data with some metadata. I think NREL may be holding an Energy Databus summit soon (though again, it is 100% time series data and they need to rename it to just Databus, which has been talked about at NREL).

Dean

From: Les Hartzman lhartz...@gmail.com
Date: Wednesday, October 23, 2013 11:12 AM
To: user@cassandra.apache.org
Subject: Re: Wide rows (time series data) and ORM

Thanks Dean. I'll check that page out.
Re: Wide rows (time series data) and ORM
Hi,

> CREATE TABLE sensor_data (
>     sensor_id text,
>     date text,
>     data_time_stamp timestamp,
>     reading int,
>     PRIMARY KEY ( (sensor_id, date), data_time_stamp )
> );

Yes, you can create a POJO for this and map each row exactly to one POJO object. Please have a look at:
https://github.com/impetus-opensource/Kundera/wiki/Using-Compound-keys-with-Kundera

There are users who have built production systems using Kundera; please refer to:
https://github.com/impetus-opensource/Kundera/wiki/Kundera-in-Production-Deployments

I am working as a core committer on Kundera, so please do let me know if you have any query.

Sincerely,
-Vivek

On Wed, Oct 23, 2013 at 10:41 PM, Les Hartzman lhartz...@gmail.com wrote:
> What I'm looking for are a couple of things as I'm gaining an
> understanding of Cassandra. With wide rows and time series data, how do
> you (or can you) handle this data in an ORM manner?
Re: Wide rows (time series data) and ORM
Thanks Vivek. I'll look over those links tonight.

On Wed, Oct 23, 2013 at 4:20 PM, Vivek Mishra mishra.v...@gmail.com wrote:
> Yes, you can create a POJO for this and map each row exactly to one POJO
> object. Please have a look at:
> https://github.com/impetus-opensource/Kundera/wiki/Using-Compound-keys-with-Kundera
Re: Wide rows/composite keys clarification needed
So looking at Patrick McFadin's data modeling videos, I now know about using compound keys as a way of partitioning data on a by-day basis.

My other questions probably go more to the storage engine itself. How do you refer to the columns in the wide row? What kind of names are assigned to the columns?

Les

On Oct 20, 2013 9:34 PM, Les Hartzman lhartz...@gmail.com wrote:
> Please correct me if I'm not describing this correctly. But if I am
> collecting sensor data and have a table defined as follows:
>
> create table sensor_data (
>     sensor_id int,
>     time_stamp int,  // time to the hour granularity
>     voltage float,
>     amp float,
>     PRIMARY KEY (sensor_id, time_stamp)
> );
>
> The partitioning value is the sensor_id, and the rest of the PK
> components become part of the column name for the additional fields, in
> this case voltage and amp.
>
> What goes into determining what additional data is inserted into this
> row? The first time an insert takes place there will be one entry for all
> of the fields. Is there anything besides the sensor_id that is used to
> determine that subsequent insertions for that sensor will go into the
> same row as opposed to starting a new row? Based on something I read (but
> can't currently find again), I thought that as long as all of the
> elements of the PK remain the same (same sensor_id and still within the
> same hour as the first reading), the next insertion would be tacked onto
> the end of the first row. Is this correct?
>
> For subsequent entries into the same row for additional voltage/amp
> readings, what are the names of the columns for these readings? My
> understanding is that the column name becomes a concatenation of the
> non-row-key field names plus the data field names. So if the first
> go-around you have time_stamp:voltage and time_stamp:amp, what do the
> subsequent column names become?
>
> Thanks.
> Les
Re: Wide rows/composite keys clarification needed
If you're working with CQL, you don't need to worry about the column names, it's handled for you. If you specify multiple keys as part of the primary key, they become clustering keys and are mapped to the column names. So if you have a sensor_id / time_stamp, all your sensor readings will be in the same row in the traditional cassandra sense, sorted by your time_stamp. On Oct 21, 2013, at 4:27 PM, Les Hartzman lhartz...@gmail.com wrote: [...]
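To make the mapping concrete, here is a sketch of how two inserts into the sensor_data table from Les's question land in one storage-engine row (the internal cell names shown are schematic, not the exact on-disk encoding):

```cql
INSERT INTO sensor_data (sensor_id, time_stamp, voltage, amp)
VALUES (1, 1000, 5.0, 0.2);
INSERT INTO sensor_data (sensor_id, time_stamp, voltage, amp)
VALUES (1, 2000, 5.1, 0.3);

-- Internal wide row for partition sensor_id = 1; one cell per
-- non-key column, named by (clustering value, column name) and
-- kept sorted in that order:
--   (1000, 'amp')     -> 0.2
--   (1000, 'voltage') -> 5.0
--   (2000, 'amp')     -> 0.3
--   (2000, 'voltage') -> 5.1
```

Each new time_stamp appends more cells to the same partition, which is why subsequent readings for a sensor extend the existing row rather than starting a new one.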
Re: Wide rows/composite keys clarification needed
What if you plan on using Kundera and JPQL and not CQL? Les On Oct 21, 2013 4:45 PM, Jon Haddad j...@jonhaddad.com wrote: [...]
Re: Wide rows/composite keys clarification needed
So I just saw a post about how Kundera translates all JPQL to CQL. On Mon, Oct 21, 2013 at 4:45 PM, Jon Haddad j...@jonhaddad.com wrote: [...]
Re: Wide rows in CQL 3
Thanks for explaining, Sylvain. You say that it is not a mandatory one; how long could we expect it to remain non-mandatory? I think the new CQL stuff is great and I will probably use it heavily. I understand the upgrade path, but my question is whether I should start planning for an all-CQL future, or if I could still create some CFs with thrift and expect them to work in three years' time. You say you should see CQL3 non-compact tables as the new stuff, the thing that you use post-upgrade - but doesn't that mean that we also suddenly have to depend on a schema? Let us for example say you have a logger, which logs all kinds of different stuff - typically key-value - and each row could contain different keys. ROWKEY1: key1: val1, key2: val2, key3: val3 ROWKEY2: key4: val4, key1: val2, keyN: valN Is this possible without using multiple rows in CQL3 non-compact tables? .vegard, - Original Message - From: user@cassandra.apache.org To: user@cassandra.apache.org Cc: Sent: Wed, 9 Jan 2013 23:14:25 +0100 Subject: Re: Wide rows in CQL 3 To be clear, CQL3 is meant as an upgrade from thrift. Not a mandatory one; you can stick to thrift if you don't think CQL3 is better. But if you do decide to upgrade, you should see CQL3 non-compact tables as the new stuff, the thing that you use post-upgrade. While you upgrade, stick to compact tables. Once you've upgraded, then you can start using the new stuff, and accessing the new stuff the old way doesn't matter. -- Sylvain
Re: Wide rows in CQL 3
Is this possible without using multiple rows in CQL3 non compact tables? Depending on the number of (log record) keys you *could* do this as a map type in your CQL table: create table log_row ( sequence timestamp PRIMARY KEY, props map<text, text> ); Cheers - Aaron Morton Freelance Cassandra Developer New Zealand @aaronmorton http://www.thelastpickle.com On 11/01/2013, at 1:58 AM, Vegard Berget p...@fantasista.no wrote: [...]
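To flesh out Aaron's suggestion, here is a sketch of writing and updating individual entries in such a map column (table and values are illustrative, not from the thread):

```cql
CREATE TABLE log_row (
  sequence timestamp PRIMARY KEY,
  props map<text, text>
);

-- Whole-map write:
INSERT INTO log_row (sequence, props)
VALUES ('2013-01-11 00:00:00+0000', {'key1': 'val1', 'key2': 'val2'});

-- Add or overwrite a single entry without rewriting the map:
UPDATE log_row SET props['key3'] = 'val3'
WHERE sequence = '2013-01-11 00:00:00+0000';

-- Remove one entry:
DELETE props['key1'] FROM log_row
WHERE sequence = '2013-01-11 00:00:00+0000';
```

Each map entry is stored as its own cell internally, so single-entry updates and deletes do not read-modify-write the whole map.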
Re: Wide rows in CQL 3
Probably should read this: http://www.datastax.com/dev/blog/cql3-for-cassandra-experts I don't see wide row support going away, since they specifically made the change to enable 2 billion columns in a row according to that paper. Dean From: mrevilgnome mrevilgn...@gmail.com Reply-To: user@cassandra.apache.org Date: Wednesday, January 9, 2013 9:51 AM To: user user@cassandra.apache.org Subject: Wide rows in CQL 3 We use the thrift bindings for our current production cluster, so I haven't been tracking the developments regarding CQL3. I just discovered when speaking to another potential DSE customer that wide rows - or rather, columns not defined in the metadata - aren't supported in CQL3. I'm curious to understand the reasoning behind this, whether this is an intentional direction shift away from the big-table paradigm, and what's supposed to happen to those of us who have already bought into C* specifically because of the wide row support. What is our upgrade path?
Re: Wide rows in CQL 3
I'm currently in the process of porting my app from Thrift to CQL3, and it seems to me that the underlying storage layout hasn't really changed fundamentally. The difference appears to be that CQL3 offers a neater abstraction on top of the wide row format. For example, in CQL3 your query results are bound to a specific schema, so you get named columns back - previously you had to process the slices procedurally. The insert path appears to be tighter as well - you don't seem to get away with leaving out key attributes. I'm sure somebody more knowledgeable can explain this better, though. Cheers, Ben On Wed, Jan 9, 2013 at 4:51 PM, mrevilgnome mrevilgn...@gmail.com wrote: [...]
Re: Wide rows in CQL 3
I ask myself this every day. CQL3 is a new way to do things, including wide rows with collections. There is no upgrade path. You adopt CQL3's sparse tables as soon as you start creating column families from CQL. There is not much backwards compatibility. CQL3 can query compact tables, but you may have to remove the metadata from them so they can be transposed. Thrift cannot write into CQL tables easily, because of how the primary keys and column names are encoded into the key column, and compact metadata is not equal to CQL3's metadata. http://www.datastax.com/dev/blog/thrift-to-cql3 For a large swath of problems I like how CQL3 deals with them. For example, you do not really need CQL3 to store a collection in a column family alongside other data. You can use wide rows for this, but the integrated solution with CQL3 metadata is interesting. My biggest beefs are: 1) column names are UTF8 (seems wasteful in most cases) 2) sparse empty row to ghost (seems like tiny rows with one column have much overhead now) 3) using composites (with compound primary keys in some table designs) is wasteful. A composite adds two unsigned bytes for size and one unsigned byte of 0 per part. 4) many lines of code between user/request and actual disk (tracing a CQL select vs. a slice, young gen, etc.) 5) not sure if collections can be used in REALLY wide row scenarios, aka a 1,000,000-entry set? I feel that in an effort to be newbie-friendly, sparse+CQL is presented as the best default option. However, the 5 items above are not minor, and in several use cases they could make CQL's sparse tables a bad choice for certain applications. Those users would get better performance from compact storage. I feel that message sometimes gets washed away in all the CQL coolness. What is that you say? This is not actually the most efficient way to store this data? Well, who cares, I can do an IN CLAUSE! WooHoo!
On Wed, Jan 9, 2013 at 12:10 PM, Ben Hood 0x6e6...@gmail.com wrote: [...]
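Edward's point 3 can be quantified using the framing he describes: two bytes of length prefix plus one end-of-component byte per part, i.e. three bytes of framing per component of a composite cell name (a back-of-the-envelope sketch against a hypothetical table):

```cql
-- Hypothetical time-series table with a compound primary key:
CREATE TABLE sensor_data (
  sensor_id uuid,
  ts timestamp,
  voltage float,
  PRIMARY KEY (sensor_id, ts)
);

-- Internally, the 'voltage' cell name for one row is the composite
-- (ts, 'voltage'). With the encoding above:
--   ts component:        2 (length) + 8 (payload) + 1 (end marker) = 11 bytes
--   'voltage' component: 2 (length) + 7 (payload) + 1 (end marker) = 10 bytes
-- So 21 bytes of cell name carry 15 bytes of payload: 6 bytes of
-- framing, paid on every cell of every row.
```

For short clustering values and terse column names, the framing can be a noticeable fraction of the cell, which is the cost Edward is objecting to.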
Re: Wide rows in CQL 3
There is no upgrade path. I don't think that's true. The goal of the blog post you've linked is to discuss that upgrade path (and in particular to show that, for the most part, you can access your thrift data from CQL3 without any modification whatsoever). You adopt CQL3's sparse tables as soon as you start creating column families from CQL. That's not true: you can create non-sparse tables from CQL3 (using COMPACT STORAGE), and so you can work with both CQL3 and thrift alongside each other for the time it takes you to upgrade from thrift to CQL3. Then, for things that you know you will only access through CQL3 (i.e. when the upgrade is complete), you can start using non-compact tables and enjoy their conveniences (like collections, for instance). There is not much backwards compatibility. CQL3 can query compact tables, but you may have to remove the metadata from them so they can be transposed. I think "not much backwards compatibility" is a tad unfair. The only case where you may have to remove the metadata is if you are using a CF in both a static and a dynamic way. Now, I can't pretend to know what every user is doing, but from my experience and what I've seen, this is not such a common thing, and CFs are either static or dynamic in nature, not both. I do think that for most users, upgrading from thrift to CQL3 won't require any data migration or messing with metadata. But more importantly, things are not completely closed. If you have *concrete* difficulties moving from thrift to CQL3, please do share them on this mailing list and we'll try to help you out. Thrift cannot write into CQL tables easily, because of how the primary keys and column names are encoded into the key column and compact metadata is not equal to cql3's metadata. To be clear, CQL3 is meant as an upgrade from thrift. Not a mandatory one; you can stick to thrift if you don't think CQL3 is better. But if you do decide to upgrade, you should see CQL3 non-compact tables as the new stuff, the thing that you use post-upgrade. While you upgrade, stick to compact tables. Once you've upgraded, then you can start using the new stuff, and accessing the new stuff the old way doesn't matter. My biggest beefs are: 1) column names are UTF8 (seems wasteful in most cases) That's largely not true - the "wasteful in most cases" part, at least. A column name in CQL3 does not always translate to an internal column name. You can still do your time series where the internal column name is an int, and you don't waste space. As for the static cases, yes, CQL3 forces UTF8, but I'm pretty certain that people overwhelmingly use UTF8 or ascii in those cases. And because CQL3 forces you to declare your column names in those static cases, we may actually be able to optimize the size used internally for those in the future, which is harder with thrift - so I think we actually have the potential to make it *less* wasteful in most cases. 2) sparse empty row to ghost (seems like tiny rows with one column have much overhead now) It is true that for non-compact CQL3 tables we've focused on flexibility and on making the behavior predictable, which does add some slight space overhead. However: - that's why compact storage is here. There is zero overhead over thrift if you use compact storage. That's even why we named it that: it's compact. - we know that most of the overhead of non-compact tables can be won back by optimization of the storage engine. That's an advantage of having an API that is not too tied to the underlying storage: it gives room for optimizations. 3) using composites (with compound primary keys in some table designs) is wasteful. A composite adds two unsigned bytes for size and one unsigned byte of 0 per part. See above. 4) many lines of code between user/request and actual disk.
(tracing a CQL select vs. a slice, young gen, etc.) If you are saying the implementation of CQL3 is more lines of code than the thrift part, then you're probably right, but given how much more convenient CQL3 is compared to thrift, I happily take that criticism. But in terms of overhead, provided you use prepared statements (which you should if you care about performance), it remains to be proven that CQL3 has more overhead than thrift. In particular in terms of garbage (since you're citing young gen), while I haven't tested it, I'd be *really* surprised if thrift generated less garbage than CQL3. And in terms of query tracing, there is almost no difference whatsoever between the two. 5) not sure if collections can be used in REALLY wide row scenarios, aka a 1,000,000-entry set? Lists have their downsides (listed in the documentation), but sets and maps have no more limitations than wide rows have in theory. They do currently have the limitation that the API doesn't allow fetching only part of a collection, but that will change. That being said, and possibly more importantly, collections are *not* meant to be very wide. They are *not* meant for wide-row scenarios. CQL3 has wide rows (clustering columns) for those.
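Vegard's schemaless logger from earlier in the thread illustrates Sylvain's last point: rather than a collection, the dynamic key becomes a clustering column in a non-compact table (a sketch; names are illustrative):

```cql
CREATE TABLE logger (
  row_key text,
  prop_name text,
  prop_value text,
  PRIMARY KEY (row_key, prop_name)
);

-- Arbitrary, per-row-distinct keys, all in one partition per row_key:
INSERT INTO logger (row_key, prop_name, prop_value) VALUES ('ROWKEY1', 'key1', 'val1');
INSERT INTO logger (row_key, prop_name, prop_value) VALUES ('ROWKEY1', 'key2', 'val2');
INSERT INTO logger (row_key, prop_name, prop_value) VALUES ('ROWKEY2', 'keyN', 'valN');

-- One storage-engine wide row, read back as multiple CQL rows:
SELECT prop_name, prop_value FROM logger WHERE row_key = 'ROWKEY1';
```

No schema change is needed when new keys appear: each (prop_name, prop_value) pair lands in the same partition as a separate CQL row.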
Re: Wide rows in CQL 3
By "no upgrade path" I mean that if I have a table with compact storage, I cannot upgrade it to sparse storage. If I have an existing COMPACT table and I want to add a Map to it, I cannot. This is what I mean by no upgrade path. Column families that mix static and dynamic columns are pretty common. In fact it is pretty much the default case: you have a default validator, then some columns have specific validators. In the old days people used to say "You only need one column family" - you would subdivide your row key into parts: username=username, password=password, friend-friene = friends, pet-pets = pets. It's very efficient and very easy if you understand what a slice is. Is everyone else just adding a column family every time they have new data? :) Sounds very un-no-sql-like. Most people are probably going to store column names as tersely as possible. You're not going to store password as a multibyte UTF8('password'); you store it as ascii('password') (or really ascii('pw')). Also, for the rest of my comment, I meant that the comparator of any sparse table always seems to be a COMPOSITE, even if it has only one part (last I checked). Everything is -COMPOSITE(UTF-8(colname))- at minimum, whereas in a compact table it is just -colname-. My overarching point is that the 5 things I listed do have a cost, and the user by default gets sparse storage unless they are smart enough to know they do not want it. This is naturally going to force people away from compact storage. Basically, for any column family there are two possible decision paths: 1) use compact, 2) use sparse. Other than ease of use, why would I choose sparse? Why should it be the default? On Wed, Jan 9, 2013 at 5:14 PM, Sylvain Lebresne sylv...@datastax.com wrote: Now I can't pretend knowing what every user is doing, but from my experience and what I've seen, this is not such a common thing and CF are either static or dynamic in nature, not both.
Re: Wide rows in CQL 3
Also, I have to say I do not get that blank sparse column. Ghost ranges are a little weird, but they don't bother me. 1) It's a row of nothing - the definition of waste. 2) Suppose I have 1 billion rows and my distribution is mostly rows of 1 or 2 columns. My database is now significantly bigger. That stinks. 3) Suppose I write columns frequently. Do I have to constantly keep writing this sparse empty row? It seems like I would. Worst case, each sstable with a write to a row key also has this sparse column, meaning multiple blank, empty, wasteful columns on disk to solve ghosts - which do not bother me anyway. 4) Are these sparse columns also taking memtable space? These questions would give me serious pause about using sparse tables. On Wednesday, January 9, 2013, Edward Capriolo edlinuxg...@gmail.com wrote: [...]
Re: Wide rows in CQL 3
On 10 Jan 2013, at 01:30, Edward Capriolo edlinuxg...@gmail.com wrote: Column families that mix static and dynamic columns are pretty common. In fact it is pretty much the default case: you have a default validator, then some columns have specific validators. In the old days people used to say "You only need one column family" - you would subdivide your row key into parts: username=username, password=password, friend-friene = friends, pet-pets = pets. It's very efficient and very easy if you understand what a slice is. Is everyone else just adding a column family every time they have new data? :) Sounds very un-no-sql-like. Well, we for sure are heavily mixing static and dynamic columns; it's quite useful, really. Which is why upgrading to CQL3 isn't really something I've considered seriously at any point. Most people are probably going to store column names as tersely as possible. You're not going to store password as a multibyte UTF8('password'); you store it as ascii('password') (or really ascii('pw')). UTF8('password') === ascii('password'), actually - as long as you're within the ascii range, UTF8 and ascii are equal byte for byte. It's not until code point 128 that you start getting multibyte sequences. Having said that, doesn't sparse storage lend itself really well to further column-name optimisation - like using a single byte to denote the column name and then having a lookup table? The server could do a lot of nice tricks in this area, when afforded so by a tighter schema. Also, I think that compression pretty much does this already - the effect is the same even if the mechanism is different. /Janne
Re: Wide rows and reads
From what I understand, wide rows have quite a bit of overhead, especially if you are picking columns that are far apart from each other in a given row. This post by Aaron Morton is quite good at explaining the issue: http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/ -Phil On Thu, Jul 5, 2012 at 12:17 PM, Oleg Dulin oleg.du...@gmail.com wrote: Here is my flow: One process writes a really wide row (250K+ supercolumns, each one with 5 subcolumns, for a total of 1K or so per supercolumn). A second process comes in literally 2-3 seconds later and starts reading from it. My observation is that nothing good happens - it is ridiculously slow to read. It seems that if I wait long enough, reads from that row become much faster. Could someone enlighten me as to what exactly happens when I do this? Regards, Oleg
Re: Wide rows or tons of rows?
2010/10/11 Héctor Izquierdo Seliva izquie...@strands.com: Hi everyone. I'm sure this question or a similar one has come up before, but I can't find a clear answer. I have to store an unknown number of items in Cassandra, which can vary from a few hundred to a few million per customer. I read that in Cassandra wide rows are better than a lot of rows, but then I face two problems. First, column distribution. The only way I can think of to distribute items among a given set of rows is hashing the item id to a row id, then using the item id as the column name. In this way I can distribute data among a few rows evenly, but if there are only a few items it's equivalent to a row per item plus more overhead, and if there are millions of items then the rows are too big, and I have to turn off the row cache. Does anybody know a way around this? The second issue is that in my benchmarks, once the data is mmapped, one item per row performs faster than wide rows by a significant margin. Is this how it is supposed to be? I can give additional data if needed. English is not my first language, so I apologize beforehand if some of this doesn't make sense. Thanks for your time. If you have wide rows, RowCache is a problem. IMHO RowCache is only viable in situations where you have a fixed amount of data and thus will get a high hit rate. I was running a large row cache for some time and I found it unpredictable. It causes memory pressure on the JVM from moving things in and out of memory, and if the hit rate is low, taking a key and all its columns in and out repeatedly ends up being counterproductive for disk utilization. Suggest KeyCache in most situations (there is a ticket open for a fractional row cache). Another factor to consider is that if you have many rows and many columns, you end up with large(r) indexes. In our case we have startup times slightly longer than we would like, because the process of sampling indexes during startup is intensive.
If I could do it all over again, I might serialize more into single columns rather than exploding data across multiple rows and columns. If you always need to look up the entire row, do not break it down by columns. Memory mapping: there are different dynamics depending on data size relative to memory size. You may have something like ~40GB of data and a 10GB index with 32GB RAM per node; this system is not going to respond the same way as one with 200GB of data and 25GB of indexes. Also, it is very workload dependent. Hope this helps, Edward
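Héctor's hash-to-bucket idea can be written down as a schema. In today's CQL terms it would be a composite partition key, with the bucket computed client-side (a sketch; the names and the bucket count are made up for illustration):

```cql
-- NUM_BUCKETS is chosen up front, e.g. 16; the client computes
-- bucket = hash(item_id) % NUM_BUCKETS on every read and write.
CREATE TABLE customer_items (
  customer_id text,
  bucket int,
  item_id text,
  payload text,
  PRIMARY KEY ((customer_id, bucket), item_id)
);

-- Point lookup: hash the item id, then hit exactly one partition.
SELECT payload FROM customer_items
WHERE customer_id = 'c42' AND bucket = 7 AND item_id = 'item-123';
```

The trade-off Héctor describes still applies: with few items per customer this adds overhead over one row per item, and scanning all of a customer's items now takes NUM_BUCKETS queries.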
Re: Wide rows or tons of rows?
On Mon, 11-10-2010 at 11:08 -0400, Edward Capriolo wrote (inlined):

> 2010/10/11 Héctor Izquierdo Seliva izquie...@strands.com:
>> Hi everyone. I'm sure this question or something similar has come up before, but I can't find a clear answer. I have to store an unknown number of items in Cassandra, which can vary from a few hundred to a few million per customer. I have read that in Cassandra wide rows are better than a lot of rows, but then I face two problems.
>>
>> First, column distribution. The only way I can think of to distribute items among a given set of rows is hashing the item id to a row id, and then using the item id as the column name. This way I can distribute data among a few rows evenly, but if there are only a few items it's equivalent to a row per item plus more overhead, and if there are millions of items then the rows are too big and I have to turn off the row cache. Does anybody know a way around this?
>>
>> The second issue is that in my benchmarks, once the data is mmapped, one item per row performs faster than wide rows by a significant margin. Is this how it is supposed to be? I can give additional data if needed. English is not my first language, so I apologize beforehand if some of this doesn't make sense. Thanks for your time.
>
> If you have wide rows, RowCache is a problem. IMHO RowCache is only viable in situations where you have a fixed amount of data and thus will get a high hit rate. I was running a large row cache for some time and I found it unpredictable. It causes memory pressure on the JVM from moving things in and out of memory, and if the hit rate is low, taking a key and all its columns in and out repeatedly ends up being counterproductive for disk utilization. Suggest KeyCache in most situations (there is a ticket open for a fractional row cache).

I saw the same behavior. It's a pity there is no column cache. That would be awesome.

> Another factor to consider is that if you have many rows and many columns, you end up with large(r) indexes. In our case we have start-up times slightly longer than we would like, because the process of sampling indexes during start-up is intensive. If I could do it all over again, I might serialize more into single columns rather than exploding data across multiple rows and columns. If you always need to look up the entire row, do not break it down by columns.

So it might be better to store a JSON-serialized version then? I was using SuperColumns to store item info, but a simple string might give me the option to do some compression.

> Memory mapping: there are different dynamics depending on data size relative to memory size. You may have something like ~40GB of data and a 10GB index with 32GB RAM per node; this system is not going to respond the same way with, say, 200GB of data and 25GB of indexes. Also, it is very workload dependent.

We have a 6-node cluster with 16GB RAM each, although the whole dataset is expected to be around 100GB per machine. Which indexes are more expensive, row or column indexes?

> Hope this helps,
> Edward

It does!
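The hashing scheme Héctor describes (hash the item id to pick a row, use the item id itself as the column name) can be sketched roughly as below. The bucket count, the key format, and the function name are assumptions for illustration, not anything from the thread:

```python
import hashlib

NUM_BUCKETS = 16  # assumed fixed bucket count; tune to the expected row width


def bucket_row_key(customer_id: str, item_id: str) -> str:
    """Map an item to one of NUM_BUCKETS rows for its customer.

    A stable hash (rather than Python's per-process hash()) keeps the
    item-to-row mapping consistent across clients and restarts.
    """
    h = int(hashlib.md5(item_id.encode("utf-8")).hexdigest(), 16)
    return f"{customer_id}:{h % NUM_BUCKETS}"


# The item id is then used as the column name within the chosen row.
row_key = bucket_row_key("customer42", "item-0001")
```

This makes the trade-off Héctor raises concrete: with only a few items per customer, the buckets add overhead for nothing; with millions, each bucket is still (items / NUM_BUCKETS) wide, so the constant has to be chosen against the expected distribution.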
Re: Wide rows or tons of rows?
Thanks for this reply. I'm wondering about the same issue... Should I bucket things into wide rows (say 10M), or narrow (say 10K or 100K)? Of course it depends on my access patterns, right...

Does anyone know if a partial row cache is a feasible feature to implement? My use case is something like: I have rows with 10MB / 100K columns of data. I _typically_ slice from oldest to newest on the row, and _typically_ only need the first 100 columns / 10KB, etc... If someone went to implement a cache strategy to support this, would they find it feasible, or difficult/impossible because of some limitation xyz?

-JD

On Mon, Oct 11, 2010 at 8:08 AM, Edward Capriolo edlinuxg...@gmail.com wrote:
> If you have wide rows, RowCache is a problem. [...]
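A partial row cache of the kind JD asks about could, in principle, look something like the sketch below: an LRU map that retains only the head (oldest-first) slice of each row, and answers a read only when the requested slice fits entirely within the cached head. This is a hypothetical illustration of the idea, not an existing Cassandra feature; all names and limits are invented:

```python
from collections import OrderedDict


class HeadSliceCache:
    """Hypothetical partial row cache: keep only the first max_cols
    columns of each row, evicting least-recently-used rows."""

    def __init__(self, max_rows=1000, max_cols=100):
        self.max_rows = max_rows
        self.max_cols = max_cols
        self._rows = OrderedDict()  # row_key -> head slice of columns

    def put(self, row_key, columns):
        # Store only the head of the slice (oldest-to-newest order assumed).
        self._rows[row_key] = columns[: self.max_cols]
        self._rows.move_to_end(row_key)
        if len(self._rows) > self.max_rows:
            self._rows.popitem(last=False)  # evict the LRU row

    def get(self, row_key, n):
        # Serve from cache only if the request fits in the cached head.
        cols = self._rows.get(row_key)
        if cols is not None and n <= len(cols):
            self._rows.move_to_end(row_key)
            return cols[:n]
        return None  # miss: caller falls through to a real read
```

The feasibility question in the thread is essentially whether the server can know that a slice request is confined to the cached head; the sketch dodges that by simply declaring a miss whenever the request extends past it.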
Re: Wide rows or tons of rows?
No idea about a partial row cache, but I would start with fat rows in your use case. If you find that performance is really a problem, then you could add a second "recent / oldest" CF that you maintain with the most recent entries, and use the row cache there. OR add more nodes.

Aaron

On 12 Oct 2010, at 10:08 AM, Jeremy Davis jerdavis.cassan...@gmail.com wrote:
> Thanks for this reply. I'm wondering about the same issue... [...]
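Aaron's suggestion of keeping a small "recent" CF alongside the fat row amounts to a dual write on insert. A minimal sketch, with plain dicts standing in for the two column families and an assumed cap on the recent row (the names and limit are illustrative only):

```python
from collections import deque

RECENT_LIMIT = 100  # assumed cap on the hot "recent" row

fat_rows = {}     # stands in for the full wide-row CF
recent_rows = {}  # stands in for the small, row-cache-friendly "recent" CF


def insert(row_key, col_name, value):
    # Every write lands in the full history kept in the wide row...
    fat_rows.setdefault(row_key, {})[col_name] = value
    # ...and is mirrored into a bounded "recent" row; appending past the
    # cap silently drops the oldest entry, mimicking a trimmed/TTL'd row.
    recent = recent_rows.setdefault(row_key, deque(maxlen=RECENT_LIMIT))
    recent.append((col_name, value))
```

Reads that only want the newest entries hit the small `recent_rows` side, which stays narrow enough for the row cache to be effective, while the fat row serves the occasional full scan.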