Re: Wide rows splitting

2017-09-18 Thread Stefano Ortolani
You might find this interesting:
https://medium.com/@foundev/synthetic-sharding-in-cassandra-to-deal-with-large-partitions-2124b2fd788b

Cheers,
Stefano

On Mon, Sep 18, 2017 at 5:07 AM, Adam Smith  wrote:

> Dear community,
>
> I have a table with inlinks to URLs, i.e. many URLs point to
> http://google.com, while far fewer point to http://somesmallweb.page.
>
> It has both very wide and very skinny rows - the distribution follows a
> power law. I do not know a priori how many columns a row has. Also, I can't
> identify a schema that would give me a good partitioning.
>
> Currently, I am thinking about introducing splits: the pk would be (URL,
> splitnumber), where the number of splits is initially 1 and hash(URL) mod
> the number of splits determines the splitnumber on insert. I would need a
> separate table to maintain the number of splits per URL, and a
> spark-cassandra-connector job would count the columns and increase/double
> the number of splits on demand. This means I would then have to move e.g.
> (URL1,0) -> (URL1,1) when the number of splits becomes 2.
>
> Would you do the same? Is there a better way?
>
> Thanks!
> Adam
>
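
A minimal sketch of the bucketed ("synthetic sharding") schema being discussed
above, in the spirit of the linked article. The table, column names and bucket
count are illustrative, not taken from the thread:

-- Spread one logical wide row (a popular URL) over several partitions by
-- folding a bucket number into the partition key.
CREATE TABLE inlinks (
    url        text,
    bucket     int,       -- computed client-side, e.g. hash(source_url) % bucket_count
    source_url text,
    PRIMARY KEY ((url, bucket), source_url)
);

-- Reading all inlinks of a URL then means querying every bucket, e.g. with
-- a bucket count of 4:
SELECT source_url FROM inlinks
  WHERE url = 'http://google.com' AND bucket IN (0, 1, 2, 3);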


Re: wide rows

2016-10-18 Thread Yabin Meng
With CQL data modeling, everything is called a "row". But really, in CQL a
row is just a logical concept. So if you think of "wide partition" instead
of "wide row" (the partition is what is determined by the hash of the
partition key), it will help the understanding a bit: one wide partition
may contain multiple logical CQL rows - each CQL row maps onto a slice of
the partition's actual storage columns.

Time-series data is usually a good fit for "wide-partition" data modeling,
but please remember not to go too crazy with it.

Cheers,

Yabin

On Tue, Oct 18, 2016 at 11:23 AM, DuyHai Doan  wrote:

> // user table: skinny partition
> CREATE TABLE user (
> user_id uuid,
> firstname text,
> lastname text,
> 
> PRIMARY KEY ((user_id))
> );
>
> // sensor_data table: wide partition
> CREATE TABLE sensor_data (
>  sensor_id uuid,
>  date timestamp,
>  value double,
>  PRIMARY KEY ((sensor_id),  date)
> );
>
> On Tue, Oct 18, 2016 at 5:07 PM, S Ahmed  wrote:
>
>> Hi,
>>
>> Can someone clarify how you would model a "wide" row cassandra table?
>> From what I understand, a wide row table is where you keep appending
>> columns to a given row.
>>
>> The other way to model a table would be the "regular" style where each
>> row contains the data, so during a SELECT you would want multiple rows,
>> as opposed to a wide row where you would get a single row, but a subset of
>> columns.
>>
>> Can someone show a simple data model that compares both styles?
>>
>> Thanks.
>>
>
>
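
To make Yabin's point concrete, a sketch against the sensor_data and user
tables quoted above (the uuid and values are invented):

-- Three CQL rows, all stored inside the same ("wide") partition:
INSERT INTO sensor_data (sensor_id, date, value)
  VALUES (11111111-2222-3333-4444-555555555555, '2016-10-18 10:00:00', 21.5);
INSERT INTO sensor_data (sensor_id, date, value)
  VALUES (11111111-2222-3333-4444-555555555555, '2016-10-18 10:01:00', 21.7);
INSERT INTO sensor_data (sensor_id, date, value)
  VALUES (11111111-2222-3333-4444-555555555555, '2016-10-18 10:02:00', 21.6);

-- One partition read returns all of them, ordered by the clustering column:
SELECT date, value FROM sensor_data
  WHERE sensor_id = 11111111-2222-3333-4444-555555555555;

-- The user table, by contrast, holds exactly one CQL row per partition
-- (a skinny partition), so the equivalent lookup returns a single row:
SELECT firstname, lastname FROM user
  WHERE user_id = 11111111-2222-3333-4444-555555555555;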


Re: wide rows

2016-10-18 Thread DuyHai Doan
// user table: skinny partition
CREATE TABLE user (
user_id uuid,
firstname text,
lastname text,

PRIMARY KEY ((user_id))
);

// sensor_data table: wide partition
CREATE TABLE sensor_data (
 sensor_id uuid,
 date timestamp,
 value double,
 PRIMARY KEY ((sensor_id),  date)
);

On Tue, Oct 18, 2016 at 5:07 PM, S Ahmed  wrote:

> Hi,
>
> Can someone clarify how you would model a "wide" row cassandra table?
> From what I understand, a wide row table is where you keep appending
> columns to a given row.
>
> The other way to model a table would be the "regular" style where each row
> contains the data, so during a SELECT you would want multiple rows, as
> opposed to a wide row where you would get a single row, but a subset of
> columns.
>
> Can someone show a simple data model that compares both styles?
>
> Thanks.
>


RE: wide rows

2016-10-18 Thread S Ahmed
Hi,

Can someone clarify how you would model a "wide" row cassandra table?  From
what I understand, a wide row table is where you keep appending columns to
a given row.

The other way to model a table would be the "regular" style where each row
contains the data, so during a SELECT you would want multiple rows, as
opposed to a wide row where you would get a single row, but a subset of
columns.

Can someone show a simple data model that compares both styles?

Thanks.


Re: Wide rows best practices and GC impact

2014-12-04 Thread Jabbar Azam
Hello,

I saw this earlier yesterday but didn't want to reply because I didn't know
what the cause was.

Basically I was using wide rows with Cassandra 1.x and was inserting data
constantly. After about 18 hours the JVM would crash with a dump file. For
some reason I removed the compaction throttling and the problem
disappeared. I've never really found out what the root cause was.


On Thu Dec 04 2014 at 2:49:57 AM Gianluca Borello gianl...@draios.com
wrote:

 Thanks Robert, I really appreciate your help!

 I'm still unsure why Cassandra 2.1 seems to perform much better in that
 same scenario (even setting the same values of compaction threshold and
 number of compactors), but I guess we'll revisit this when we decide to
 upgrade to 2.1 in production.

 On Dec 3, 2014 6:33 PM, Robert Coli rc...@eventbrite.com wrote:
 
  On Tue, Dec 2, 2014 at 5:01 PM, Gianluca Borello gianl...@draios.com
 wrote:
 
  We mainly store time series-like data, where each data point is a
 binary blob of 5-20KB. We use wide rows, and try to put in the same row all
 the data that we usually need in a single query (but not more than that).
 As a result, our application logic is very simple (since we have to do just
 one query to read the data on average) and read/write response times are
 very satisfactory. This is a cfhistograms and a cfstats of our heaviest CF:
 
 
  100mb is not HYOOOGE but is around the size where large rows can cause
 heap pressure.
 
  You seem to be unclear on the implications of pending compactions,
 however.
 
  Briefly, pending compactions indicate that you have more SSTables than
 you should. As compaction both merges row versions and reduces the number
 of SSTables, a high number of pending compactions causes problems
 associated with both having too many row versions (fragmentation) and a
 large number of SSTables (per-SSTable heap/memory (depending on version)
 overhead like bloom filters and index samples). In your case, it seems the
 problem is probably just the compaction throttle being too low.
 
  My conjecture is that, given your normal data size and read/write
 workload, you are relatively close to GC pre-fail when compaction is
 working. When it stops working, you relatively quickly get into a state
 where you exhaust heap because you have too many SSTables.
 
  =Rob
  http://twitter.com/rcolidba
  PS - Given 30GB of RAM on the machine, you could consider investigating
 large-heap configurations, rbranson from Instagram has some slides out
 there on the topic. What you pay is longer stop the world GCs, IOW latency
 if you happen to be talking to a replica node when it pauses.
 



Re: Wide rows best practices and GC impact

2014-12-03 Thread Robert Coli
On Tue, Dec 2, 2014 at 5:01 PM, Gianluca Borello gianl...@draios.com
wrote:

 We mainly store time series-like data, where each data point is a binary
 blob of 5-20KB. We use wide rows, and try to put in the same row all the
 data that we usually need in a single query (but not more than that). As a
 result, our application logic is very simple (since we have to do just one
 query to read the data on average) and read/write response times are very
 satisfactory. This is a cfhistograms and a cfstats of our heaviest CF:


100mb is not HYOOOGE but is around the size where large rows can cause heap
pressure.

You seem to be unclear on the implications of pending compactions, however.

Briefly, pending compactions indicate that you have more SSTables than you
should. As compaction both merges row versions and reduces the number of
SSTables, a high number of pending compactions causes problems associated
with both having too many row versions (fragmentation) and a large number
of SSTables (per-SSTable heap/memory (depending on version) overhead like
bloom filters and index samples). In your case, it seems the problem is
probably just the compaction throttle being too low.

My conjecture is that, given your normal data size and read/write workload,
you are relatively close to GC pre-fail when compaction is working. When
it stops working, you relatively quickly get into a state where you exhaust
heap because you have too many SSTables.

=Rob
http://twitter.com/rcolidba
PS - Given 30GB of RAM on the machine, you could consider investigating
large-heap configurations, rbranson from Instagram has some slides out
there on the topic. What you pay is longer stop the world GCs, IOW latency
if you happen to be talking to a replica node when it pauses.
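
For reference, the compaction throttle Rob mentions is the per-node throughput
cap set in cassandra.yaml; it can also be changed on a live node with nodetool.
A sketch only - the value shown is illustrative, not a recommendation:

# cassandra.yaml: shared throughput cap for all compactions (0 disables throttling)
compaction_throughput_mb_per_sec: 64

# change it at runtime without a restart, and watch the backlog:
nodetool setcompactionthroughput 64
nodetool compactionstats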


Re: Wide rows best practices and GC impact

2014-12-03 Thread Gianluca Borello
Thanks Robert, I really appreciate your help!

I'm still unsure why Cassandra 2.1 seems to perform much better in that same
scenario (even setting the same values of compaction threshold and number
of compactors), but I guess we'll revisit this when we decide to upgrade to
2.1 in production.

On Dec 3, 2014 6:33 PM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Dec 2, 2014 at 5:01 PM, Gianluca Borello gianl...@draios.com
wrote:

 We mainly store time series-like data, where each data point is a binary
blob of 5-20KB. We use wide rows, and try to put in the same row all the
data that we usually need in a single query (but not more than that). As a
result, our application logic is very simple (since we have to do just one
query to read the data on average) and read/write response times are very
satisfactory. This is a cfhistograms and a cfstats of our heaviest CF:


 100mb is not HYOOOGE but is around the size where large rows can cause
heap pressure.

 You seem to be unclear on the implications of pending compactions,
however.

 Briefly, pending compactions indicate that you have more SSTables than
you should. As compaction both merges row versions and reduces the number
of SSTables, a high number of pending compactions causes problems
associated with both having too many row versions (fragmentation) and a
large number of SSTables (per-SSTable heap/memory (depending on version)
overhead like bloom filters and index samples). In your case, it seems the
problem is probably just the compaction throttle being too low.

 My conjecture is that, given your normal data size and read/write
workload, you are relatively close to GC pre-fail when compaction is
working. When it stops working, you relatively quickly get into a state
where you exhaust heap because you have too many SSTables.

 =Rob
 http://twitter.com/rcolidba
 PS - Given 30GB of RAM on the machine, you could consider investigating
large-heap configurations, rbranson from Instagram has some slides out
there on the topic. What you pay is longer stop the world GCs, IOW latency
if you happen to be talking to a replica node when it pauses.



Re: Wide Rows - Data Model Design

2014-09-19 Thread Jonathan Lacefield
Hello,

  Yes, this is a wide row table design.  The first col is your Partition
Key.  The remaining 2 cols are clustering cols.  You will receive ordered
result sets based on (client_name, record_data) when running that query.

Jonathan


Jonathan Lacefield

Solution Architect | (404) 822 3487 | jlacefi...@datastax.com


On Fri, Sep 19, 2014 at 10:41 AM, Check Peck comptechge...@gmail.com
wrote:

 I am trying to use wide rows concept in my data modelling design for
 Cassandra. We are using Cassandra 2.0.6.

 CREATE TABLE test_data (
   test_id int,
   client_name text,
   record_data text,
   creation_date timestamp,
   last_modified_date timestamp,
   PRIMARY KEY (test_id, client_name, record_data)
 )

 So I came up with the above table design. Does my above table fall under the
 category of wide rows in Cassandra or not?

 And is there any problem if I have three columns in my PRIMARY KEY? I
 guess the PARTITION KEY will be test_id, right? And what about the other two?

 In this table, we can have multiple record_data for same client_name.

 Query Pattern will be -

 select client_name, record_data from test_data where test_id = 1;



Re: Wide Rows - Data Model Design

2014-09-19 Thread DuyHai Doan
Does my above table fall under the category of wide rows in Cassandra or
not? -- It depends on the cardinality. For each distinct test_id, how
many combinations of client_name/record_data do you have?

 By the way, why do you put record_data as part of the primary key?

In your table, partition key = test_id, client_name = first clustering
column, record_data = second clustering column


On Fri, Sep 19, 2014 at 5:41 PM, Check Peck comptechge...@gmail.com wrote:

 I am trying to use wide rows concept in my data modelling design for
 Cassandra. We are using Cassandra 2.0.6.

 CREATE TABLE test_data (
   test_id int,
   client_name text,
   record_data text,
   creation_date timestamp,
   last_modified_date timestamp,
   PRIMARY KEY (test_id, client_name, record_data)
 )

 So I came up with the above table design. Does my above table fall under the
 category of wide rows in Cassandra or not?

 And is there any problem if I have three columns in my PRIMARY KEY? I
 guess the PARTITION KEY will be test_id, right? And what about the other two?

 In this table, we can have multiple record_data for same client_name.

 Query Pattern will be -

 select client_name, record_data from test_data where test_id = 1;



Re: Wide Rows - Data Model Design

2014-09-19 Thread Check Peck
@DuyHai - I have put that because of this condition -

In this table, we can have multiple record_data for same client_name.

It can be multiple combinations of client_name and record_data for each
distinct test_id.


On Fri, Sep 19, 2014 at 8:48 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Does my above table fall under the category of wide rows in Cassandra
 or not? -- It depends on the cardinality. For each distinct test_id, how
 many combinations of client_name/record_data do you have?

  By the way, why do you put record_data as part of the primary key?

 In your table, partition key = test_id, client_name = first clustering
 column, record_data = second clustering column


 On Fri, Sep 19, 2014 at 5:41 PM, Check Peck comptechge...@gmail.com
 wrote:

 I am trying to use wide rows concept in my data modelling design for
 Cassandra. We are using Cassandra 2.0.6.

 CREATE TABLE test_data (
   test_id int,
   client_name text,
   record_data text,
   creation_date timestamp,
   last_modified_date timestamp,
   PRIMARY KEY (test_id, client_name, record_data)
 )

 So I came up with the above table design. Does my above table fall under the
 category of wide rows in Cassandra or not?

 And is there any problem if I have three columns in my PRIMARY KEY? I
 guess the PARTITION KEY will be test_id, right? And what about the other two?

 In this table, we can have multiple record_data for same client_name.

 Query Pattern will be -

 select client_name, record_data from test_data where test_id = 1;





Re: Wide Rows - Data Model Design

2014-09-19 Thread DuyHai Doan
Ahh yes, sorry, I read too fast, missed it.

On Fri, Sep 19, 2014 at 5:54 PM, Check Peck comptechge...@gmail.com wrote:

 @DuyHai - I have put that because of this condition -

 In this table, we can have multiple record_data for same client_name.

 It can be multiple combinations of client_name and record_data for each
 distinct test_id.


 On Fri, Sep 19, 2014 at 8:48 AM, DuyHai Doan doanduy...@gmail.com wrote:

 Does my above table fall under the category of wide rows in Cassandra
 or not? -- It depends on the cardinality. For each distinct test_id, how
 many combinations of client_name/record_data do you have?

  By the way, why do you put record_data as part of the primary key?

 In your table, partition key = test_id, client_name = first clustering
 column, record_data = second clustering column


 On Fri, Sep 19, 2014 at 5:41 PM, Check Peck comptechge...@gmail.com
 wrote:

 I am trying to use wide rows concept in my data modelling design for
 Cassandra. We are using Cassandra 2.0.6.

 CREATE TABLE test_data (
   test_id int,
   client_name text,
   record_data text,
   creation_date timestamp,
   last_modified_date timestamp,
   PRIMARY KEY (test_id, client_name, record_data)
 )

 So I came up with the above table design. Does my above table fall under
 the category of wide rows in Cassandra or not?

 And is there any problem if I have three columns in my PRIMARY KEY? I
 guess the PARTITION KEY will be test_id, right? And what about the other two?

 In this table, we can have multiple record_data for same client_name.

 Query Pattern will be -

 select client_name, record_data from test_data where test_id = 1;






Re: Wide rows (time series data) and ORM

2013-10-23 Thread Vivek Mishra
Can Kundera work with wide rows in an ORM manner?

What specifically are you looking for? A composite column based implementation
can be built using Kundera.
With recent CQL3 developments, Kundera supports most of these. I think the
POJO needs to be aware of the number of fields that need to be persisted
(same as CQL3).

-Vivek


On Wed, Oct 23, 2013 at 12:48 AM, Les Hartzman lhartz...@gmail.com wrote:

 As I'm becoming more familiar with Cassandra I'm still trying to shift my
 thinking from relational to NoSQL.

 Can Kundera work with wide rows in an ORM manner? In other words, can you
 actually design a POJO that fits the standard recipe for JPA usage? Would
 the queries return collections of the POJO to handle wide row data?

 I had considered using Spring and JPA for Cassandra, but it appears that
 other than basic configuration issues for Cassandra, to use Spring and JPA
 on a Cassandra database seems like an effort in futility if Cassandra is
 used as a NoSQL database instead of mimicking an RDBMS solution.

 If anyone can shed any light on this, I'd appreciate it.

 Thanks.

 Les




Re: Wide rows (time series data) and ORM

2013-10-23 Thread Hiller, Dean
PlayOrm supports different types of wide rows like embedded list in the object, 
etc. etc.  There is a list of nosql patterns mixed with playorm patterns on 
this page

http://buffalosw.com/wiki/patterns-page/

From: Les Hartzman lhartz...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, October 22, 2013 1:18 PM
To: user@cassandra.apache.org
Subject: Wide rows (time series data) and ORM

As I'm becoming more familiar with Cassandra I'm still trying to shift my 
thinking from relational to NoSQL.

Can Kundera work with wide rows in an ORM manner? In other words, can you 
actually design a POJO that fits the standard recipe for JPA usage? Would the 
queries return collections of the POJO to handle wide row data?

I had considered using Spring and JPA for Cassandra, but it appears that other 
than basic configuration issues for Cassandra, to use Spring and JPA on a 
Cassandra database seems like an effort in futility if Cassandra is used as a 
NoSQL database instead of mimicking an RDBMS solution.

If anyone can shed any light on this, I'd appreciate it.

Thanks.

Les



Re: Wide rows (time series data) and ORM

2013-10-23 Thread Les Hartzman
Hi Vivek,

What I'm looking for are a couple of things as I'm gaining an understanding
of Cassandra. With wide rows and time series data, how do you (or can you)
handle this data in an ORM manner? Now I understand that with CQL3, doing a
select * from time_series_data will return the data as multiple rows. So
does handling this data equal the way you would deal with any mapping of
objects to results in a relational manner? Would you still use a JPA
approach or is there a Cassandra/CQL3-specific way of interacting with the
database?

I expect to use a compound key for partitioning/clustering. For example I'm
planning on creating a table as follows:
  CREATE TABLE sensor_data (
    sensor_id        text,
    date             text,
    data_time_stamp  timestamp,
    reading          int,
    PRIMARY KEY ((sensor_id, date), data_time_stamp)
  );
The 'date' field will be day-specific so that for each day there will be a
new row created.

So will I be able to define a POJO, SensorData, with the fields show above
and basically process each 'row' returned by CQL as another SensorData
object?

Thanks.

Les



On Wed, Oct 23, 2013 at 1:22 AM, Vivek Mishra mishra.v...@gmail.com wrote:

 Can Kundera work with wide rows in an ORM manner?

 What specifically you looking for? Composite column based implementation
 can be built using Kundera.
 With Recent CQL3 developments, Kundera supports most of these. I think
 POJO needs to be aware of number of fields needs to be persisted(Same as
 CQL3)

 -Vivek


 On Wed, Oct 23, 2013 at 12:48 AM, Les Hartzman lhartz...@gmail.com wrote:

 As I'm becoming more familiar with Cassandra I'm still trying to shift my
 thinking from relational to NoSQL.

 Can Kundera work with wide rows in an ORM manner? In other words, can you
 actually design a POJO that fits the standard recipe for JPA usage? Would
 the queries return collections of the POJO to handle wide row data?

 I had considered using Spring and JPA for Cassandra, but it appears that
 other than basic configuration issues for Cassandra, to use Spring and JPA
 on a Cassandra database seems like an effort in futility if Cassandra is
 used as a NoSQL database instead of mimicking an RDBMS solution.

 If anyone can shed any light on this, I'd appreciate it.

 Thanks.

 Les
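
A sketch of the write/read pattern that the per-day compound key above gives
(one partition per sensor per day; the literal values are made up):

INSERT INTO sensor_data (sensor_id, date, data_time_stamp, reading)
  VALUES ('sensor-42', '2013-10-23', '2013-10-23 10:15:00', 117);

-- Fetch one sensor's readings for one day, optionally narrowed to a time
-- window, ordered by data_time_stamp:
SELECT data_time_stamp, reading FROM sensor_data
  WHERE sensor_id = 'sensor-42' AND date = '2013-10-23'
    AND data_time_stamp >= '2013-10-23 09:00:00'
    AND data_time_stamp <  '2013-10-23 12:00:00';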





Re: Wide rows (time series data) and ORM

2013-10-23 Thread Les Hartzman
Thanks Dean. I'll check that page out.

Les


On Wed, Oct 23, 2013 at 7:52 AM, Hiller, Dean dean.hil...@nrel.gov wrote:

 PlayOrm supports different types of wide rows like embedded list in the
 object, etc. etc.  There is a list of nosql patterns mixed with playorm
 patterns on this page

 http://buffalosw.com/wiki/patterns-page/

 From: Les Hartzman lhartz...@gmail.com
 Reply-To: user@cassandra.apache.org
 Date: Tuesday, October 22, 2013 1:18 PM
 To: user@cassandra.apache.org
 Subject: Wide rows (time series data) and ORM

 As I'm becoming more familiar with Cassandra I'm still trying to shift my
 thinking from relational to NoSQL.

 Can Kundera work with wide rows in an ORM manner? In other words, can you
 actually design a POJO that fits the standard recipe for JPA usage? Would
 the queries return collections of the POJO to handle wide row data?

 I had considered using Spring and JPA for Cassandra, but it appears that
 other than basic configuration issues for Cassandra, to use Spring and JPA
 on a Cassandra database seems like an effort in futility if Cassandra is
 used as a NoSQL database instead of mimicking an RDBMS solution.

 If anyone can shed any light on this, I'd appreciate it.

 Thanks.

 Les




Re: Wide rows (time series data) and ORM

2013-10-23 Thread Hiller, Dean
Another idea is the open source Energy Databus project, which does time series
data and is actually based on PlayORM (ORM is a bad name since it is more noSQL
patterns and not really relational).

http://www.nrel.gov/analysis/databus/

That Energy Databus project is mainly time series data with some meta data.  I 
think NREL may be holding an Energy Databus summit soon (though again it is 
100% time series data and they need to rename it to just Databus which has been 
talked about at NREL).

Dean

From: Les Hartzman lhartz...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, October 23, 2013 11:12 AM
To: user@cassandra.apache.org
Subject: Re: Wide rows (time series data) and ORM

Thanks Dean. I'll check that page out.

Les


On Wed, Oct 23, 2013 at 7:52 AM, Hiller, Dean dean.hil...@nrel.gov wrote:
PlayOrm supports different types of wide rows like embedded list in the object, 
etc. etc.  There is a list of nosql patterns mixed with playorm patterns on 
this page

http://buffalosw.com/wiki/patterns-page/

From: Les Hartzman lhartz...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Tuesday, October 22, 2013 1:18 PM
To: user@cassandra.apache.org
Subject: Wide rows (time series data) and ORM

As I'm becoming more familiar with Cassandra I'm still trying to shift my 
thinking from relational to NoSQL.

Can Kundera work with wide rows in an ORM manner? In other words, can you 
actually design a POJO that fits the standard recipe for JPA usage? Would the 
queries return collections of the POJO to handle wide row data?

I had considered using Spring and JPA for Cassandra, but it appears that other 
than basic configuration issues for Cassandra, to use Spring and JPA on a 
Cassandra database seems like an effort in futility if Cassandra is used as a 
NoSQL database instead of mimicking an RDBMS solution.

If anyone can shed any light on this, I'd appreciate it.

Thanks.

Les




Re: Wide rows (time series data) and ORM

2013-10-23 Thread Vivek Mishra
Hi,
CREATE TABLE sensor_data (
    sensor_id        text,
    date             text,
    data_time_stamp  timestamp,
    reading          int,
    PRIMARY KEY ((sensor_id, date), data_time_stamp)
);

Yes, you can create a POJO for this and map exactly with one row as a POJO
object.

Please have a look at:
https://github.com/impetus-opensource/Kundera/wiki/Using-Compound-keys-with-Kundera

There are users who have built production systems using Kundera, please refer to:
https://github.com/impetus-opensource/Kundera/wiki/Kundera-in-Production-Deployments


I am working as a core committer on Kundera, so please do let me know if you
have any questions.

Sincerely,
-Vivek



On Wed, Oct 23, 2013 at 10:41 PM, Les Hartzman lhartz...@gmail.com wrote:

 Hi Vivek,

 What I'm looking for are a couple of things as I'm gaining an
 understanding of Cassandra. With wide rows and time series data, how do you
 (or can you) handle this data in an ORM manner? Now I understand that with
 CQL3, doing a select * from time_series_data will return the data as
 multiple rows. So does handling this data equal the way you would deal with
 any mapping of objects to results in a relational manner? Would you still
 use a JPA approach or is there a Cassandra/CQL3-specific way of interacting
 with the database?

 I expect to use a compound key for partitioning/clustering. For example
 I'm planning on creating a table as follows:
   CREATE TABLE sensor_data (
     sensor_id        text,
     date             text,
     data_time_stamp  timestamp,
     reading          int,
     PRIMARY KEY ((sensor_id, date), data_time_stamp)
   );
 The 'date' field will be day-specific so that for each day there will be a
 new row created.

 So will I be able to define a POJO, SensorData, with the fields show above
 and basically process each 'row' returned by CQL as another SensorData
 object?

 Thanks.

 Les



 On Wed, Oct 23, 2013 at 1:22 AM, Vivek Mishra mishra.v...@gmail.com wrote:

 Can Kundera work with wide rows in an ORM manner?

 What specifically are you looking for? A composite column based implementation
 can be built using Kundera.
 With recent CQL3 developments, Kundera supports most of these. I think the
 POJO needs to be aware of the number of fields that need to be persisted
 (same as CQL3).

 -Vivek


 On Wed, Oct 23, 2013 at 12:48 AM, Les Hartzman lhartz...@gmail.com wrote:

 As I'm becoming more familiar with Cassandra I'm still trying to shift
 my thinking from relational to NoSQL.

 Can Kundera work with wide rows in an ORM manner? In other words, can
 you actually design a POJO that fits the standard recipe for JPA usage?
 Would the queries return collections of the POJO to handle wide row data?

 I had considered using Spring and JPA for Cassandra, but it appears that
 other than basic configuration issues for Cassandra, to use Spring and JPA
 on a Cassandra database seems like an effort in futility if Cassandra is
 used as a NoSQL database instead of mimicking an RDBMS solution.

 If anyone can shed any light on this, I'd appreciate it.

 Thanks.

 Les






Re: Wide rows (time series data) and ORM

2013-10-23 Thread Les Hartzman
Thanks Vivek. I'll look over those links tonight.



On Wed, Oct 23, 2013 at 4:20 PM, Vivek Mishra mishra.v...@gmail.com wrote:

 Hi,
 CREATE TABLE sensor_data (
     sensor_id        text,
     date             text,
     data_time_stamp  timestamp,
     reading          int,
     PRIMARY KEY ((sensor_id, date), data_time_stamp)
 );

 Yes, you can create a POJO for this and map exactly with one row as a POJO
 object.

 Please have a look at:

 https://github.com/impetus-opensource/Kundera/wiki/Using-Compound-keys-with-Kundera

 There are users who have built production systems using Kundera, please refer to:

 https://github.com/impetus-opensource/Kundera/wiki/Kundera-in-Production-Deployments


 I am working as a core committer on Kundera, so please do let me know if you
 have any questions.

 Sincerely,
 -Vivek



 On Wed, Oct 23, 2013 at 10:41 PM, Les Hartzman lhartz...@gmail.com wrote:

 Hi Vivek,

 What I'm looking for are a couple of things as I'm gaining an
 understanding of Cassandra. With wide rows and time series data, how do you
 (or can you) handle this data in an ORM manner? Now I understand that with
 CQL3, doing a select * from time_series_data will return the data as
 multiple rows. So does handling this data equal the way you would deal with
 any mapping of objects to results in a relational manner? Would you still
 use a JPA approach or is there a Cassandra/CQL3-specific way of interacting
 with the database?

 I expect to use a compound key for partitioning/clustering. For example
 I'm planning on creating a table as follows:
   CREATE TABLE sensor_data (
     sensor_id        text,
     date             text,
     data_time_stamp  timestamp,
     reading          int,
     PRIMARY KEY ((sensor_id, date), data_time_stamp)
   );
 The 'date' field will be day-specific so that for each day there will be
 a new row created.

 So will I be able to define a POJO, SensorData, with the fields show
 above and basically process each 'row' returned by CQL as another
 SensorData object?

 Thanks.

 Les



 On Wed, Oct 23, 2013 at 1:22 AM, Vivek Mishra mishra.v...@gmail.com wrote:

 Can Kundera work with wide rows in an ORM manner?

 What specifically are you looking for? A composite column based implementation
 can be built using Kundera.
 With recent CQL3 developments, Kundera supports most of these. I think the
 POJO needs to be aware of the number of fields that need to be persisted
 (same as CQL3).

 -Vivek


 On Wed, Oct 23, 2013 at 12:48 AM, Les Hartzman lhartz...@gmail.com wrote:

 As I'm becoming more familiar with Cassandra I'm still trying to shift
 my thinking from relational to NoSQL.

 Can Kundera work with wide rows in an ORM manner? In other words, can
 you actually design a POJO that fits the standard recipe for JPA usage?
 Would the queries return collections of the POJO to handle wide row data?

 I had considered using Spring and JPA for Cassandra, but it appears
 that other than basic configuration issues for Cassandra, to use Spring and
 JPA on a Cassandra database seems like an effort in futility if Cassandra
 is used as a NoSQL database instead of mimicking an RDBMS solution.

 If anyone can shed any light on this, I'd appreciate it.

 Thanks.

 Les







Re: Wide rows/composite keys clarification needed

2013-10-21 Thread Les Hartzman
So looking at Patrick McFadin's data modeling videos I now know about using
compound keys as a way of partitioning data on a by-day basis.

My other questions probably go more to the storage engine itself. How do
you refer to the columns in the wide row? What kind of names are assigned
to the columns?

Les
On Oct 20, 2013 9:34 PM, Les Hartzman lhartz...@gmail.com wrote:

 Please correct me if I'm not describing this correctly. But if I am
 collecting sensor data and have a table defined as follows:

  create table sensor_data (
    sensor_id int,
    time_stamp int,  // time to the hour granularity
    voltage float,
    amp float,
    PRIMARY KEY (sensor_id, time_stamp) );

 The partitioning value is the sensor_id and the rest of the PK components
 become part of the column name for the additional fields, in this case
 voltage and amp.

 What goes into determining what additional data is inserted into this row?
 The first time an insert takes place there will be one entry for all of the
 fields. Is there anything besides the sensor_id that is used to determine
 that the subsequent insertions for that sensor will go into the same row as
 opposed to starting a new row?

 Based on something I read (but can't currently find again), I thought that
 as long as all of the elements of the PK remain the same (same sensor_id
 and still within the same hour as the first reading), that the next
 insertion would be tacked onto the end of the first row. Is this correct?

 For subsequent entries into the same row for additional voltage/amp
 readings, what are the names of the columns for these readings? My
 understanding is that the column name becomes a concatenation of the
 non-row key field names plus the data field names.So if the first go-around
 you have time_stamp:voltage and time_stamp:amp, what do the
 subsequent column names become?

 Thanks.

 Les




Re: Wide rows/composite keys clarification needed

2013-10-21 Thread Jon Haddad
If you're working with CQL, you don't need to worry about the column names, 
it's handled for you.

If you specify multiple keys as part of the primary key, they become clustering 
keys and are mapped to the column names.  So if you have a sensor_id / 
time_stamp, all your sensor readings will be in the same row in the traditional 
cassandra sense, sorted by your time_stamp.

On Oct 21, 2013, at 4:27 PM, Les Hartzman lhartz...@gmail.com wrote:

 So looking at Patrick McFadin's data modeling videos I now know about using 
 compound keys as a way of partitioning data on a by-day basis.
 
 My other questions probably go more to the storage engine itself. How do you 
 refer to the columns in the wide row? What kind of names are assigned to the 
 columns?
 
 Les
 
 On Oct 20, 2013 9:34 PM, Les Hartzman lhartz...@gmail.com wrote:
 Please correct me if I'm not describing this correctly. But if I am 
 collecting sensor data and have a table defined as follows:
 
  create table sensor_data (
    sensor_id int,
    time_stamp int,  // time to the hour granularity
    voltage float,
    amp float,
    PRIMARY KEY (sensor_id, time_stamp) );
 
 The partitioning value is the sensor_id and the rest of the PK components 
 become part of the column name for the additional fields, in this case 
 voltage and amp.
 
 What goes into determining what additional data is inserted into this row? 
 The first time an insert takes place there will be one entry for all of the 
 fields. Is there anything besides the sensor_id that is used to determine 
 that the subsequent insertions for that sensor will go into the same row as 
 opposed to starting a new row?
 
 Based on something I read (but can't currently find again), I thought that as 
 long as all of the elements of the PK remain the same (same sensor_id and 
 still within the same hour as the first reading), that the next insertion 
 would be tacked onto the end of the first row. Is this correct?
 
 For subsequent entries into the same row for additional voltage/amp readings, 
 what are the names of the columns for these readings? My understanding is 
 that the column name becomes a concatenation of the non-row key field names 
 plus the data field names.So if the first go-around you have 
 time_stamp:voltage and time_stamp:amp, what do the subsequent column 
 names become? 
 
 Thanks.
 
 Les
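
To illustrate the mapping Jon describes, a sketch against the sensor_data
table quoted above (CQL3 non-compact storage; the values and the cell names
in the comments are illustrative):

-- Two logical CQL rows written into the same partition (sensor_id = 42):
INSERT INTO sensor_data (sensor_id, time_stamp, voltage, amp)
  VALUES (42, 1382331600, 220.1, 1.3);
INSERT INTO sensor_data (sensor_id, time_stamp, voltage, amp)
  VALUES (42, 1382335200, 219.8, 1.4);

-- Internally (pre-3.0 storage engine) the partition holds composite cells
-- named <clustering value>:<CQL column name>, roughly:
--   1382331600:voltage, 1382331600:amp, 1382335200:voltage, 1382335200:amp
-- so new time_stamps simply append cells to the same storage row; there is
-- no per-insert column naming for the application to manage.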
 



Re: Wide rows/composite keys clarification needed

2013-10-21 Thread Les Hartzman
What if you plan on using Kundera and JPQL and not CQL?

Les
On Oct 21, 2013 4:45 PM, Jon Haddad j...@jonhaddad.com wrote:

 If you're working with CQL, you don't need to worry about the column
 names, it's handled for you.

 If you specify multiple keys as part of the primary key, they become
 clustering keys and are mapped to the column names.  So if you have a
 sensor_id / time_stamp, all your sensor readings will be in the same row in
 the traditional cassandra sense, sorted by your time_stamp.

 On Oct 21, 2013, at 4:27 PM, Les Hartzman lhartz...@gmail.com wrote:

 So looking at Patrick McFadin's data modeling videos I now know about
 using compound keys as a way of partitioning data on a by-day basis.

 My other questions probably go more to the storage engine itself. How do
 you refer to the columns in the wide row? What kind of names are assigned
 to the columns?

 Les
 On Oct 20, 2013 9:34 PM, Les Hartzman lhartz...@gmail.com wrote:

 Please correct me if I'm not describing this correctly. But if I am
 collecting sensor data and have a table defined as follows:

  create table sensor_data (
    sensor_id int,
    time_stamp int,  // time to the hour granularity
    voltage float,
    amp float,
    PRIMARY KEY (sensor_id, time_stamp) );

 The partitioning value is the sensor_id and the rest of the PK components
 become part of the column name for the additional fields, in this case
 voltage and amp.

 What goes into determining what additional data is inserted into this
 row? The first time an insert takes place there will be one entry for all
 of the fields. Is there anything besides the sensor_id that is used to
 determine that the subsequent insertions for that sensor will go into the
 same row as opposed to starting a new row?

 Based on something I read (but can't currently find again), I thought that
 as long as all of the elements of the PK remain the same (same sensor_id
 and still within the same hour as the first reading), that the next
 insertion would be tacked onto the end of the first row. Is this correct?

 For subsequent entries into the same row for additional voltage/amp
 readings, what are the names of the columns for these readings? My
 understanding is that the column name becomes a concatenation of the
 non-row key field names plus the data field names. So if the first go-around
 you have time_stamp:voltage and time_stamp:amp, what do the
 subsequent column names become?

 Thanks.

 Les





Re: Wide rows/composite keys clarification needed

2013-10-21 Thread Les Hartzman
So I just saw a post about how Kundera translates all JPQL to CQL.


On Mon, Oct 21, 2013 at 4:45 PM, Jon Haddad j...@jonhaddad.com wrote:

 If you're working with CQL, you don't need to worry about the column
 names, it's handled for you.

 If you specify multiple keys as part of the primary key, they become
 clustering keys and are mapped to the column names.  So if you have a
 sensor_id / time_stamp, all your sensor readings will be in the same row in
 the traditional cassandra sense, sorted by your time_stamp.

 On Oct 21, 2013, at 4:27 PM, Les Hartzman lhartz...@gmail.com wrote:

 So looking at Patrick McFadin's data modeling videos I now know about
 using compound keys as a way of partitioning data on a by-day basis.

 My other questions probably go more to the storage engine itself. How do
 you refer to the columns in the wide row? What kind of names are assigned
 to the columns?

 Les
 On Oct 20, 2013 9:34 PM, Les Hartzman lhartz...@gmail.com wrote:

 Please correct me if I'm not describing this correctly. But if I am
 collecting sensor data and have a table defined as follows:

  create table sensor_data (
    sensor_id int,
    time_stamp int,  // time to the hour granularity
    voltage float,
    amp float,
    PRIMARY KEY (sensor_id, time_stamp) );

 The partitioning value is the sensor_id and the rest of the PK components
 become part of the column name for the additional fields, in this case
 voltage and amp.

 What goes into determining what additional data is inserted into this
 row? The first time an insert takes place there will be one entry for all
 of the fields. Is there anything besides the sensor_id that is used to
 determine that the subsequent insertions for that sensor will go into the
 same row as opposed to starting a new row?

 Based on something I read (but can't currently find again), I thought that
 as long as all of the elements of the PK remain the same (same sensor_id
 and still within the same hour as the first reading), that the next
 insertion would be tacked onto the end of the first row. Is this correct?

 For subsequent entries into the same row for additional voltage/amp
 readings, what are the names of the columns for these readings? My
 understanding is that the column name becomes a concatenation of the
 non-row key field names plus the data field names. So if the first go-around
 you have time_stamp:voltage and time_stamp:amp, what do the
 subsequent column names become?

 Thanks.

 Les





Re: Wide rows in CQL 3

2013-01-10 Thread Vegard Berget
Thanks for explaining, Sylvain.

You say that it is not a mandatory one - how long could we expect it to be
not mandatory? I think the new CQL stuff is great and I will probably use it
heavily. I understand the upgrade path, but my question is if I should start
planning for an all-CQL future, or if I still could make some CFs with
thrift and also expect them to work in 3 years' time. You say "you should see
CQL3 non compact tables as the new stuff, the thing that you use
post-upgrade" - but doesn't that mean that we also have to suddenly depend on
a schema? Let us for example say you have a logger, which logs all kinds of
different stuff - typically key-value - and that each row could contain
different keys.

ROWKEY1: key1: val1, key2: val2, key3: val3
ROWKEY2: key4: val4, key1: val2, keyN: valN

Is this possible without using multiple rows in CQL3 non compact tables?

.vegard,

- Original Message -
From: user@cassandra.apache.org
To: user@cassandra.apache.org
Cc:
Sent: Wed, 9 Jan 2013 23:14:25 +0100
Subject: Re: Wide rows in CQL 3

I'd be clear, CQL3 is meant as an upgrade from thrift. Not a mandatory one,
you can stick to thrift if you don't think CQL3 is better. But if you do
decide to upgrade, you should see CQL3 non compact tables as the new stuff,
the thing that you use post-upgrade. While you upgrade, stick to compact
tables. Once you've upgraded, then you can start using the new stuff and
accessing the new stuff the old way doesn't matter.

 -- Sylvain 



Re: Wide rows in CQL 3

2013-01-10 Thread aaron morton
 Is this possible without using multiple rows in CQL3 non compact tables?  
Depending on the number of (log record) keys you *could* do this as a map type 
in your CQL Table. 

create table log_row (
    sequence timestamp,
    props map<text, text>,
    PRIMARY KEY (sequence)
);

Cheers


-
Aaron Morton
Freelance Cassandra Developer
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 11/01/2013, at 1:58 AM, Vegard Berget p...@fantasista.no wrote:

 Thanks for explaining, Sylvain.
 You say that it is not a mandatory one, how long could we expect it to be 
 not mandatory?
 I think the new CQL stuff is great and I will probably use it heavily.  I 
 understand the upgrade path, but my question is if I should start planning 
 for an all-CQL future, or if I still could make some CFs with thrift and also 
 expect it to work in 3 years time.  You say you should see CQL3 non compact 
 tables as the new stuff, the thing that you use post-upgrade - but doesn't 
 that mean that we also have to suddenly depend on a schema?  Let us for 
 example say you have a logger, which logs all kinds of different stuff - 
 typically key-value - and that each row could contain different keys.
 ROWKEY1:  key1: val1, key2: val2, key3: val3
 ROWKEY2:  key4: val4, key1: val2, keyN: valN
 
 Is this possible without using multiple rows in CQL3 non compact tables?  
 
 .vegard,
 
 
 
 - Original Message -
 From:
 user@cassandra.apache.org
 
 To:
 user@cassandra.apache.org user@cassandra.apache.org
 Cc:
 
 Sent:
 Wed, 9 Jan 2013 23:14:25 +0100
 Subject:
 Re: Wide rows in CQL 3
 
 
 I'd be clear, CQL3 is meant as an upgrade from thrift. Not a mandatory one, 
 you
 can stick to thrift if you don't think CQL3 is better. But if you do decide to
 upgrade, you should see CQL3 non compact tables as the new stuff, the thing
 that you use post-upgrade. While you upgrade, stick to compact tables. Once
 you've upgraded, then you can start using the new stuff and accessing the new
 stuff the old way doesn't matter.
 
 
 
 
 --
 Sylvain
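
A sketch of how the map-based log_row table aaron suggests would be used for
the schemaless key/value logging described above (the keys and values are
invented):

-- Whole-map insert for one log record:
INSERT INTO log_row (sequence, props)
  VALUES ('2013-01-10 12:00:00', { 'key1': 'val1', 'key2': 'val2', 'key3': 'val3' });

-- Add or overwrite individual keys later:
UPDATE log_row SET props['key4'] = 'val4' WHERE sequence = '2013-01-10 12:00:00';

-- Remove a single key:
DELETE props['key2'] FROM log_row WHERE sequence = '2013-01-10 12:00:00';

-- Reads return the whole map (per-key slicing of collections is not
-- available in CQL3 at this point, as noted elsewhere in the thread):
SELECT props FROM log_row WHERE sequence = '2013-01-10 12:00:00';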
 



Re: Wide rows in CQL 3

2013-01-09 Thread Hiller, Dean
Probably should read this
http://www.datastax.com/dev/blog/cql3-for-cassandra-experts

I don't see wide row support going away since they specifically made the change 
to enable 2 billion columns in a row according to that paper.

Dean

From: mrevilgnome mrevilgn...@gmail.com
Reply-To: user@cassandra.apache.org
Date: Wednesday, January 9, 2013 9:51 AM
To: user user@cassandra.apache.org
Subject: Wide rows in CQL 3

We use the thrift bindings for our current production cluster, so I haven't 
been tracking the developments regarding CQL3. I just discovered when speaking 
to another potential DSE customer that wide rows, or rather columns not defined 
in the metadata aren't supported in CQL 3.

I'm curious to understand the reasoning behind this, whether this is an 
intentional direction shift away from the big table paradigm, and what's 
supposed to happen to those of us who have already bought into C* specifically 
because of the wide row support. What is our upgrade path?


Re: Wide rows in CQL 3

2013-01-09 Thread Ben Hood
I'm currently in the process of porting my app from Thrift to CQL3 and it
seems to me that the underlying storage layout hasn't really changed
fundamentally. The difference appears to be that CQL3 offers a neater
abstraction on top of the wide row format. For example, in CQL3, your query
results are bound to a specific schema, so you get named columns back -
previously you had to process the slices procedurally. The insert path
appears to be tighter as well - you don't seem to get away with leaving out
key attributes.

I'm sure somebody more knowledgeable can explain this better though.

Cheers,

Ben


On Wed, Jan 9, 2013 at 4:51 PM, mrevilgnome mrevilgn...@gmail.com wrote:

 We use the thrift bindings for our current production cluster, so I
 haven't been tracking the developments regarding CQL3. I just discovered
 when speaking to another potential DSE customer that wide rows, or rather
 columns not defined in the metadata aren't supported in CQL 3.

 I'm curious to understand the reasoning behind this, whether this is an
 intentional direction shift away from the big table paradigm, and what's
 supposed to happen to those of us who have already bought into
 C* specifically because of the wide row support. What is our upgrade path?



Re: Wide rows in CQL 3

2013-01-09 Thread Edward Capriolo
I ask myself this every day. CQL3 is new way to do things, including wide
rows with collections. There is no upgrade path. You adopt CQL3's sparse
tables as soon as you start creating column families from CQL. There is not
much backwards compatibility. CQL3 can query compact tables, but you may
have to remove the metadata from them so they can be transposed. Thrift can
not write into CQL tables easily, because of how the primary keys and
column names are encoded into the key column and compact metadata is not
equal to cql3's metadata.

http://www.datastax.com/dev/blog/thrift-to-cql3

For a large swath of problems I like how CQL3 deals with them. For example
you do not really need CQL3 to store a collection in a column family along
side other data. You can use wide rows for this, but the integrated
solution with CQL3 metadata is interesting.

My biggest beefs are:
1) column names are UTF8 (seems wasteful in most cases)
2) sparse empty row to ghost (seems like tiny rows with one column have
much overhead now)
3) using composites (with (compound primary keys) in some table designs) is
wasteful. Composite adds two unsigned bytes for size and one unsigned byte
as 0 per part.
4) many lines of code between user/request and actual disk. (tracing a CQL
select VS a slice, young gen, etc)
5) not sure if collections can be used in REALLY wide row scenarios. aka
1,000,000 entry set?

I feel that in an effort to be newbie friendly, sparse+CQL is presented as
the best default option.  However the 5 above items are not minor, and in
several use cases could make CQL's sparse tables a bad choice for certain
applications. Those users would get better performance from compact
storage. I feel that message sometimes gets washed away in all the CQL
coolness. What is that you say? This is not actually the most efficient
way to store this data? Well who cares I can do an IN CLAUSE! WooHoo!


On Wed, Jan 9, 2013 at 12:10 PM, Ben Hood 0x6e6...@gmail.com wrote:

 I'm currently in the process of porting my app from Thrift to CQL3 and it
 seems to me that the underlying storage layout hasn't really changed
 fundamentally. The difference appears to be that CQL3 offers a neater
 abstraction on top of the wide row format. For example, in CQL3, your query
 results are bound to a specific schema, so you get named columns back -
 previously you had to process the slices procedurally. The insert path
 appears to be tighter as well - you don't seem to get away with leaving out
 key attributes.

 I'm sure somebody more knowledgeable can explain this better though.

 Cheers,

 Ben


 On Wed, Jan 9, 2013 at 4:51 PM, mrevilgnome mrevilgn...@gmail.com wrote:

 We use the thrift bindings for our current production cluster, so I
 haven't been tracking the developments regarding CQL3. I just discovered
 when speaking to another potential DSE customer that wide rows, or rather
 columns not defined in the metadata aren't supported in CQL 3.

 I'm curious to understand the reasoning behind this, whether this is an
 intentional direction shift away from the big table paradigm, and what's
 supposed to happen to those of us who have already bought into
 C* specifically because of the wide row support. What is our upgrade path?
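
For context on the compact vs. sparse distinction debated in this thread, a
sketch of the two declarations in CQL3 (table and column names are
illustrative):

-- "Sparse" (default) CQL3 table: declared columns, composite cell names internally.
CREATE TABLE events (
    key     text,
    ts      timeuuid,
    payload blob,
    PRIMARY KEY (key, ts)
);

-- Thrift-style wide row with no extra per-cell overhead, but no ability to
-- add further non-key columns or collections later:
CREATE TABLE events_compact (
    key     text,
    ts      timeuuid,
    payload blob,
    PRIMARY KEY (key, ts)
) WITH COMPACT STORAGE;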





Re: Wide rows in CQL 3

2013-01-09 Thread Sylvain Lebresne
 There is no upgrade path.

I don't think that's true. The goal of the blog post you've linked is to
discuss that upgrade path (and in particular show that for the most part,
you
can access your thrift data from CQL3 without any modification whatsoever).

 You adopt CQL3's sparse tables as soon as you start creating column families
 from CQL.

That's not true, you can create non sparse tables from CQL3 (using COMPACT
STORAGE) and so you can work with both CQL3 and thrift alongside each other
for the time it takes you to upgrade from thrift to CQL3. Then, for things
that you know you will only access through CQL3 (i.e. when the upgrade is
complete), you can start using non compact tables and enjoy their convenience
(like collections for instance).

 There is not much backwards compatibility. CQL3 can query compact tables, but
 you may have to remove the metadata from them so they can be transposed.

I think "not much backwards compatibility" is a tad unfair. The only case
where you may have to remove the metadata is if you are using a CF in both a
static and dynamic way. Now I can't pretend to know what every user is doing,
but from my experience and what I've seen, this is not such a common thing
and CFs are either static or dynamic in nature, not both.

I do think that for most users upgrading from thrift to CQL3 won't require
any data migration or messing with metadata. But more importantly, things are
not completely closed. If you have *concrete* difficulties moving from thrift
to CQL3, please do share them on this mailing list and we'll try to help you
out.

 Thrift can not write into CQL tables easily, because of how the primary keys
 and column names are encoded into the key column and compact metadata is not
 equal to cql3's metadata.

I'd be clear, CQL3 is meant as an upgrade from thrift. Not a mandatory one,
you can stick to thrift if you don't think CQL3 is better. But if you do
decide to upgrade, you should see CQL3 non compact tables as the new stuff,
the thing that you use post-upgrade. While you upgrade, stick to compact
tables. Once you've upgraded, then you can start using the new stuff and
accessing the new stuff the old way doesn't matter.

 My biggest beefs are:
 1) column names are UTF8 (seems wasteful in most cases)

That's largely not true, the "wasteful in most cases" part at least. A column
name in CQL3 does not always translate to an internal column name. You can
still do your time series where the internal column name is an int and you
don't waste space.

As for the static cases, yes, CQL3 forces UTF8. I'm pretty certain that
people overwhelmingly use UTF8 or ascii in those cases. And because CQL3
forces you to declare your column names in those static cases, we may
actually be able to optimize the size used internally for those in the
future, which is harder with thrift, so I think we actually have the
potential to make it less wasteful in most cases.

 2) sparse empty row to ghost (seems like tiny rows with one column have much
 overhead now)

It is true that for non compact CQL3 we've focused on flexibility and on
making the behavior predictable, which does add some slight space overhead.
However:
- that's why compact storage is here. There is zero overhead over thrift if
  you use compact storage. That's even why we named it like that: it's
  compact.
- we know that most of the overhead of non compact tables can be won back by
  optimization of the storage engine. That's an advantage of having an API
  that is not too tied to the underlying storage: it gives room for
  optimizations.

 3) using composites (with (compound primary keys) in some table designs) is
 wasteful. Composite adds two unsigned bytes for size and one unsigned byte
 as 0 per part.

See above.

 4) many lines of code between user/request and actual disk. (tracing a CQL
 select VS a slice, young gen, etc)

If you are saying the implementation of CQL3 is more lines of code than the
thrift part, then you're probably right, but given how much more convenient
CQL3 is compared to thrift, I happily take that criticism.

But in terms of overhead, provided you use prepared statements (which you
should if you care about performance), then it remains to be proven that CQL3
has more overhead than thrift. In particular in terms of garbage (since
you're citing young gen), while I haven't tested it, I'd be *really*
surprised if thrift is generating less garbage than CQL3. And in terms of the
query tracing there is almost no difference whatsoever between the two.

 5) not sure if collections can be used in REALLY wide row scenarios. aka
 1,000,000 entry set?

Lists have their downsides (listed in the documentation) but for sets and
maps, they have no more limitations than wide rows have in theory. They do
have the limitation that, currently, the API doesn't allow fetching parts of
a collection. But that will change.

That being said and possibly more importantly, collections are *not* meant
to
be very wide. They are *not* meant for wide row scenarios. CQL3 has wide

Re: Wide rows in CQL 3

2013-01-09 Thread Edward Capriolo
By no upgrade path I mean to say if I have a table with compact storage I
can not upgrade it to sparse storage. If i have an existing COMPACT table
and I want to add a Map to it, I can not. This is what I mean by no upgrade
path.

Column families that mix static and dynamic columns are pretty common. In
fact it is pretty much the default case: you have a default validator, then
some columns have specific validators. In the old days people used to say
"you only need one column family" - you would subdivide your row key into
parts: username=username, password=password, friend-friend = friends,
pet-pets = pets. It's very efficient and very easy if you understand what a
slice is. Is everyone else just adding a column family every time they have
new data? :) Sounds very un-no-sql-like.

Most people are probably going to store column names as tersely as possible.
You're not going to store "password" as a multibyte UTF8("password"). You
store it as ascii("password") (or really ascii('pw')).

Also, for the rest of my comment, I meant that the comparator of any sparse
table always seems to be a COMPOSITE even if it has only one part (last I
checked). Everything is -COMPOSITE(UTF-8(colname))- at minimum, whereas in a
compact table it is just -colname-.

My overarching point is that the 5 things I listed do have a cost, and the
user gets sparse storage by default unless they are smart enough to know they
do not want it. This is naturally going to force people away from compact
storage.

Basically, for any column family there are two possible decision paths:

1) use compact
2) use sparse

Other than ease of use, why would I choose sparse? Why should it be the
default?

On Wed, Jan 9, 2013 at 5:14 PM, Sylvain Lebresne sylv...@datastax.comwrote:

 c way. Now I can't pretend knowing what every user is doing, but from
 my experience and what I've seen, this is not such a common thing and CF
 are
 either static or dynamic in nature, not both.



Re: Wide rows in CQL 3

2013-01-09 Thread Edward Capriolo
Also I have to say I do not get that blank sparse column.

Ghost ranges are a little weird but they don't bother me.

1) It's a row of nothing. The definition of a waste.

2) Suppose I have 1 billion rows and my distribution is mostly rows of 1 or 2
columns. My database is now significantly bigger. That stinks.

3) Suppose I write columns frequently. Do I constantly need to keep writing
this sparse empty row? It seems like I would. Worst case, each sstable with a
write to a row key also has this sparse column, meaning multiple blank,
wasteful columns on disk to solve ghosts, which do not bother me anyway.

4) Are these sparse columns also taking memtable space?

These questions would give me serious pause about using sparse tables.








Re: Wide rows in CQL 3

2013-01-09 Thread Janne Jalkanen

On 10 Jan 2013, at 01:30, Edward Capriolo edlinuxg...@gmail.com wrote:

 Column families that mix static and dynamic columns are pretty common. In 
 fact it is pretty much the default case, you have a default validator then 
 some columns have specific validators. In the old days people used to say 
 You only need one column family you would subdivide your row key into parts 
 username=username, password=password, friend-friene = friends, pet-pets = 
 pets. It's very efficient and very easy if you understand what a slice is. Is 
 everyone else just adding a column family every time they have new data? :) 
 Sounds very un-no-sql-like. 

Well, we for sure are heavily mixing static and dynamic columns; it's quite 
useful, really. Which is why upgrading to CQL3 isn't really something I've 
considered seriously at any point.

 Most people are probably going to store column names as tersely as possible. 
 Your not going to store password as a multibyte UTF8(password). You store 
 it as ascii(password). (or really ascii('pw')

UTF8('password') === ascii('password'), actually. As long as you stay within
the ascii range, UTF8 and ascii are equal byte for byte. It's not until code
points at or above 128 that you start getting multibyte sequences.
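
For example, 'password' serializes to the same eight bytes (70 61 73 73 77 6f
72 64 in hex) under AsciiType and UTF8Type alike; a character such as 'é' (code
point 233) is where UTF-8 starts needing two bytes (c3 a9).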

Having said that, doesn't the sparse storage lend itself really well to
further column name optimisation - like using a single byte to denote the
column name and then having a lookup table? The server could do a lot of nice
tricks in this area when a tighter schema affords it the chance. Also, I think
that compression pretty much does this already - the effect is the same even
if the mechanism is different.

/Janne



Re: Wide rows and reads

2012-07-05 Thread Philip Shon
From what I understand, wide rows have quite a bit of overhead, especially
if you are picking columns that are far apart from each other for a given
row.

This post by Aaron Morton was quite good at explaining this issue
http://thelastpickle.com/2011/07/04/Cassandra-Query-Plans/

-Phil

On Thu, Jul 5, 2012 at 12:17 PM, Oleg Dulin oleg.du...@gmail.com wrote:

 Here is my flow:

 One process writes a really wide row (250K+ supercolumns, each with 5
 subcolumns, for a total of 1K or so per supercolumn).

 A second process comes in literally 2-3 seconds later and starts reading
 from it.

 My observation is that nothing good happens. It is ridiculously slow to
 read. It seems that if I wait long enough, the reads from that row will be
 much faster.

 Could someone enlighten me as to what exactly happens when I do this?

 Regards,
 Oleg





Re: Wide rows or tons of rows?

2010-10-11 Thread Edward Capriolo
2010/10/11 Héctor Izquierdo Seliva izquie...@strands.com:
 Hi everyone.

 I'm sure this question or a similar one has come up before, but I can't find
 a clear answer. I have to store an unknown number of items in cassandra,
 which can vary from a few hundred to a few million per customer.

 I read that in cassandra wide rows are better than a lot of rows, but then I
 face two problems. First, column distribution. The only way I can think of to
 distribute items among a given set of rows is hashing the item id to a row
 id, and then using the item id as the column name. In this way, I can
 distribute data among a few rows evenly, but if there are only a few items
 it's equivalent to a row per item plus more overhead, and if there are
 millions of items then the rows are too big, and I have to turn off the row
 cache. Does anybody know a way around this?

 The second issue is that in my benchmarks, once the data is mmapped, one item
 per row performs faster than wide rows by a significant margin. Is this how
 it is supposed to be?

 I can give additional data if needed. English is not my first language so I
 apologize beforehand if some of this doesn't make sense.

 Thanks for your time
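
In later CQL terms, the bucketing scheme described in the question above would
look roughly like this (the bucket count and all names are invented for
illustration):

CREATE TABLE items_by_bucket (
    customer_id text,
    bucket int,                         -- e.g. hash(item_id) % 16
    item_id text,
    data blob,
    PRIMARY KEY ((customer_id, bucket), item_id)
);

-- the application computes the bucket on writes; on reads it either
-- recomputes it from item_id or queries all buckets and merges
SELECT item_id, data FROM items_by_bucket WHERE customer_id = ? AND bucket = ?;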


If you have wide rows, RowCache is a problem. IMHO RowCache is only viable in
situations where you have a fixed amount of data and thus will get a high hit
rate. I was running a large row cache for some time and I found it
unpredictable. It causes memory pressure on the JVM from moving things in and
out of memory, and if the hit rate is low, taking a key and all its columns in
and out repeatedly ends up being counterproductive for disk utilization. I
suggest KeyCache in most situations (there is a ticket open for a fractional
row cache).

Another factor to consider is that if you have many rows and many columns you
end up with larger indexes. In our case we have start-up times slightly longer
than we would like because the process of sampling indexes during start-up is
intensive. If I could do it all over again I might serialize more into single
columns rather than exploding data across multiple rows and columns. If you
always need to look up the entire row, do not break it down by columns.
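
A sketch of that last trade-off, written in CQL for brevity even though this
2010 thread predates it (all names are invented):

-- exploded: one cell per attribute, easy to read a single field
CREATE TABLE items_exploded (
    customer_id uuid,
    item_id text,
    field text,
    value text,
    PRIMARY KEY ((customer_id), item_id, field)
);

-- serialized: the whole item as one value (e.g. a JSON blob),
-- cheaper when you always read the item whole
CREATE TABLE items_serialized (
    customer_id uuid,
    item_id text,
    payload text,
    PRIMARY KEY ((customer_id), item_id)
);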

Memory mapping: there are different dynamics depending on data size relative
to memory size. You may have something like ~40GB of data and a 10GB index
with 32GB RAM per node; that system is not going to respond the same way as
one with, say, 200GB of data and 25GB of indexes. Also it is very workload
dependent.

Hope this helps,
Edward


Re: Wide rows or tons of rows?

2010-10-11 Thread Héctor Izquierdo Seliva
On Mon, 11-10-2010 at 11:08 -0400, Edward Capriolo wrote:

Inlined:

 2010/10/11 Héctor Izquierdo Seliva izquie...@strands.com:
  Hi everyone.
 
  I'm sure this question or similar has come up before, but I can't find a
  clear answer. I have to store a unknown number of items in cassandra,
  which can vary from a few hundreds to a few millions per customer.
 
  I read that in cassandra wide rows are better than a lot of rows, but
  then I face two problems. First, column distribution. The only way I can
  think of distributing items among a given set of rows is hashing the
  item id to a row id, and the using the item id as the column name. In
  this way, I can distribute data among a few rows evenly, but If there
  are only a few items it's equivalent to a row per item plus more
  overhead, and if there are millions of items then the rows are to big,
  and I have to turn off row cache. Does anybody knows a way around this?
 
  The second issue is that in my benchmarks, once the data is mmapped, one
  item per row performs faster than wide rows by a significant margin. Is
  this how it is supposed to be?
 
  I can give additional data if needed. English is not my first language
  so I apologize beforehand is some of this doesn't make sense.
 
  Thanks for your time
 
 
 If you have wide rows RowCache is a problem. IMHO RowCache is only
 viable in situations where you have a fixed amount of data and thus
 will get a high hit rate. I was running a large row cache for some
 time and I found it unpredictable. It causes memory pressure on the
 JVM from moving things in and out of memory, and if the hit rate is
 low taking a key and all its columns in and out repeatedly ends up
 being counter productive for disk utilization. Suggest KeyCache in
 most situations, (there is a ticket opened for a fractional row cache)

I saw the same behavior. It's a pity there isn't a column cache. That would
be awesome.

 Another factor to consider is if you have many rows and many columns
 you end up with large (er) indexes. In our case we have start up times
 slightly longer then we would like because the process of sampling
 indexes during start up is intensive. If I could do it all over again
 I might serialize more into single columns rather then exploding data
 across multiple rows and columns. If you always need to look up the
 entire row do not break it down by columns.

So it might be better to store a JSON-serialized version then? I was using
SuperColumns to store item info, but a simple string might give me the option
to do some compression.

 memory mapping. There are different dynamics depending on data size
 relative to memory size. You may have something like ~ 40GB of data
 and 10GB index, 32GB RAM a node, this system is not going to respond
 the same way with say 200GB data 25 GB Indexes. Also it is very
 workload dependent.

We have a 6 node cluster with 16 GB RAM  each, although the whole
dataset is expected to be around 100GB per machine. Which indexes are
more expensive, row or column indexes?

 Hope this helps,
 Edward

It does!




Re: Wide rows or tons of rows?

2010-10-11 Thread Jeremy Davis
Thanks for this reply. I'm wondering about the same issue... Should I bucket
things into wide rows (say 10M rows), or narrow ones (say 10K or 100K)? Of
course it depends on my access patterns, right...

Does anyone know if a partial row cache is a feasible feature to implement?
My use case is something like:
I have rows with 10MB / 100K Columns of data. I _typically_ slice from
oldest to newest on the row, and _typically_ only need the first 100 columns
/ 10KB, etc...
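
Written as CQL for brevity (the thread predates it, and the names are
invented), that access pattern is a head-of-partition slice:

CREATE TABLE jd_events (
    row_key text,
    ts timeuuid,
    data blob,
    PRIMARY KEY ((row_key), ts)   -- default ascending order: the head of the row is the oldest data
);

SELECT ts, data FROM jd_events WHERE row_key = ? LIMIT 100;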

If someone went to implement a cache strategy to support this, would they
find it feasible, or difficult/impossible because of some limitation xyz?

-JD






Re: Wide rows or tons of rows?

2010-10-11 Thread Aaron Morton
No idea about a partial row cache, but I would start with fat rows in your use
case. If you find that performance is really a problem then you could add a
second "recent / oldest" CF that you maintain with the most recent entries and
use the row cache there. OR add more nodes.

Aaron

On 12 Oct 2010, at 10:08 AM, Jeremy Davis jerdavis.cassan...@gmail.com wrote:

 Thanks for this reply. I'm wondering about the same issue... Should I bucket
 things into Wide rows (say 10M rows), or narrow (say 10K or 100K).. Of course
 it depends on my access patterns right...

 Does anyone know if a partial row cache is a feasible feature to implement?
 My use case is something like:
 I have rows with 10MB / 100K Columns of data. I _typically_ slice from oldest
 to newest on the row, and _typically_ only need the first 100 columns / 10KB,
 etc...

 If someone went to implement a cache strategy to support this, would they
 find it feasible, or difficult/impossible because of some limitation xyz
-JD
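
A sketch of the "recent / oldest" split Aaron suggests above, in CQL for
brevity (names are invented, and the row cache would be enabled only on the
small table via its per-table caching options, whose syntax depends on the
Cassandra version):

-- full history: fat rows, no row cache
CREATE TABLE events_all (
    row_key text,
    ts timeuuid,
    data blob,
    PRIMARY KEY ((row_key), ts)
);

-- small "recent" table the application also writes to, trimmed by TTL,
-- and the only one worth putting behind the row cache
CREATE TABLE events_recent (
    row_key text,
    ts timeuuid,
    data blob,
    PRIMARY KEY ((row_key), ts)
) WITH CLUSTERING ORDER BY (ts DESC);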