Re: Good partition key doubt

2014-12-15 Thread José Guilherme Vanz
Nice, I got it. =]
If I have more questions I'll send more emails. xD
Thank you


-- 
Att. José Guilherme Vanz
br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
"Suffering is temporary, giving up is forever" - Bernardo Fonseca,
Antarctic Ice Marathon record holder.


Good partition key doubt

2014-12-11 Thread José Guilherme Vanz
Hello folks

I have been studying Cassandra for a short period of time, and now I am
modeling a database for study purposes. During my modeling I have run into a
question: what is a good partition key? Is the partition key directly related
to my query performance? What are the best practices?

As a study case, let's suppose I have a column family into which all kinds of
logs (HTTP server, application server, application logs, etc.) from different
servers are inserted. In this column family I have a server_id column (a
unique identifier for each server), a log_type column (http server,
application server, application log) and a log_info column. Is it a good idea
to create a partition key from the server_id and log_type columns, to store
all log data of a specific type and server in one physical row? And what if I
want a physical row for each day? Is it a good idea to add a third column with
the date to the partition key? And if I want to query all logs in a period of
time, how can I select a range of rows? Do I have to duplicate the date column
(considering I have to use the = operator with the partition key)?

All the best
-- 
Att. José Guilherme Vanz
br.linkedin.com/pub/josé-guilherme-vanz/51/b27/58b/
"Suffering is temporary, giving up is forever" - Bernardo Fonseca,
Antarctic Ice Marathon record holder.


Re: Good partition key doubt

2014-12-11 Thread DuyHai Doan
 what is a good partition key? Is the partition key directly related to my
 query performance? What are the best practices?

A good partition key is one that will scale with your data. An example: if
you have a business involving individuals, it is likely that your business
will scale as the number of users grows. In this case user_id is a good
partition key, because all the users will be uniformly distributed over all
the Cassandra nodes.
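As a minimal sketch (the table and its columns are hypothetical, just to
illustrate the idea):

CREATE TABLE users(
   user_id uuid,
   name text,
   email text,
   PRIMARY KEY((user_id))
);

Since user_id alone is the partition key, each row is placed by the hash of
user_id, so users spread evenly across the ring.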

For your log example, using only server_id as the partition key is clearly
not enough, because what will scale is the number of log lines, not the
number of servers.

From the point of view of scalability (not talking about query-ability),
adding log_type will not help either, because the number of different log
types is likely to be a small set. For great scalability (again, not talking
about query-ability), the pair (server_id, log_timestamp) is likely a good
combination.

Now for queries: as you should know, it is not possible to have a range query
(using <, ≤, ≥, >) over the partition key; you must always use equality (=),
so you won't be able to leverage the log_timestamp component of the partition
key for your query.

Bucketing by date is a good idea though, and the date resolution will depend
on the log generation rate. If logs are generated very often, maybe a bucket
by hour; if the generation rate is lower, a day or a week bucket is fine.

Talking about log_type: putting it into the partition key will help partition
the data further, in addition to the date bucket. However, it forces you to
always provide a log_type whenever you want to query; be aware of this.

An example of a data model for your logs could be:

CREATE TABLE logs_by_server_and_type_and_date(
   server_id int,
   log_type text,
   date_bucket int, // date bucket using format YYYYMMDD or YYYYMMDDHH or ...
   log_timestamp timeuuid,
   log_info text,
   PRIMARY KEY((server_id, log_type, date_bucket), log_timestamp)
);
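With this table, a query must fix the whole partition key with equality and
can then range over the clustering column. A sketch, with hypothetical values
and assuming the YYYYMMDD bucket format above:

SELECT log_timestamp, log_info
FROM logs_by_server_and_type_and_date
WHERE server_id = 1
  AND log_type = 'http'
  AND date_bucket = 20141211
  AND log_timestamp >= minTimeuuid('2014-12-11 08:00:00')
  AND log_timestamp <= maxTimeuuid('2014-12-11 12:00:00');

minTimeuuid/maxTimeuuid produce boundary timeuuids meant precisely for such
range conditions.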


 And if I want to query all logs in a period of time, how can I select a
 range of rows? --> New query path = new table:

CREATE TABLE logs_by_date(
   date_bucket int, // date bucket using format YYYYMMDD or YYYYMMDDHH or ...
   log_timestamp timeuuid,
   server_id int,
   log_type text,
   log_info text,
   PRIMARY KEY((date_bucket), log_timestamp) // you may optionally add
      // server_id or log_type as clustering columns
);

For this table, the date_bucket should be chosen very carefully, because
within the same bucket we're going to store logs of ALL servers and of all
types ...

For the query, you should provide the date bucket as the partition key, and
then use <, ≤, ≥, > on the log_timestamp column.
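Concretely, assuming a YYYYMMDD day bucket, a sketch of such a query (values
hypothetical):

SELECT server_id, log_type, log_info
FROM logs_by_date
WHERE date_bucket = 20141211
  AND log_timestamp >= minTimeuuid('2014-12-11 06:00:00')
  AND log_timestamp <= maxTimeuuid('2014-12-11 18:00:00');

A period spanning several buckets means one such query per bucket, issued
from the client side.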

