Re: Example Data Modelling

2015-07-11 Thread Jérôme Mainaud
Hi Carlos,

Columns that are part of the primary key, like EmpID, can't be static.
But remember that EmpID, being the partition key, is not duplicated anyway.

-- 
Jérôme Mainaud
jer...@mainaud.com

2015-07-07 16:02 GMT+02:00 Carlos Alonso i...@mrcalonso.com:

 Hi Jerome,

 Good point!! Really a nice usage of static columns! BTW, wouldn't the
 EmpID be static as well?

 Cheers

 Carlos Alonso | Software Engineer | @calonso https://twitter.com/calonso

 On 7 July 2015 at 14:42, Jérôme Mainaud jer...@mainaud.com wrote:

 Hello,

 You can slightly adapt Carlos's answer to reduce the repetition of data that
 doesn't change from month to month.
 Static columns are great for this.

 The table becomes:

 CREATE TABLE salaries (
   EmpID varchar,
  FN varchar static,
  LN varchar static,
  Phone varchar static,
  Address varchar static,
  month int,
  basic int,
   flexible_allowance float,
   PRIMARY KEY(EmpID, month)
 )

 There is only one copy of each static column per partition; its value is shared
 between all rows of the partition.
 When employee data changes, you can update it with only the partition key in the
 WHERE clause.
 When you insert a new month entry, you just fill the non-static columns.
 The table can be queried the same way as the original one.
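
 As a quick illustration (just a sketch; the employee ID and values below are
 made up), updating the shared employee data and adding a monthly row could
 look like this:

 UPDATE salaries SET Phone = '0102030405', Address = 'new address'
 WHERE EmpID = 'emp001';

 INSERT INTO salaries (EmpID, month, basic, flexible_allowance)
 VALUES ('emp001', 201507, 50000, 1200.0);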

 Cheers



 --
 Jérôme Mainaud
 jer...@mainaud.com

 2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services 
 rory.bramw...@devopservices.com:

 Hi,

 I've been following this thread and my thoughts are in line with Carlos'
 latest response... Model your data to suit your queries. That is one of
 the data model / design considerations in Cassandra that differs from the
 RDBMS world. Embrace denormalization and data duplication. Disk space is
 cheap, so exploit how your data is laid out in order to optimize for
 faster reads (which are more costly than writes).

 Regards,

 Rory Bramwell
 Founder and CEO
 DevOp Services

 Skype: devopservices
 Email: rory.bramw...@devopservices.com
 Web: www.devopservices.com
 On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote:

 I guess you're right: using my proposal, getting an employee's latest record
 is straightforward and quick, but, as Peter pointed out, getting all slips
 for a particular month requires you to know all the employee IDs and,
 ideally, run a query for each employee. Whether this works depends on how
 many employees you're managing.

 At this moment I'm beginning to feel that maybe using both approaches
 is the best way to go. And I think this is one of Cassandra's
 recommendations: Write your data in several formats if required to fit your
 reads. Therefore I'd use my suggestion for getting a salary by employee ID
 and I'd also have Peter's one to run the end of the month query.
 Does it make sense?

 Cheers!

 Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso

 On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is: how should the app populate the duplicate data?
 I.e., if I have an employee record (along with his FN, LN, ...) for the month
 of Apr and later I am populating the same record for the month of May (with
 the salary changed), should my application first read/fetch the corresponding
 data for Apr and re-insert it with the modification for the month of May?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query
 you want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable 
 performance.



 For this type of query to be fast you can use the “month” column as
 the partition key and the “EmpID” column as the clustering column.
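
 For example, a sketch of that alternative table (column names borrowed from
 Carlos's model, table name assumed) and the end-of-month query against it:

 CREATE TABLE salaries_by_month (
   month int,
   EmpID varchar,
   FN varchar,
   LN varchar,
   basic int,
   flexible_allowance float,
   PRIMARY KEY (month, EmpID)
 );

 SELECT EmpID, FN, LN, basic FROM salaries_by_month WHERE month = 1;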

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small
 number of nodes will hold all of the salary data for a specific month 
 which
 might cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right. In Cassandra you should favor denormalisation
 where in an RDBMS you would find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (
   EmpID varchar,
   FN varchar,
   LN varchar,
   Phone varchar,
   Address varchar,
   month int,
   basic int,
   flexible_allowance float,
   PRIMARY KEY(EmpID, month)
 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.
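
 For instance (a sketch; the employee ID is a placeholder), both the full salary
 history and the latest slip for one employee stay single-partition reads:

 SELECT month, basic, flexible_allowance FROM salaries WHERE EmpID = 'emp001';

 SELECT month, basic, flexible_allowance FROM salaries
 WHERE EmpID = 'emp001' ORDER BY month DESC LIMIT 1;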



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com

Re: Example Data Modelling

2015-07-07 Thread Jérôme Mainaud
Hello,

You can slightly adapt Carlos's answer to reduce the repetition of data that
doesn't change from month to month.
Static columns are great for this.

The table becomes:

CREATE TABLE salaries (
  EmpID varchar,
  FN varchar static,
  LN varchar static,
  Phone varchar static,
  Address varchar static,
  month int,
  basic int,
  flexible_allowance float,
  PRIMARY KEY(EmpID, month)
)

There is only one copy of each static column per partition; its value is shared
between all rows of the partition.
When employee data changes, you can update it with only the partition key in the
WHERE clause.
When you insert a new month entry, you just fill the non-static columns.
The table can be queried the same way as the original one.

Cheers



-- 
Jérôme Mainaud
jer...@mainaud.com

2015-07-07 11:51 GMT+02:00 Rory Bramwell, DevOp Services 
rory.bramw...@devopservices.com:

 Hi,

 I've been following this thread and my thoughts are in line with Carlos'
 latest response... Model your data to suit your queries. That is one of
 the data model / design considerations in Cassandra that differs from the
 RDBMS world. Embrace denormalization and data duplication. Disk space is
 cheap, so exploit how your data is laid out in order to optimize for
 faster reads (which are more costly than writes).

 Regards,

 Rory Bramwell
 Founder and CEO
 DevOp Services

 Skype: devopservices
 Email: rory.bramw...@devopservices.com
 Web: www.devopservices.com
 On Jul 7, 2015 4:02 AM, Carlos Alonso i...@mrcalonso.com wrote:

 I guess you're right: using my proposal, getting an employee's latest record
 is straightforward and quick, but, as Peter pointed out, getting all slips
 for a particular month requires you to know all the employee IDs and,
 ideally, run a query for each employee. Whether this works depends on how
 many employees you're managing.

 At this moment I'm beginning to feel that maybe using both approaches is
 the best way to go. And I think this is one of Cassandra's recommendations:
 Write your data in several formats if required to fit your reads. Therefore
 I'd use my suggestion for getting a salary by employee ID and I'd also have
 Peter's one to run the end of the month query.
 Does it make sense?

 Cheers!

 Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso

 On 7 July 2015 at 09:07, Srinivasa T N seen...@gmail.com wrote:

 Thanks for the inputs.

 Now my question is: how should the app populate the duplicate data? I.e.,
 if I have an employee record (along with his FN, LN, ...) for the month of
 Apr and later I am populating the same record for the month of May (with
 the salary changed), should my application first read/fetch the corresponding
 data for Apr and re-insert it with the modification for the month of May?

 Regards,
 Seenu.

 On Tue, Jul 7, 2015 at 11:32 AM, Peer, Oded oded.p...@rsa.com wrote:

  The data model suggested isn’t optimal for the “end of month” query
 you want to run since you are not querying by partition key.

 The query would look like “select EmpID, FN, LN, basic from salaries
 where month = 1” which requires filtering and has unpredictable 
 performance.



 For this type of query to be fast you can use the “month” column as the
 partition key and the “EmpID” column as the clustering column.

 This approach also has drawbacks:

 1. This data model creates a wide row. Depending on the number of
 employees this partition might be very large. You should limit partition
 sizes to 25MB

 2. Distributing data according to month means that only a small number
 of nodes will hold all of the salary data for a specific month which might
 cause hotspots on those nodes.



 Choose the approach that works best for you.





 *From:* Carlos Alonso [mailto:i...@mrcalonso.com]
 *Sent:* Monday, July 06, 2015 7:04 PM
 *To:* user@cassandra.apache.org
 *Subject:* Re: Example Data Modelling



 Hi Srinivasa,



 I think you're right. In Cassandra you should favor denormalisation
 where in an RDBMS you would find a relationship like this.



 I'd suggest a cf like this

 CREATE TABLE salaries (
   EmpID varchar,
   FN varchar,
   LN varchar,
   Phone varchar,
   Address varchar,
   month int,
   basic int,
   flexible_allowance float,
   PRIMARY KEY(EmpID, month)
 )



 That way the salaries will be partitioned by EmpID and clustered by
 month, which I guess is the natural sorting you want.



 Hope it helps,

 Cheers!


   Carlos Alonso | Software Engineer | @calonso
 https://twitter.com/calonso



 On 6 July 2015 at 13:01, Srinivasa T N seen...@gmail.com wrote:

 Hi,

   I have a basic doubt: I have an RDBMS with the following two tables:

Emp - EmpID, FN, LN, Phone, Address
Sal - Month, Empid, Basic, Flexible Allowance

My use case is to print the Salary slip at the end of each month and
 the slip contains emp name and his other details.

   Now, if I want to have the same in Cassandra, I will have a single
 cf with emp personal details and his salary details. Is this the right
 approach?

Re: Upgrading to SSD

2016-04-26 Thread Jérôme Mainaud
Hello,

Maybe you should call « nodetool drain » just before stopping the node.
As it flushes the memtables, the commit log will be empty and therefore easier to
move.

https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsDrain.html


-- 
Jérôme Mainaud
jer...@mainaud.com

2016-04-26 8:44 GMT+02:00 Anuj Wadehra <anujw_2...@yahoo.co.in>:

> Thanks All !!
>
> Anuj
> 
> On Mon, 25/4/16, Alain RODRIGUEZ <arodr...@gmail.com> wrote:
>
>  Subject: Re: Upgrading to SSD
>  To: user@cassandra.apache.org
>  Date: Monday, 25 April, 2016, 2:45 PM
>
>  Hi Anuj,
>
>  You could do the following instead to minimize server downtime:
>
>  1. rsync while the server is running
>  2. rsync again to get any new files
>  3. shut server down
>  4. rsync for the 3rd time
>  5. change directory in yaml and start back up
>
>  +1
>
>  Here are some more details about that process and a script doing most of the job:
>  thelastpickle.com/blog/2016/02/25/removing-a-disk-mapping-from-cassandra.html
>
>  Hope it will be useful to you
>
>  C*heers,
>  ---
>  Alain Rodriguez - alain@thelastpickle.com
>  France
>
>  The Last Pickle - Apache Cassandra Consulting
>  http://www.thelastpickle.com
>  2016-04-23 21:47 GMT+02:00 Jonathan Haddad <j...@jonhaddad.com>:
>
>  You could do the following instead to minimize server downtime:
>
>  1. rsync while the server is running
>  2. rsync again to get any new files
>  3. shut server down
>  4. rsync for the 3rd time
>  5. change directory in yaml and start back up
>
>
>  On Sat, Apr 23, 2016 at 12:23 PM Clint Martin
>  <clintlmar...@coolfiretechnologies.com>
>  wrote:
>  As long as you shut down the node before you start
>  copying and moving stuff around it shouldn't matter if
>  you take backups or snapshots or whatever.
>  When you add the filesystem for the ssd will
>  you be removing the existing filesystem? Or will you be able
>  to keep both filesystems mounted at the same time for the
>  migration?  If you can keep them both at the same time then
>  an off-system backup isn't strictly necessary.  Just
>  change your data dir config in your yaml. Copy the data and
>  commitlog from the old dir to the new ssd and restart the node.
>
>  If you can't keep both filesystems mounted
>  concurrently then a remote system is necessary to copy the
>  data to. But the steps and procedure are the same.
>  Running repair before you do the migration
>  isn't strictly necessary. Not a bad idea if you
>  don't mind spending the time. Definitely run repair
>  after you restart the node, especially if you take longer
>  than the hint interval to perform the work.
>  As for your filesystems, there is really
>  nothing special to worry about. Depending on which
>  filesystem you use there are recommendations for tuning and
>  configuration that you should probably follow.
>  (Datastax's recommendations as well as Al Tobey's
>  tuning guide are great resources.
> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html
>  )
>
>
>  Clint
>  On Apr 23, 2016 3:05 PM, "Anuj Wadehra" <anujw_2...@yahoo.co.in> wrote:
>
>  Hi
>
>  We have a 3 node cluster of 2.0.14. We use Read/Write Quorum and RF is 3.
>  We want to move the data and commitlog directory from a SATA HDD to SSD.
>  We have planned to do a rolling upgrade.
>
>  We plan to run repair -pr on all nodes to sync data upfront and then
>  execute the following steps on each server one by one:
>
>  1. Take a backup of the data/commitlog directory to an external server.
>  2. Change mount points so that the Cassandra data/commitlog directory now
>  points to the SSD.
>  3. Copy files from the external backup to the SSD.
>  4. Start Cassandra.
>  5. Run a full repair on the node before starting step 1 on the next node.
>
>  Questions:
>  1. Is copying the commitlog and data directory good enough, or should we
>  take a snapshot of each node and restore data from that snapshot?
>  2. Any precautions we need to take while moving data to the new SSD?
>  File system format of the two disks, etc.
>  3. Should we drain data before taking the backup? We are also restoring
>  the commitlog directory from backup.
>  4. I have added a repair to sync full data upfront and a repair after
>  restoring data on each node. Sounds safe and logical?
>  5. Any problems you see with the mentioned approach? Any better approach?
>
>  Thanks
>  Anuj
>
>  Sent from Yahoo Mail on Android
>
>
>


Re: Why simple replication strategy for system_auth ?

2016-05-17 Thread Jérôme Mainaud
Thank you for your answer.

What I still don't understand is why auth data is not managed in the same
way as schema metadata.
Both must be accessible to the node to do the job. Both are changed very
rarely.
In a way users are some kind of database objects.

I understand the choice for trace and repair history, not for
authentication.

I note that the 3.0 documentation suggests 3 to 5 nodes. It was my choice, but a
client told me I was wrong, pointing at the 2.1 documentation...
And it was difficult to explain to experienced classic DBAs that creating a
user and granting rights are so different from creating a table that the
metadata is stored in a different way.
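
For reference, the kind of change being discussed here (a sketch only; the
strategy and data-center names are assumptions to adapt) would be an ALTER
KEYSPACE such as:

ALTER KEYSPACE system_auth
  WITH replication = {'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 3};

One would then typically repair system_auth so existing credentials reach the
new replicas.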



-- 
Jérôme Mainaud
jer...@mainaud.com

2016-05-13 12:13 GMT+02:00 Sam Tunnicliffe <s...@beobal.com>:

> LocalStrategy means that data is not replicated in the usual way and
> remains local to each node. Where it is used, replication is either not
> required (for example in the case of secondary indexes and system.local) or
> happens out of band via some other method (as in the case of schema, or
> system.peers which is populated largely from gossip).
>
> There are several components in Cassandra which generate or persist
> "system" data for which a normal distribution makes sense. Auth data is
> one, tracing, repair history and materialized view status are others. The
> keyspaces for this data generally use SimpleStrategy by default as it is
> guaranteed to work out of the box, regardless of topology.  The intent of
> the advice to configure system_auth with RF=N was to increase the
> likelihood that any read of auth data would be done locally, avoiding
> remote requests where possible. This is somewhat outdated though and not
> really necessary. In fact, the 3.x docs actually suggest "3 to 5 nodes per
> Data Center"[1]
>
> FTR, you can't specify LocalStrategy in a CREATE or ALTER KEYSPACE, for
> these reasons.
>
> [1]
> http://docs.datastax.com/en/cassandra/3.x/cassandra/configuration/secureConfigNativeAuth.htm
>
>
> On Fri, May 13, 2016 at 10:47 AM, Jérôme Mainaud <jer...@mainaud.com>
> wrote:
>
>> Hello,
>>
>> Is there any good reason why system_auth strategy is SimpleStrategy by
>> default instead of LocalStrategy like system and system_schema ?
>>
>> Especially when the documentation advises setting the replication factor to the
>> number of nodes in the cluster, which is both weird and inconvenient to
>> follow.
>>
>> Do you think that changing the strategy to LocalStrategy would work or
>> have undesirable side effects ?
>>
>> Thank you.
>>
>> --
>> Jérôme Mainaud
>> jer...@mainaud.com
>>
>
>


Why simple replication strategy for system_auth ?

2016-05-13 Thread Jérôme Mainaud
Hello,

Is there any good reason why system_auth strategy is SimpleStrategy by
default instead of LocalStrategy like system and system_schema ?

Especially when the documentation advises setting the replication factor to the
number of nodes in the cluster, which is both weird and inconvenient to
follow.

Do you think that changing the strategy to LocalStrategy would work or have
undesirable side effects ?

Thank you.

-- 
Jérôme Mainaud
jer...@mainaud.com


New node block in autobootstrap

2016-08-15 Thread Jérôme Mainaud
Hello,

A client of mine has problems when adding a node to the cluster.
After 4 days, the node is still in joining mode, it doesn't have the same
level of load as the others, and there seems to be no streaming from or to
the new node.

This node has a history.

   1. At the beginning, it was a seed in the cluster.
   2. Ops detected that clients had problems with it.
   3. They tried to reset it but failed. In the process they launched
   several repair and rebuild operations on the node.
   4. Then they asked me to help them.
   5. We stopped the node,
   6. removed it from the list of seeds (more precisely it was replaced by
   another node),
   7. removed it from the cluster (I chose not to use decommission since
   the node's data was compromised),
   8. deleted all files from the data, commitlog and saved_caches directories.
   9. After the leaving process ended, it was started as a fresh new node
   and began autobootstrap.


As I don't have direct access to the cluster I don't have a lot of
information, but I will have more tomorrow (logs and results of some commands),
and I can ask people for any required information.

Does someone have any idea of what could have happened and what I should
investigate first ?
What would you do to unlock the situation ?

Context: The cluster consists of two DC, each with 15 nodes. Average load
is around 3 TB per node. The joining node froze a little after 2 TB.

Thank you for your help.
Cheers,


-- 
Jérôme Mainaud
jer...@mainaud.com


Re: nodetool repair with -pr and -dc

2016-08-19 Thread Jérôme Mainaud
Hi Romain,

Thank you for your answer, I will open a ticket soon.

Best

-- 
Jérôme Mainaud
jer...@mainaud.com

2016-08-19 12:16 GMT+02:00 Romain Hardouin <romainh...@yahoo.fr>:

> Hi Jérôme,
>
> The code in 2.2.6 allows -local and -pr:
> https://github.com/apache/cassandra/blob/cassandra-2.2.
> 6/src/java/org/apache/cassandra/service/StorageService.java#L2899
>
> But... the options validation introduced in CASSANDRA-6455 seems to break
> this feature!
> https://github.com/apache/cassandra/blob/cassandra-2.2.
> 6/src/java/org/apache/cassandra/repair/messages/RepairOption.java#L211
>
> I suggest to open a ticket https://issues.apache.org/
> jira/browse/cassandra/
>
> Best,
>
> Romain
>
>
> Le Vendredi 19 août 2016 11h47, Jérôme Mainaud <jer...@mainaud.com> a
> écrit :
>
>
> Hello,
>
> I've got a repair command with both -pr and -local rejected on a 2.2.6
> cluster.
> The exact command was : nodetool repair --full -par -pr -local -j 4
>
> The message is  “You need to run primary range repair on all nodes in the
> cluster”.
>
> Reading the code and previously cited CASSANDRA-7450, it should have been
> accepted.
>
> Did anyone meet this error before ?
>
> Thanks
>
>
> --
> Jérôme Mainaud
> jer...@mainaud.com
>
> 2016-08-12 1:14 GMT+02:00 kurt Greaves <k...@instaclustr.com>:
>
> -D does not do what you think it does. I've quoted the relevant
> documentation from the README:
>
>
> <https://github.com/BrianGallew/cassandra_range_repair#multiple-datacenters>Multiple
> Datacenters
> If you have multiple datacenters in your ring, then you MUST specify the
> name of the datacenter containing the node you are repairing as part of the
> command-line options (--datacenter=DCNAME). Failure to do so will result in
> only a subset of your data being repaired (approximately
> data/number-of-datacenters). This is because nodetool has no way to
> determine the relevant DC on its own, which in turn means it will use the
> tokens from every ring member in every datacenter.
>
>
>
> On 11 August 2016 at 12:24, Paulo Motta <pauloricard...@gmail.com> wrote:
>
> > if we want to use -pr option ( which i suppose we should to prevent
> duplicate checks) in 2.0 then if we run the repair on all nodes in a single
> DC then it should be sufficient and we should not need to run it on all
> nodes across DC's?
>
> No, because the primary ranges of the nodes in other DCs will be missing
> repair, so you should either run with -pr in all nodes in all DCs, or
> restrict repair to a specific DC with -local (and have duplicate checks).
> Combined -pr and -local are only supported on 2.1
>
>
> 2016-08-11 1:29 GMT-03:00 Anishek Agarwal <anis...@gmail.com>:
>
> ok thanks, so if we want to use -pr option ( which i suppose we should to
> prevent duplicate checks) in 2.0 then if we run the repair on all nodes in
> a single DC then it should be sufficient and we should not need to run it
> on all nodes across DC's ?
>
>
>
> On Wed, Aug 10, 2016 at 5:01 PM, Paulo Motta <pauloricard...@gmail.com>
> wrote:
>
> On 2.0 repair -pr option is not supported together with -local, -hosts or
> -dc, since it assumes you need to repair all nodes in all DCs and it will
> throw an error if you try to run it with nodetool, so perhaps there's
> something wrong with range_repair options parsing.
>
> On 2.1 it was added support to simultaneous -pr and -local options on
> CASSANDRA-7450, so if you need that you can either upgade to 2.1 or
> backport that to 2.0.
>
>
> 2016-08-10 5:20 GMT-03:00 Anishek Agarwal <anis...@gmail.com>:
>
> Hello,
>
> We have 2.0.17 cassandra cluster(*DC1*) with a cross dc setup with a
> smaller cluster (*DC2*). After reading various blogs about
> scheduling/running repairs, it looks like it's good to run it with the following
>
>
> -pr for primary range only
> -st -et for sub ranges
> -par for parallel
> -dc to make sure we can schedule repairs independently on each Data centre
> we have.
>
> i have configured the above using the repair utility @ 
> https://github.com/BrianGallew
> /cassandra_range_repair.git
> <https://github.com/BrianGallew/cassandra_range_repair.git>
>
> which leads to the following command :
>
> ./src/range_repair.py -k [keyspace] -c [columnfamily name] -v -H localhost
> -p -D* DC1*
>
> but looks like the merkle tree is being calculated on nodes which are part
> of other *DC2.*
>
> why does this happen? i thought it should only look at the nodes in local
> cluster. however on nodetool the* -pr* option cannot be used with *-local* 
> according
> to docs @https://docs.datastax.com/en/ cassandra/2.0/cassandra/tools/
> toolsRepair.html
> <https://docs.datastax.com/en/cassandra/2.0/cassandra/tools/toolsRepair.html>
>
> so i am may be missing something, can someone help explain this please.
>
> thanks
> anishek
>
>
>
>
>
>
>
> --
> Kurt Greaves
> k...@instaclustr.com
> www.instaclustr.com
>
>
>
>
>


full and incremental repair consistency

2016-08-19 Thread Jérôme Mainaud
Hello,

I have a 2.2.6 Cassandra cluster with two DC of 15 nodes each.
A continuous incremental repair process deal with anti-entropy concern.

Due to some untraced operation by someone, we chose to do a full repair on
one DC with the command: nodetool repair --full -local -j 4

Daily incremental repair was disabled during this operation.

The significant number of stream sessions produced by this repair session
confirms to me that it was necessary.

However, I wonder if the sstables involved in that repair are flagged or if
the next daily incremental repair will be equivalent to a full repair.

I didn't use the -pr option since -pr and -local are actually mutually
exclusive (whether they should is the subject of another thread). I chose
-local because the link between the datacenter is slow. But maybe choosing
-pr would have been a better choice.

Is there a better way I should have handled this ?

Thank you,

-- 
Jérôme Mainaud
jer...@mainaud.com


Re: nodetool repair with -pr and -dc

2016-08-19 Thread Jérôme Mainaud
Hello,

I've got a repair command with both -pr and -local rejected on a 2.2.6
cluster.
The exact command was : nodetool repair --full -par -pr -local -j 4

The message is  “You need to run primary range repair on all nodes in the
cluster”.

Reading the code and previously cited CASSANDRA-7450, it should have been
accepted.

Did anyone meet this error before ?

Thanks


-- 
Jérôme Mainaud
jer...@mainaud.com

2016-08-12 1:14 GMT+02:00 kurt Greaves <k...@instaclustr.com>:

> -D does not do what you think it does. I've quoted the relevant
> documentation from the README:
>
>>
>> <https://github.com/BrianGallew/cassandra_range_repair#multiple-datacenters>Multiple
>> Datacenters
>>
>> If you have multiple datacenters in your ring, then you MUST specify the
>> name of the datacenter containing the node you are repairing as part of the
>> command-line options (--datacenter=DCNAME). Failure to do so will result in
>> only a subset of your data being repaired (approximately
>> data/number-of-datacenters). This is because nodetool has no way to
>> determine the relevant DC on its own, which in turn means it will use the
>> tokens from every ring member in every datacenter.
>>
>
>
> On 11 August 2016 at 12:24, Paulo Motta <pauloricard...@gmail.com> wrote:
>
>> > if we want to use -pr option ( which i suppose we should to prevent
>> duplicate checks) in 2.0 then if we run the repair on all nodes in a single
>> DC then it should be sufficient and we should not need to run it on all
>> nodes across DC's?
>>
>> No, because the primary ranges of the nodes in other DCs will be missing
>> repair, so you should either run with -pr in all nodes in all DCs, or
>> restrict repair to a specific DC with -local (and have duplicate checks).
>> Combined -pr and -local are only supported on 2.1
>>
>>
>> 2016-08-11 1:29 GMT-03:00 Anishek Agarwal <anis...@gmail.com>:
>>
>>> ok thanks, so if we want to use -pr option ( which i suppose we should
>>> to prevent duplicate checks) in 2.0 then if we run the repair on all nodes
>>> in a single DC then it should be sufficient and we should not need to run
>>> it on all nodes across DC's ?
>>>
>>>
>>>
>>> On Wed, Aug 10, 2016 at 5:01 PM, Paulo Motta <pauloricard...@gmail.com>
>>> wrote:
>>>
>>>> On 2.0 repair -pr option is not supported together with -local, -hosts
>>>> or -dc, since it assumes you need to repair all nodes in all DCs and it
>>>> will throw an error if you try to run it with nodetool, so perhaps there's
>>>> something wrong with range_repair options parsing.
>>>>
>>>> On 2.1 it was added support to simultaneous -pr and -local options on
>>>> CASSANDRA-7450, so if you need that you can either upgade to 2.1 or
>>>> backport that to 2.0.
>>>>
>>>>
>>>> 2016-08-10 5:20 GMT-03:00 Anishek Agarwal <anis...@gmail.com>:
>>>>
>>>>> Hello,
>>>>>
>>>>> We have 2.0.17 cassandra cluster(*DC1*) with a cross dc setup with a
>>>>> smaller cluster (*DC2*). After reading various blogs about
>>>>> scheduling/running repairs, it looks like it's good to run it with the 
>>>>> following
>>>>>
>>>>>
>>>>> -pr for primary range only
>>>>> -st -et for sub ranges
>>>>> -par for parallel
>>>>> -dc to make sure we can schedule repairs independently on each Data
>>>>> centre we have.
>>>>>
>>>>> i have configured the above using the repair utility @
>>>>> https://github.com/BrianGallew/cassandra_range_repair.git
>>>>>
>>>>> which leads to the following command :
>>>>>
>>>>> ./src/range_repair.py -k [keyspace] -c [columnfamily name] -v -H
>>>>> localhost -p -D* DC1*
>>>>>
>>>>> but looks like the merkle tree is being calculated on nodes which are
>>>>> part of other *DC2.*
>>>>>
>>>>> why does this happen? i thought it should only look at the nodes in
>>>>> local cluster. however on nodetool the* -pr* option cannot be used
>>>>> with *-local* according to docs @https://docs.datastax.com/en/
>>>>> cassandra/2.0/cassandra/tools/toolsRepair.html
>>>>>
>>>>> so i am may be missing something, can someone help explain this please.
>>>>>
>>>>> thanks
>>>>> anishek
>>>>>
>>>>
>>>>
>>>
>>
>
>
> --
> Kurt Greaves
> k...@instaclustr.com
> www.instaclustr.com
>


Re: full and incremental repair consistency

2016-08-19 Thread Jérôme Mainaud
It makes sense.

When you say "you need to run a full repair without the -local flag", do
you mean I have to set the -full flag? Or do you mean that the next repair
without arguments will be a full one because the sstables are not flagged?

By the way, I suppose the repaired flag doesn't break sstable file
immutability, so I wonder how it is stored.

-- 
Jérôme Mainaud
jer...@mainaud.com

2016-08-19 15:02 GMT+02:00 Paulo Motta <pauloricard...@gmail.com>:

> Running repair with -local flag does not mark sstables as repaired, since
> you can't guarantee data in other DCs are repaired. In order to support
> incremental repair, you need to run a full repair without the -local flag,
> and then in the next time you run repair, previously repaired sstables are
> skipped.
>
> 2016-08-19 9:55 GMT-03:00 Jérôme Mainaud <jer...@mainaud.com>:
>
>> Hello,
>>
>> I have a 2.2.6 Cassandra cluster with two DC of 15 nodes each.
>> A continuous incremental repair process deal with anti-entropy concern.
>>
>> Due to some untraced operation by someone, we chose to do a full repair
>> on one DC with the command: nodetool repair --full -local -j 4
>>
>> Daily incremental repair was disabled during this operation
>>
>> The significant number of stream sessions produced by this repair session
>> confirms to me that it was necessary.
>>
>> However, I wonder if the sstables involved in that repair are flagged or
>> if the next daily incremental repair will be equivalent to a full repair.
>>
>> I didn't use the -pr option since -pr and -local are actually mutually
>> exclusive (whether they should is the subject of another thread). I chose
>> -local because the link between the datacenter is slow. But maybe choosing
>> -pr would have been a better choice.
>>
>> Is there a better way I should have handled this ?
>>
>> Thank you,
>>
>> --
>> Jérôme Mainaud
>> jer...@mainaud.com
>>
>
>


Re: full and incremental repair consistency

2016-08-19 Thread Jérôme Mainaud
> - Either way, with or without the flag will actually be equivalent when
> none of the sstables are marked as repaired (this will change after the
> first inc repair).
>

So, if I understand correctly, the repair -full -local command resets the flag
on sstables previously repaired. So even if I had some sstables already
flagged, they won't be any more.

- The actual data component is immutable, only a flag in the STATS sstable
> component is mutated.
>

This is an important property I missed. That means that snapshots are
susceptible to mutation, as they are hard links to the actual files.
I must also take care of this if I try to deduplicate files in an external backup
system.


Re: New node block in autobootstrap

2016-08-16 Thread Jérôme Mainaud
Hello Paul,

Thank you for your reply.
The version is 2.2.6.

I received the logs today and can confirm three streams failed after
timeout. We will try to resume the bootstrap as you recommended.

I didn't use -Dreplace_address for two reasons:

   1. Because someone tried to reset the node in some way. Because this person
   is on vacation, nobody really knows what he did. I suppose he just trashed
   the data directory and launched the node again, without -Dreplace_address
   and without removing the node first. I was unsure about how valid the tokens
   were, so I preferred to remove it to go back to a clean situation.
   2. Since the replacing node and the new node have the same endpoint
   address (this is a fresh version of the same node), I was not sure
   replace_address would not be confused.

Since I had time and was not sure that replacing the node would work in my
situation, I chose the slow safe way. Maybe I could have used it.





-- 
Jérôme Mainaud
jer...@mainaud.com

2016-08-15 20:51 GMT+02:00 Paulo Motta <pauloricard...@gmail.com>:

> What version are you on? This seems like a typical case where there was a
> problem with streaming (hanging, etc.). Do you have access to the logs?
> Maybe look for streaming errors? Typically streaming errors are related to
> timeouts, so you should review your cassandra
> streaming_socket_timeout_in_ms and kernel tcp_keepalive settings.
>
> If you're on 2.2+ you can resume a failed bootstrap with nodetool
> bootstrap resume. There were also some streaming hanging problems fixed
> recently, so I'd advise you to upgrade to the latest version of your
> particular series for a more robust version.
>
> Is there any reason why you didn't use the replace procedure
> (-Dreplace_address) to replace the node with the same tokens? This would be
> a bit faster than remove + bootstrap procedure.
>
> 2016-08-15 15:37 GMT-03:00 Jérôme Mainaud <jer...@mainaud.com>:
>
>> Hello,
>>
>> A client of mine has problems when adding a node to the cluster.
>> After 4 days, the node is still in joining mode, it doesn't have the same
>> level of load as the others, and there seems to be no streaming from or to
>> the new node.
>>
>> This node has a history.
>>
>>    1. At the beginning, it was a seed in the cluster.
>>    2. Ops detected that clients had problems with it.
>>    3. They tried to reset it but failed. In the process they launched
>>    several repair and rebuild operations on the node.
>>    4. Then they asked me to help them.
>>    5. We stopped the node,
>>    6. removed it from the list of seeds (more precisely it was replaced
>>    by another node),
>>    7. removed it from the cluster (I chose not to use decommission
>>    since the node's data was compromised),
>>    8. deleted all files from the data, commitlog and saved_caches directories.
>>    9. After the leaving process ended, it was started as a fresh new
>>    node and began autobootstrap.
>>
>>
>> As I don't have direct access to the cluster I don't have a lot of
>> information, but I will have more tomorrow (logs and results of some commands),
>> and I can ask people for any required information.
>>
>> Does someone have any idea of what could have happened and what I should
>> investigate first ?
>> What would you do to unlock the situation ?
>>
>> Context: The cluster consists of two DC, each with 15 nodes. Average load
>> is around 3 TB per node. The joining node froze a little after 2 TB.
>>
>> Thank you for your help.
>> Cheers,
>>
>>
>> --
>> Jérôme Mainaud
>> jer...@mainaud.com
>>
>
>


LCS Increasing the sstable size

2016-08-31 Thread Jérôme Mainaud
Hello,

My cluster uses LeveledCompactionStrategy on rather big nodes (9 TB of disk per
node with a target of 6 TB of data; the 3 remaining TB are reserved for
compaction and snapshots). There is only one table for this application.

With the default sstable_size_in_mb of 160 MB, we have a huge number of
sstables (25,000+ for 4 TB already loaded), which leads to IO errors due to
the open files limit (set at 100,000).

Increasing the open files limit can be a solution, but at this level I would
rather increase the sstable size to 500 MB, which would keep the file
count around 100,000.

Could increasing sstable size lead to any problem I don't see ?
Do you have any advice about this ?

Thank you.

-- 
Jérôme Mainaud
jer...@mainaud.com


Re: LCS Increasing the sstable size

2016-08-31 Thread Jérôme Mainaud
Hello DuyHai,

I have no problem with performance even if I'm using 3 HDDs in RAID 0.
The last 4 years of data were imported in two weeks, which is acceptable for the
client.
Daily data will be much less intensive and my client is more concerned with
storage price than with pure latency.
To be more precise, Cassandra was not chosen for its latency but because it
is a distributed, multi-datacenter, no-downtime database.

Tests show that write amplification is not a problem in our case. So, even if
LCS may not be the best technical choice, it is a reasonably correct one as
long as the number of sstables doesn't explode.

The first thing I looked at was the compaction stats, and there are no compactions
pending.
The size of most sstables is 160 MB, the expected size.
If you do the math, 4 TB divided by 160 MB equals 26,214 sstables. With 8
files per sstable, you get 209,715 files.
With 6 TB, we get 39,321 sstables and 314,572 files.

If I change sstable_size_in_mb to 512, it would end up with 12,288 sstables
and 98,304 files for 6 TB.
That seems to be a good compromise if there is no trap.
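
For what it's worth, the change itself would just be a table property update
(a sketch; the keyspace and table names are placeholders, and existing sstables
only pick up the new size as they are compacted again):

ALTER TABLE my_keyspace.my_table
  WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': 512};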

The only problem I can see would be a latency drop due to the size of the index
and summary.
But the average row size is 70 KB, so there should not be that many entries
per file.

Am I missing something ?


-- 
Jérôme Mainaud
jer...@mainaud.com

2016-08-31 13:28 GMT+02:00 DuyHai Doan <doanduy...@gmail.com>:

> Some random thoughts
>
> 1) Are they using SSD ?
>
> 2) If using SSD, I remember that one recommendation is not to exceed
> ~3Tb/node, unless they're using DateTiered or better TimeWindow compaction
> strategy
>
> 3) LCS is very disk intensive and usually exacerbates write amplification the
> more data you have
>
> 4) The huge number of SSTable let me suspect some issue with compaction
> not keeping up. Can you post here a "nodetool tablestats"  and
> "compactionstats" ? Are there many pending compactions ?
>
> 5) Last but not least, what does "dstat" shows ? Is there any frequent CPU
> wait ?
>
> On Wed, Aug 31, 2016 at 12:34 PM, Jérôme Mainaud <jer...@mainaud.com>
> wrote:
>
>> Hello,
>>
>> My cluster uses LeveledCompactionStrategy on rather big nodes (9 TB disk
>> per node with a target of 6 TB of data and the 3 remaining TB are reserved
>> for compaction and snapshots). There is only one table for this application.
>>
>> With default sstable_size_in_mb at 160 MB, we have a huge number of
>> sstables (25,000+ for 4TB already loaded) which lead to IO errors due to
>> open files limit (set at 100,000).
>>
>> Increasing the open files limit can be a solution but at this level, I
>> would rather increase sstable_size to 500 MB which would keep the file
>> number around 100,000.
>>
>> Could increasing sstable size lead to any problem I don't see ?
>> Do you have any advice about this ?
>>
>> Thank you.
>>
>> --
>> Jérôme Mainaud
>> jer...@mainaud.com
>>
>
>


Partition size estimation formula in 3.0

2016-09-19 Thread Jérôme Mainaud
Hello,

Until 3.0, we had a nice formula to estimate partition size :

  sizeof(partition keys)
+ sizeof(static columns)
+ countof(rows) * sizeof(regular columns)
+ countof(rows) * countof(regular columns) * sizeof(clustering columns)
+ 8 * count(values in partition)

With the 3.0 storage engine, the size is supposed to be smaller.
And I'm looking for the new formula.

I reckon the formula to become :

  sizeof(partition keys)
+ sizeof(static columns)
+ countof(rows) * sizeof(regular columns)
+ countof(rows) * sizeof(clustering columns)
+ 8 * count(values in partition)

That is, the clustering column values are no longer repeated for each regular
column in the row.
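
For instance, with assumed numbers (20-byte partition key, 100 bytes of static
columns, 1,000 rows, 200 bytes of regular columns and 30 bytes of clustering
columns per row, 7 regular columns, 7,000 values in the partition):

  old: 20 + 100 + 1,000 * 200 + 1,000 * 7 * 30 + 8 * 7,000 = 466,120 bytes
  new: 20 + 100 + 1,000 * 200 + 1,000 * 30 + 8 * 7,000 = 286,120 bytes

Only the clustering term changes, from 210,000 bytes down to 30,000 bytes.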

Could anyone confirm that new formula, or am I missing something?

Thank you,

-- 
Jérôme Mainaud
jer...@mainaud.com


Check snapshot / sstable integrity

2017-01-12 Thread Jérôme Mainaud
Hello,

Is there any tool to test the integrity of a snapshot?

Suppose I have a snapshot based backup stored in an external low cost
storage system that I want to restore to a database after someone deleted
important data by mistake.

Before restoring the files, I will truncate the table to remove the
problematic tombstones.

But my Op holds back my arm and asks: "Are you sure that the snapshot is safe
and can be restored, before we truncate the data we have?"

Even if this scenario is theoretical, the question is a good one. How can I verify
that a snapshot is clean?

Thank you,

-- 
Jérôme Mainaud
jer...@mainaud.com


Re: Check snapshot / sstable integrity

2017-01-13 Thread Jérôme Mainaud
Hello Alain,

Thank you for your answer.

Basically, having a tool to check all sstables in a folder using their
checksums would be nice. But in the end I can get the same result using some
shasum tool.
The goal is to verify the integrity of files copied back from an external
backup tool.

The question came up because their backup system corrupted some files in the
past, and they think with their current backup process in mind.
I will insist on the snapshot-on-truncate behaviour that has already saved me,
and on other checks being done by the backup tool, if any is used.

Cheers,


-- 
Jérôme Mainaud
jer...@mainaud.com

2017-01-12 14:05 GMT+01:00 Alain RODRIGUEZ <arodr...@gmail.com>:

> Hi Jérôme,
>
> About this concern:
>
> But my Op retains my arm and asks: "Are you sure that the snapshot is safe
>> and will be restored before truncating data we have?"
>
>
> Make sure to enable snapshot on truncate (cassandra.yaml) or do it
> manually. This way, if the restored dataset is worse than the current one
> (the one you plan to truncate), you can always roll back this truncate /
> restore action. This way you can tell your "Op" that this is perfectly safe
> anyway; no data would be lost, even in the worst-case scenario (not
> considering the downtime that would be induced). Plus this snapshot is
> cheap (hard links) and does not need to be moved around or kept once you are
> sure the old backup fits your need.
>
> Truncate is definitely the way to go before restoring a backup. Parsing
> the data to delete it all is not really an option imho.
>
> Then about the technical question "how to know that a snapshot is clean"
> it would be good to define "clean". You can make sure the backup is
> readable, consistent enough and corresponds to what you want by inserting
> all  the sstables into a testing cluster and performing some reads there
> before doing it in production. You can use for example AWS EC2 machines
> with big EBS attached or whatever and use the sstableloader to load data
> into it.
>
> If you are just worried about SSTable format validity, there is no tool I
> am aware of to check that sstables are well formed, but one might exist or be
> doable. Another option might be to do a checksum on each sstable before
> uploading it elsewhere and make sure it matches when downloaded back.
> Those are the first things that come to my mind.
>
> Hope that is helpful. Hopefully, someone else will be able to point you to
> an existing tool to do this work.
>
> Cheers,
> ---
> Alain Rodriguez - @arodream - al...@thelastpickle.com
> France
>
> The Last Pickle - Apache Cassandra Consulting
> http://www.thelastpickle.com
>
> 2017-01-12 11:33 GMT+01:00 Jérôme Mainaud <jer...@mainaud.com>:
>
>> Hello,
>>
>> Is there any tool to test the integrity of a snapshot?
>>
>> Suppose I have a snapshot based backup stored in an external low cost
>> storage system that I want to restore to a database after someone deleted
>> important data by mistake.
>>
>> Before restoring the files, I will truncate the table to remove the
>> problematic tombstones.
>>
>> But my Op holds back my arm and asks: "Are you sure that the snapshot is
>> safe and can be restored, before we truncate the data we have?"
>>
>> Even if this scenario is theoretical, the question is a good one. How can I verify
>> that a snapshot is clean?
>>
>> Thank you,
>>
>> --
>> Jérôme Mainaud
>> jer...@mainaud.com
>>
>
>