Re: Optimal backup strategy

2019-12-03 Thread Hossein Ghiyasi Mehr
I am sorry, you are right. I forgot the "*not*"!
1. It's *not* recommended to rely on the commit log after a node failure. Cassandra has other options, such as the replication factor, as a substitute solution.

*VafaTech.com - A Total Solution for Data Gathering & Analysis*



Re: Optimal backup strategy

2019-12-02 Thread Adarsh Kumar
Thanks Hossein,

Just one more question: is there any special SOP or consideration we have to take into account for multi-site backups?

Please share any helpful link, blog, or documented steps.

Regards,
Adarsh Kumar


Re: Optimal backup strategy

2019-12-01 Thread Hossein Ghiyasi Mehr
1. It's recommended to use the commit log after a node failure. Cassandra has many options, such as the replication factor, as a substitute solution.
2. Yes, right.

*VafaTech.com - A Total Solution for Data Gathering & Analysis*



Re: Optimal backup strategy

2019-11-28 Thread guo Maxwell
Same topology means the restore nodes should have the same tokens as the backup nodes;
e.g. backup:
   node1(1/2/3/4/5) node2(6/7/8/9/10)
restore:
   nodea(1/2/3/4/5) nodeb(6/7/8/9/10)
so node1's commit log can be replayed on nodea (a token-check sketch follows below).
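
As an illustration of that token check, a minimal sketch follows; it assumes nodetool is on the PATH and that the backup node's tokens were saved to a file (the name ring_tokens.txt is only an example):

    # token_check.py - compare this node's tokens with the tokens saved at backup time.
    # Assumes `nodetool` is on the PATH; the saved-token file name is illustrative only.
    import subprocess

    def local_tokens():
        # `nodetool info -T` prints one "Token : <value>" line per token owned by this node.
        out = subprocess.run(["nodetool", "info", "-T"],
                             capture_output=True, text=True, check=True).stdout
        return sorted(line.split(":", 1)[1].strip()
                      for line in out.splitlines() if line.strip().startswith("Token"))

    def saved_tokens(path="ring_tokens.txt"):
        with open(path) as f:
            return sorted(line.strip() for line in f if line.strip())

    if __name__ == "__main__":
        if local_tokens() == saved_tokens():
            print("Tokens match: commit log replay should be safe on this node.")
        else:
            print("Token mismatch: do not replay the backup node's commit log here.")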


Re: Optimal backup strategy

2019-11-28 Thread Adarsh Kumar
Thanks Guo and Hossein,

So my understanding is:

   1. Commit log backup is not documented for Apache Cassandra, hence not standard. But it can be used for a restore on the same machine (by taking the backup from commit_log_dir). If used on other machine(s), they have to be in the same topology. Can it be used for a replacement node?
   2. For periodic backups, Snapshot + Incremental backup is the best option (see the sketch below).
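
A minimal sketch of one such periodic cycle, assuming incremental_backups: true is already set in cassandra.yaml and that the data directory and backup destination below are placeholders:

    # periodic_backup.py - illustrative snapshot + incremental cycle for a single node.
    # Assumes incremental_backups: true in cassandra.yaml; paths are placeholders.
    import datetime
    import pathlib
    import shutil
    import subprocess

    DATA_DIR = pathlib.Path("/var/lib/cassandra/data")   # default data_file_directories
    DEST = pathlib.Path("/mnt/backup")                    # off-node mount or sync target

    def copy_tree(pattern):
        # Snapshots live under <keyspace>/<table>/snapshots/<tag>/, incrementals under <keyspace>/<table>/backups/.
        for f in DATA_DIR.glob(pattern):
            if f.is_file():
                target = DEST / f.relative_to(DATA_DIR)
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copy2(f, target)

    if __name__ == "__main__":
        tag = datetime.date.today().isoformat()
        # A snapshot flushes memtables and hard-links the current sstables.
        subprocess.run(["nodetool", "snapshot", "-t", tag], check=True)
        copy_tree(f"*/*/snapshots/{tag}/*")
        # Between snapshots, a cron job would copy */*/backups/* the same way, then clear it.
        copy_tree("*/*/backups/*")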


Thanks,
Adarsh Kumar


Re: Optimal backup strategy

2019-11-28 Thread guo Maxwell
Hossein is right. But in our case we restore to the same Cassandra topology, so it is possible to do the replay; restoring to the same machine also works.
Using sstableloader costs too much time and extra storage (though the extra storage shrinks after the restore); a minimal invocation is sketched below.
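
For comparison, this is roughly how an sstableloader-based restore is driven; the host list and the path (which must end in a <keyspace>/<table> directory holding the sstables) are placeholders:

    # sstableloader_restore.py - stream sstables from a backup into a live cluster.
    # Host names and the backup path are placeholders.
    import subprocess

    SSTABLE_DIR = "/mnt/backup/my_keyspace/my_table"   # must be a <keyspace>/<table> directory
    INITIAL_HOSTS = "10.0.0.1,10.0.0.2"                # any live nodes in the target cluster

    subprocess.run(["sstableloader", "-d", INITIAL_HOSTS, SSTABLE_DIR], check=True)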


Re: Optimal backup strategy

2019-11-28 Thread Hossein Ghiyasi Mehr
A commit log backup isn't usable on another machine.
The backup solution depends on what you want to do: a periodic backup, or a backup to restore on another machine?
A periodic backup is a combination of snapshots and incremental backups; remove the incremental backups after each new snapshot (see the sketch below).
For a backup to restore on another machine: you can use a snapshot taken after flushing the memtables, or use sstableloader.
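
A rough illustration of the "remove incremental backups after a new snapshot" step, assuming the default data directory layout (the path is an assumption) and that the latest snapshot has already been copied off the node:

    # rotate_incrementals.py - clear the per-table backups/ directories once a new snapshot exists.
    # Run this only after the newest snapshot (and the backups/ contents) have been copied off-node.
    import pathlib

    DATA_DIR = pathlib.Path("/var/lib/cassandra/data")  # assumption: default data directory

    for backups_dir in DATA_DIR.glob("*/*/backups"):
        for f in backups_dir.iterdir():
            if f.is_file():
                f.unlink()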



VafaTech.com - A Total Solution for Data Gathering & Analysis



Re: Optimal backup strategy

2019-11-27 Thread guo Maxwell
In the Cassandra and DataStax documentation, commit log backup is not mentioned; only snapshots and incremental backups are described.

Although commit log archiving cannot be scoped to a keyspace/table, commit log replay (for which you must put the logs back into commitlog_dir and restart the process) does support a keyspace/table replay filter: use -Dcassandra.replayList with the keyspace1.table1,keyspace1.table2 format to replay only the specified keyspaces/tables (a sketch follows below).

Snapshots do affect storage. In our case we take a snapshot once a week during the low business peak, and snapshot creation is throttled for us; you may want to see the issue (https://issues.apache.org/jira/browse/CASSANDRA-13019).
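
A minimal sketch of that filtered replay, assuming the archived segments were kept somewhere like the directory below and that commitlog_dir is at its default location; the restart command itself is left as a comment because it depends on how Cassandra is installed:

    # replay_filtered.py - stage archived commit log segments for a filtered replay.
    # Both directories are assumptions; adapt them to your install.
    import pathlib
    import shutil

    ARCHIVE_DIR = pathlib.Path("/mnt/backup/commitlog_archive")    # where archived segments were stored
    COMMITLOG_DIR = pathlib.Path("/var/lib/cassandra/commitlog")   # commitlog_dir from cassandra.yaml

    for segment in ARCHIVE_DIR.glob("CommitLog-*.log"):
        shutil.copy2(segment, COMMITLOG_DIR)

    # Then restart the node with the replay filter, e.g. for a tarball install:
    #   cassandra -Dcassandra.replayList=keyspace1.table1,keyspace1.table2
    # For a packaged install, add the same -D flag to the JVM options file before restarting.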




-- 
you are the apple of my eye !


Re: Optimal backup strategy

2019-11-27 Thread Adarsh Kumar
Thanks Guo and Eric for replying,

I have some confusion about commit log backup:

   1. The commit log archival technique (
   https://support.datastax.com/hc/en-us/articles/115001593706-Manual-Backup-and-Restore-with-Point-in-time-and-table-level-restore-
   ) is as good as an incremental backup, as it also captures commit logs after a memtable flush.
   2. If we go for "Snapshot + Incremental backup + Commit log", we have to take the commit log from the commit log directory (is there any SOP for this? a sketch of the archiving configuration follows below). As commit logs are not per table or keyspace, we will have a challenge in restoring selective tables.
   3. Snapshot-based backups are easy to manage and operate due to their simplicity, but they are heavy on storage. Any views on this?
   4. Please share any successful strategy that someone is using in production. We are still in the design phase and want to implement the best solution.
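
On point 2, the archiving itself is driven by conf/commitlog_archiving.properties; a minimal sketch of its contents follows, with the conf path and archive directory being assumptions:

    # write_commitlog_archiving.py - illustrative contents for conf/commitlog_archiving.properties.
    # The conf path and the archive directory are assumptions.
    CONF_PATH = "/etc/cassandra/commitlog_archiving.properties"

    LINES = [
        "# copy every closed segment; %path is the segment file, %name its file name",
        "archive_command=/bin/cp %path /mnt/backup/commitlog_archive/%name",
        "# used on restore: copy archived segments back before replay",
        "restore_command=/bin/cp -f %from %to",
        "restore_directories=/mnt/backup/commitlog_archive",
        "# optional point-in-time cutoff, e.g. restore_point_in_time=2019:11:28 00:00:00",
    ]

    with open(CONF_PATH, "w") as f:
        f.write("\n".join(LINES) + "\n")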

Thanks Eric for sharing link for medusa.

Regards,
Adarsh Kumar



Re: Optimal backup strategy

2019-11-27 Thread guo Maxwell
For me, I think the last one:
 Snapshot + Incremental + commitlog
is the most meaningful way to do backup and restore when you ship the backup somewhere else, like AWS S3 (an upload sketch follows below).

   - Snapshot-based backup // incremental data is not backed up, so you may lose data when restoring to a point in time later than the snapshot;
   - Incremental backups // better than snapshot-only backup, but with insufficient precision: data still in the memtable will be lost;
   - Snapshot + incremental
   - Snapshot + commitlog archival // better precision than incremental backup, but data in commit logs that were not archived (segments not yet closed) will not be restored and will be lost; also, when there are too many logs, replay takes a very long time.

For us, we use snapshot + incremental + commitlog archive. We back up the snapshot data and the incremental data, and the commit logs are backed up as well. But we do not back up log segments whose data has already been flushed to sstables, because that data is covered by the incremental backups.

This way, the data exists mostly as sstables, through the snapshot and incremental backups. The number of log segments stays very small, and log replay will not take much time.
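
Since the strategy hinges on shipping these artifacts somewhere else (S3 in this example), here is a hedged sketch of the upload side; it assumes boto3 is installed and that the bucket, prefix, and local staging directory are placeholders:

    # ship_to_s3.py - upload staged snapshot/incremental/commitlog files to S3.
    # Bucket, prefix and the local directory are placeholders; requires boto3 and AWS credentials.
    import pathlib
    import boto3

    BUCKET = "my-cassandra-backups"
    PREFIX = "node1"
    LOCAL = pathlib.Path("/mnt/backup")

    s3 = boto3.client("s3")
    for f in LOCAL.rglob("*"):
        if f.is_file():
            key = f"{PREFIX}/{f.relative_to(LOCAL)}"
            s3.upload_file(str(f), BUCKET, key)   # one object per sstable or commit log segment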




-- 
you are the apple of my eye !


Re: Optimal backup strategy

2019-11-27 Thread Eric LELEU

Hi,

TheLastPickle & Spotify have released Medusa as a Cassandra backup tool.

See : 
https://thelastpickle.com/blog/2019/11/05/cassandra-medusa-backup-tool-is-open-source.html


Hope this link will help you.

Eric
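
For anyone evaluating it, the loop Medusa automates looks roughly like the following; the backup name is a placeholder and the exact CLI flags should be checked against the Medusa documentation, so treat this as an assumption:

    # medusa_cycle.py - illustrative wrapper around the Medusa CLI (flags are assumptions; check the docs).
    import datetime
    import subprocess

    name = "daily-" + datetime.date.today().isoformat()
    subprocess.run(["medusa", "backup", f"--backup-name={name}"], check=True)  # snapshot + upload to object storage
    subprocess.run(["medusa", "list-backups"], check=True)                     # confirm the new backup is listed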




Optimal backup strategy

2019-11-26 Thread Adarsh Kumar
Hi,

I was looking into backup strategies for Cassandra. After some study I
came to know that there are the following options:

   - Snapshot based backup
   - Incremental backups
   - Snapshot + incremental
   - Snapshot + commitlog archival
   - Snapshot + Incremental + commitlog

Which is the most suitable and feasible approach? Also, which of these is used most?
Please let me know if there is any other option or tool available.

Thanks in advance.

Regards,
Adarsh Kumar


Re: Backup strategy

2016-06-16 Thread Dennis Lovely
A snapshot would flush your memtables to disk, and you could then stream your sstables out. Incremental backups would be the differences that have occurred since your last snapshot, as far as I'm aware. Since it's fairly infeasible to constantly stream out full snapshots (depending on the density of your data on disk), incremental backups are a faster approach to keeping a remote location synced with your sstable changes, which makes it much more likely you can successfully restore to points in time.



Re: Backup strategy

2016-06-16 Thread Dennis Lovely
Periodic snapshots + incremental backups are, I think, pretty good in terms of restoring to a point in time, but you must manage cleaning up your snapshots and incremental backups on your own. I believe tablesnap (https://github.com/JeremyGrosser/tablesnap) is a pretty decent approach to keeping your sstables, per node, synced to a location off of your host (on S3, in fact); I'm not sure how portable it is to other storage services, however. An S3 lifecycle policy that transitions to Glacier would likely be the most cost-effective option for long-term retention (a sketch follows below).
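
A hedged sketch of that lifecycle idea using boto3; the bucket name, prefix, and day counts are placeholders:

    # glacier_lifecycle.py - transition old backup objects to Glacier and expire them later.
    # Bucket, prefix and day counts are placeholders; requires boto3 and AWS credentials.
    import boto3

    s3 = boto3.client("s3")
    s3.put_bucket_lifecycle_configuration(
        Bucket="my-cassandra-backups",
        LifecycleConfiguration={
            "Rules": [{
                "ID": "age-out-sstables",
                "Filter": {"Prefix": "node1/"},
                "Status": "Enabled",
                "Transitions": [{"Days": 30, "StorageClass": "GLACIER"}],
                "Expiration": {"Days": 365},
            }]
        },
    )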



Re: Backup strategy

2016-06-16 Thread Rakesh Kumar
On Thu, Jun 16, 2016 at 7:30 PM, Bhuvan Rawal  wrote:
> 2. Snapshotting : Hardlinks of sstables will get created. This is a very
> fast process and latest data is captured into sstables after flushing
> memtables, snapshots will be created in snapshots directory. But snapshot
> does not provide you the feature to go back to a certain point in time but
> incremental backups give you that feature.

Does that mean that the only point-in-time recovery possible is via incremental backups? In other words, does C* not have a concept of rolling commit logs forward to a point in time (like RDBMSs do)? Please clarify. Thanks.


Re: Backup strategy

2016-06-16 Thread Bhuvan Rawal
Also, if we talk about backup strategies for Cassandra data, there are essentially a couple of strategies that are adopted:

1. Incremental backups: newly flushed sstables are kept in a backups directory and can be shipped to a storage location like AWS Glacier, etc.
2. Snapshotting: hard links of the sstables are created. This is a very fast process, and the latest data is captured into sstables after flushing the memtables; snapshots are created in the snapshots directory. A snapshot does not give you the ability to go back to an arbitrary point in time, but incremental backups give you that feature.

Depending on the use case, you can use 1, 2, or both (see the sketch below for enabling incremental backups).
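
A minimal sketch of turning option 1 on for a running node; nodetool is assumed to be on the PATH, and the setting should also be made permanent in cassandra.yaml:

    # enable_incremental.py - turn on incremental backups on a running node.
    # Also set incremental_backups: true in cassandra.yaml so the setting survives restarts.
    import subprocess

    subprocess.run(["nodetool", "enablebackup"], check=True)   # hard-link each newly flushed sstable into backups/
    subprocess.run(["nodetool", "statusbackup"], check=True)   # prints whether incremental backups are running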

>


Re: Backup strategy

2016-06-16 Thread Bhuvan Rawal
What kind of data are we talking about here? Is it time-series data with infrequent updates and only inserts, or frequently updated data? How frequently is old data read? I ask this because your node size planning and compaction strategy will essentially depend on these.

I have known people to go up to 3-5 TB per node if data is not updated frequently.

Regards,
Bhuvan



Re: Backup strategy

2016-06-16 Thread vasu . nosql
Bhuvan,

Thanks for the info, but actually I'm not looking for a migration strategy, just backup strategy and retention policy best practices.

Thanks,
Vasu



Re: Backup strategy

2016-06-16 Thread Bhuvan Rawal
Hi Vasu,

Planet Cassandra has a documentation page with basic info about migrating to Cassandra from MySQL, covering what to expect and what not to. It can be found here: http://planetcassandra.org/mysql-to-cassandra-migration/

I had a look at this slide deck a while back:
http://www.slideshare.net/planetcassandra/migration-best-practices-from-rdbms-to-cassandra-without-a-hitch
It provides a pretty reliable 4-phase sync strategy, starting from slide 31. The QA session of the talk is informative too: http://www.doanduyhai.com/blog/?p=1757

Best Regards,
Bhuvan



Backup strategy

2016-06-16 Thread vasu . nosql
Hi ,

I'm from the relational world and recently started working on Cassandra. I'm just wondering what the backup best practices are for a DB of around 100 TB with a multi-DC setup.


Thanks,
Vasu

Re: What is your backup strategy for Cassandra?

2015-09-24 Thread Luigi Tagliamonte
Since I'm running on AWS, we wrote a script that takes a snapshot for each column family and syncs it to S3; at the end of the script I also grab the node tokens and store them on S3.
In case of a restore I would use this procedure:
http://docs.datastax.com/en/cassandra/2.0/cassandra/operations/ops_snapshot_restore_new_cluster.html
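
A stripped-down sketch of such a script, assuming the aws CLI and nodetool are installed; the keyspace list, bucket, and paths are placeholders:

    # snapshot_and_sync.py - per-keyspace snapshot, sync to S3, and save this node's tokens.
    # Keyspaces, bucket and paths are placeholders; assumes the aws CLI and nodetool are installed.
    import datetime
    import pathlib
    import subprocess

    KEYSPACES = ["my_keyspace"]
    DATA_DIR = "/var/lib/cassandra/data"
    BUCKET = "s3://my-cassandra-backups/node1"
    tag = datetime.date.today().isoformat()

    for ks in KEYSPACES:
        subprocess.run(["nodetool", "snapshot", "-t", tag, ks], check=True)

    # `aws s3 sync` uploads only files that are new or changed.
    subprocess.run(["aws", "s3", "sync", DATA_DIR, f"{BUCKET}/data",
                    "--exclude", "*", "--include", f"*/*/snapshots/{tag}/*"], check=True)

    tokens = subprocess.run(["nodetool", "info", "-T"],
                            capture_output=True, text=True, check=True).stdout
    pathlib.Path("tokens.txt").write_text(tokens)
    subprocess.run(["aws", "s3", "cp", "tokens.txt", f"{BUCKET}/tokens-{tag}.txt"], check=True)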

On Mon, Sep 21, 2015 at 9:23 PM, Sanjay Baronia <
sanjay.baro...@triliodata.com> wrote:

> John,
>
> Yes the Trilio solution is private and today, it is for Cassandra running
> in Vmware and OpenStack environment. AWS support is on the roadmap. Will
> reach out separately to give you a demo after the summit.
>
> Thanks,
>
> Sanjay
>
> _
>
>
>
> *Sanjay Baronia VP of Product & Solutions Management Trilio Data *(c)
> 508-335-2306
> sanjay.baro...@triliodata.com
>
> [image: Trilio-Business Assurance_300 Pixels] <http://www.triliodata.com/>
>
> *Experience Trilio* *in action*, please *click here
> <i...@triliodata.com?subject=Demo%20Request.>* to request a demo today!
>
>
> From: John Wong <gokoproj...@gmail.com>
> Reply-To: Cassandra Maillist <user@cassandra.apache.org>
> Date: Friday, September 18, 2015 at 8:02 PM
> To: Cassandra Maillist <user@cassandra.apache.org>
> Subject: Re: What is your backup strategy for Cassandra?
>
>
>
> On Fri, Sep 18, 2015 at 3:02 PM, Sanjay Baronia <
> sanjay.baro...@triliodata.com> wrote:
>
>>
>> Will be at the Cassandra summit next week if any of you would like a demo.
>>
>>
>>
>
> Sanjay, is Trilio Data's work private? Unfortunately I will not attend the
> Summit, but maybe Trilio can also talk about this in, say, a Cassandra
> Planet blog post? I'd like to see a demo or get a little more technical. If
> open source would be cool.
>
> I didn't implement our solution, but the current solution is based on full
> snapshot copies to a remote server for storage using rsync (only transfers
> what is needed). On our remote server we have a complete backup of every
> hour, so if you cd into the data directory you can get every node's exact
> moment-in-time data like you are browsing on the actual nodes.
>
> We are an AWS shop so we can further optimize our cost by using EBS
> snapshot so the volume can reduce (currently we provisioned 4000GB which is
> too much). Anyway, s3 we tried, and is an okay solution. The bad thing is
> performance plus ability to quickly go back in time. With EBS I can create
> a dozen volumes from the same snapshot, attach each to my each of my node,
> and cp -r files over.
>
> John
>
>>
>> From: Maciek Sakrejda <mac...@heroku.com>
>> Reply-To: Cassandra Maillist <user@cassandra.apache.org>
>> Date: Friday, September 18, 2015 at 2:09 PM
>> To: Cassandra Maillist <user@cassandra.apache.org>
>> Subject: Re: What is your backup strategy for Cassandra?
>>
>> On Thu, Sep 17, 2015 at 7:46 PM, Marc Tamsky <mtam...@gmail.com> wrote:
>>
>>> This seems like an apt time to quote [1]:
>>>
>>> > Remember that you get 1 point for making a backup and 10,000 points
>>> for restoring one.
>>>
>>> Restoring from backups is my goal.
>>>
>>> The commonly recommended tools (tablesnap, cassandra_snapshotter) all
>>> seem to leave the restore operation as a pretty complicated exercise for
>>> the operator.
>>>
>>> Do any include a working way to restore, on a different host, all of
>>> node X's data from backups to the correct directories, such that the
>>> restored files are in the proper places and the node restart method [2]
>>> "just works"?
>>>
>>
>> As someone getting started with Cassandra, I'm very much interested in
> this as well. It seems that, for the most part, folks rely on replication
> and node replacement to recover from failures, and perhaps this is a
> testament to how well this works, but as long as we're hauling out
> aphorisms, "RAID is not a backup" seems to (partially) apply here too.
>>
>> I'd love to hear more about how the community does restores, too. This
>> isn't complaining about shoddy tooling: this is trying to understand--and
>> hopefully, in time, improve--the status quo re: disaster recovery. E.g.,
>> given that tableslurp operates on a single table at a time, do people
>> normally just restore single tables? Is that used when there's filesystem
>> or disk corruption? Bugs? Other issues? Looking forward to learning more.
>>
>> Thanks,
>> Maciek
>>
>
>


-- 
Luigi
---
“The only way to get smarter is by playing a smarter opponent.”


Re: What is your backup strategy for Cassandra?

2015-09-21 Thread Sanjay Baronia
John,

Yes, the Trilio solution is private and today it is for Cassandra running in
VMware and OpenStack environments. AWS support is on the roadmap. I will reach
out separately to give you a demo after the summit.

Thanks,

Sanjay
_
Sanjay Baronia
VP of Product & Solutions Management
Trilio Data
(c) 508-335-2306
sanjay.baro...@triliodata.com


From: John Wong <gokoproj...@gmail.com>
Reply-To: Cassandra Maillist <user@cassandra.apache.org>
Date: Friday, September 18, 2015 at 8:02 PM
To: Cassandra Maillist <user@cassandra.apache.org>
Subject: Re: What is your backup strategy for Cassandra?



On Fri, Sep 18, 2015 at 3:02 PM, Sanjay Baronia
<sanjay.baro...@triliodata.com> wrote:

Will be at the Cassandra summit next week if any of you would like a demo.


Sanjay, is Trilio Data's work private? Unfortunately I will not attend the
Summit, but maybe Trilio can also talk about this in, say, a Cassandra Planet
blog post? I'd like to see a demo or get a little more technical. If it were
open source, that would be cool.

I didn't implement our solution, but the current solution is based on full
snapshot copies to a remote server for storage using rsync (which only transfers
what is needed). On our remote server we have a complete backup of every hour, so
if you cd into the data directory you can get every node's exact moment-in-time
data, as if you were browsing the actual nodes.

We are an AWS shop, so we can further optimize our cost by using EBS snapshots so
the volume size can be reduced (currently we provisioned 4000GB, which is too
much). Anyway, we tried S3, and it is an okay solution. The downsides are
performance and the ability to quickly go back in time. With EBS I can create a
dozen volumes from the same snapshot, attach each one to each of my nodes, and
cp -r the files over.

John

From: Maciek Sakrejda <mac...@heroku.com>
Reply-To: Cassandra Maillist <user@cassandra.apache.org>
Date: Friday, September 18, 2015 at 2:09 PM
To: Cassandra Maillist <user@cassandra.apache.org>
Subject: Re: What is your backup strategy for Cassandra?

On Thu, Sep 17, 2015 at 7:46 PM, Marc Tamsky <mtam...@gmail.com> wrote:
This seems like an apt time to quote [1]:

> Remember that you get 1 point for making a backup and 10,000 points for 
> restoring one.

Restoring from backups is my goal.

The commonly recommended tools (tablesnap, cassandra_snapshotter) all seem to 
leave the restore operation as a pretty complicated exercise for the operator.

Do any include a working way to restore, on a different host, all of node X's 
data from backups to the correct directories, such that the restored files are 
in the proper places and the node restart method [2] "just works"?

As someone getting started with Cassandra, I'm very much interested in this as
well. It seems that, for the most part, folks rely on replication and node
replacement to recover from failures, and perhaps this is a testament to how
well this works, but as long as we're hauling out aphorisms, "RAID is not a
backup" seems to (partially) apply here too.

I'd love to hear more about how the community does restores, too. This isn't 
complaining about shoddy tooling: this is trying to understand--and hopefully, 
in time, improve--the status quo re: disaster recovery. E.g., given that 
tableslurp operates on a single table at a time, do people normally just 
restore single tables? Is that used when there's filesystem or disk corruption? 
Bugs? Other issues? Looking forward to learning more.

Thanks,
Maciek



Re: What is your backup strategy for Cassandra?

2015-09-18 Thread Maciek Sakrejda
On Thu, Sep 17, 2015 at 7:46 PM, Marc Tamsky  wrote:

> This seems like an apt time to quote [1]:
>
> > Remember that you get 1 point for making a backup and 10,000 points for
> restoring one.
>
> Restoring from backups is my goal.
>
> The commonly recommended tools (tablesnap, cassandra_snapshotter) all seem
> to leave the restore operation as a pretty complicated exercise for the
> operator.
>
> Do any include a working way to restore, on a different host, all of node
> X's data from backups to the correct directories, such that the restored
> files are in the proper places and the node restart method [2] "just works"?
>

As someone getting started with Cassandra, I'm very much interested in this
as well. It seems that, for the most part, folks rely on replication and node
replacement to recover from failures, and perhaps this is a testament to how
well this works, but as long as we're hauling out aphorisms, "RAID is not a
backup" seems to (partially) apply here too.

I'd love to hear more about how the community does restores, too. This
isn't complaining about shoddy tooling: this is trying to understand--and
hopefully, in time, improve--the status quo re: disaster recovery. E.g.,
given that tableslurp operates on a single table at a time, do people
normally just restore single tables? Is that used when there's filesystem
or disk corruption? Bugs? Other issues? Looking forward to learning more.

Thanks,
Maciek


Re: What is your backup strategy for Cassandra?

2015-09-18 Thread John Wong
On Fri, Sep 18, 2015 at 3:02 PM, Sanjay Baronia <
sanjay.baro...@triliodata.com> wrote:

>
> Will be at the Cassandra summit next week if any of you would like a demo.
>
>
>

Sanjay, is Trilio Data's work private? Unfortunately I will not attend the
Summit, but maybe Trilio can also talk about this in, say, a Cassandra
Planet blog post? I'd like to see a demo or get a little more technical. If
it were open source, that would be cool.

I didn't implement our solution, but the current solution is based on full
snapshot copies to a remote server for storage using rsync (which only
transfers what is needed). On our remote server we have a complete backup of
every hour, so if you cd into the data directory you can get every node's
exact moment-in-time data, as if you were browsing the actual nodes.
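
A minimal sketch of that kind of hourly pull, run from the backup server and
assuming it has SSH access to each node; host names and paths are placeholders,
and --link-dest is just one way to make each hour look like a full copy while
only storing changed files:

#!/bin/bash
# Hypothetical hourly job on the backup host: pull each node's data directory.
# rsync only transfers new/changed files; --link-dest hard-links unchanged files
# against the previous hour, so every hour still looks like a full copy.
set -euo pipefail
NODES="cass1 cass2 cass3"                        # placeholder host names
STAMP="$(date +%Y%m%d%H)"
for node in $NODES; do
    prev=$(ls -d "/backups/$node"/*/ 2>/dev/null | sort | tail -n 1 || true)
    dest="/backups/$node/$STAMP"
    mkdir -p "$dest"
    rsync -a --delete ${prev:+--link-dest="$prev"} \
        "$node:/var/lib/cassandra/data/" "$dest/"
done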

We are an AWS shop, so we can further optimize our cost by using EBS
snapshots so the volume size can be reduced (currently we provisioned 4000GB,
which is too much). Anyway, we tried S3, and it is an okay solution. The
downsides are performance and the ability to quickly go back in time. With
EBS I can create a dozen volumes from the same snapshot, attach each one to
each of my nodes, and cp -r the files over.
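
As a rough illustration of that restore path with the AWS CLI (the snapshot ID,
volume ID, instance ID, zone, and device name below are placeholders; wait for
the new volume to become "available" before attaching):

# Hypothetical restore of one node from an EBS snapshot of the backup volume.
aws ec2 create-volume --snapshot-id snap-0123456789abcdef0 \
    --availability-zone us-east-1a --volume-type gp2
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0123456789abcdef0 --device /dev/xvdf
# on the node: mount the restored volume and copy the SSTables into place
sudo mount /dev/xvdf /mnt/restore
sudo cp -r /mnt/restore/data/* /var/lib/cassandra/data/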

John

>
> From: Maciek Sakrejda <mac...@heroku.com>
> Reply-To: Cassandra Maillist <user@cassandra.apache.org>
> Date: Friday, September 18, 2015 at 2:09 PM
> To: Cassandra Maillist <user@cassandra.apache.org>
> Subject: Re: What is your backup strategy for Cassandra?
>
> On Thu, Sep 17, 2015 at 7:46 PM, Marc Tamsky <mtam...@gmail.com> wrote:
>
>> This seems like an apt time to quote [1]:
>>
>> > Remember that you get 1 point for making a backup and 10,000 points for
>> restoring one.
>>
>> Restoring from backups is my goal.
>>
>> The commonly recommended tools (tablesnap, cassandra_snapshotter) all
>> seem to leave the restore operation as a pretty complicated exercise for
>> the operator.
>>
>> Do any include a working way to restore, on a different host, all of node
>> X's data from backups to the correct directories, such that the restored
>> files are in the proper places and the node restart method [2] "just works"?
>>
>
> As someone getting started with Cassandra, I'm very much interested in
> this as well. It seems that, for the most part, folks rely on replication
> and node replacement to recover from failures, and perhaps this is a
> testament to how well this works, but as long as we're hauling out
> aphorisms, "RAID is not a backup" seems to (partially) apply here too.
>
> I'd love to hear more about how the community does restores, too. This
> isn't complaining about shoddy tooling: this is trying to understand--and
> hopefully, in time, improve--the status quo re: disaster recovery. E.g.,
> given that tableslurp operates on a single table at a time, do people
> normally just restore single tables? Is that used when there's filesystem
> or disk corruption? Bugs? Other issues? Looking forward to learning more.
>
> Thanks,
> Maciek
>


Re: What is your backup strategy for Cassandra?

2015-09-18 Thread Sanjay Baronia
Trilio Data provides an elegant backup and recovery solution for scale-out
Cassandra in VMware & OpenStack environments, with key highlights as follows:
- Discovers topology changes for accurate point-in-time backups
- Speeds recovery by an order of magnitude as it takes an environmental and
cluster-wide snapshot
- Eliminates maintenance of inherently error-prone script-based backups

Will be at the Cassandra summit next week if any of you would like a demo.

Regards,

Sanjay
508-335-2306
_
Sanjay Baronia
VP of Product & Solutions Management
Trilio Data
(c) 508-335-2306
sanjay.baro...@triliodata.com


From: Maciek Sakrejda <mac...@heroku.com>
Reply-To: Cassandra Maillist <user@cassandra.apache.org>
Date: Friday, September 18, 2015 at 2:09 PM
To: Cassandra Maillist <user@cassandra.apache.org>
Subject: Re: What is your backup strategy for Cassandra?

On Thu, Sep 17, 2015 at 7:46 PM, Marc Tamsky <mtam...@gmail.com> wrote:
This seems like an apt time to quote [1]:

> Remember that you get 1 point for making a backup and 10,000 points for 
> restoring one.

Restoring from backups is my goal.

The commonly recommended tools (tablesnap, cassandra_snapshotter) all seem to 
leave the restore operation as a pretty complicated exercise for the operator.

Do any include a working way to restore, on a different host, all of node X's 
data from backups to the correct directories, such that the restored files are 
in the proper places and the node restart method [2] "just works"?

As someone getting started with Cassandra, I'm very much interested in this as
well. It seems that, for the most part, folks rely on replication and node
replacement to recover from failures, and perhaps this is a testament to how
well this works, but as long as we're hauling out aphorisms, "RAID is not a
backup" seems to (partially) apply here too.

I'd love to hear more about how the community does restores, too. This isn't 
complaining about shoddy tooling: this is trying to understand--and hopefully, 
in time, improve--the status quo re: disaster recovery. E.g., given that 
tableslurp operates on a single table at a time, do people normally just 
restore single tables? Is that used when there's filesystem or disk corruption? 
Bugs? Other issues? Looking forward to learning more.

Thanks,
Maciek


Re: What is your backup strategy for Cassandra?

2015-09-17 Thread Marc Tamsky
This seems like an apt time to quote [1]:

> Remember that you get 1 point for making a backup and 10,000 points for
restoring one.

Restoring from backups is my goal.

The commonly recommended tools (tablesnap, cassandra_snapshotter) all seem
to leave the restore operation as a pretty complicated exercise for the
operator.

Do any include a working way to restore, on a different host, all of node
X's data from backups to the correct directories, such that the restored
files are in the proper places and the node restart method [2] "just works"?


On Thu, Sep 17, 2015 at 6:47 PM, Robert Coli  wrote:

> tl;dr - tablesnap works. There are awkward aspects to its use, but if you
> are operating Cassandra in AWS it's probably the best off the shelf
> off-node backup.
>

Have folks here ever used tableslurp to restore a backup taken with
tablesnap?
How would you rate the difficulty of restore?

From my limited testing, tableslurp looks like it can only restore a single
table within a keyspace per execution.

I have hundreds of tables... so without automation around tableslurp, that
doesn't seem like a reliable path toward a full restore.

Perhaps someone has written a tool that drives tableslurp so it "just
works"?


[1] http://serverfault.com/a/277092/218999

[2]
http://docs.datastax.com/en/cassandra/1.2/cassandra/operations/ops_backup_noderestart_t.html


Re: What is your backup strategy for Cassandra?

2015-09-09 Thread Robert Coli
On Sun, Sep 6, 2015 at 12:32 AM, Gene  wrote:

> I've seen quite a few blog posts here and there about various back up
> strategies.  I'm wondering if anyone on this list would be willing to share
> theirs.
>

https://github.com/JeremyGrosser/tablesnap


> Things I'm curious about:
>
> 1. Data size
>

Up to hundreds of gigs per node.


> 2. Frequency for full snapshots
>

Never/always (depends on your perspective).


> 3. Frequency for copying snapshots off of the Cassandra nodes
>

As SSTables are flushed.


> 4. Do you use the incremental backups feature
>

No.


> 5. Do you use commitlog archiving
>

No.


> 6. What method you use to copy data off of the cluster (e.g. NFS, rsync,
> rsync+ssh, etc)
>

S3 upload.


> 7. Do you compress your backups, if so how soon (e.g. compress backups
> older than N days)
>

My SSTables are already snappy compressed, so I am skeptical of benefit
from re-compression.


> 8. Do you use any Off the Shelf scripts for your backups (e.g. tablesnap,
> cassandra_snapshotter, etc)
>

tablesnap


> 9. Do you utilise AWS for your backups, or do you keep it local (or
> offsite on your own hardware)
>

AWS.

tl;dr - tablesnap works. There are awkward aspects to its use, but if you
are operating Cassandra in AWS it's probably the best off the shelf
off-node backup.


What is your backup strategy for Cassandra?

2015-09-06 Thread Gene
Hello everyone,

I'm new to this mailing list, and still fairly new to Cassandra.  I'm a
systems administrator and have had a 3-node Cassandra cluster with a
replication factor of 3 running in Production for about a year now.  We
have about 200 GB of data per node currently.

Up until recently I have just been performing snapshots and clearing them
out as needed.  I recently implemented an automated process to perform
snapshots of our data and copy them off of our cluster via rsync+ssh.
Pretty soon I'll also be utilising the incremental backup feature for
sstables (cassandra.yaml:incremental_backups), and will be taking a look at
archiving for commitlog as well (commitlog_archiving.properties).
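
For illustration, a nightly job along those lines might look like the sketch
below; the backup host, paths, and tag are placeholders, and the incremental
SSTable backups (incremental_backups: true in cassandra.yaml) would be shipped
and pruned by a similar step that is not shown:

#!/bin/bash
# Hypothetical nightly cron job on each node: snapshot, push off-cluster, clean up.
set -euo pipefail
TAG="nightly-$(date +%Y%m%d)"
REMOTE_BASE="/backups/$(hostname)/$TAG"          # path on the backup host

nodetool snapshot -t "$TAG"                      # snapshot all keyspaces
find /var/lib/cassandra/data -type d -path "*/snapshots/$TAG" | while read -r dir; do
    rel="${dir#/var/lib/cassandra/data/}"
    ssh backup-host "mkdir -p '$REMOTE_BASE/$rel'"
    rsync -az -e ssh "$dir/" "backup-host:$REMOTE_BASE/$rel/"
done
nodetool clearsnapshot -t "$TAG"                 # reclaim local disk space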

I've seen quite a few blog posts here and there about various back up
strategies.  I'm wondering if anyone on this list would be willing to share
theirs.

Things I'm curious about:

1. Data size
2. Frequency for full snapshots
3. Frequency for copying snapshots off of the Cassandra nodes
4. Do you use the incremental backups feature
5. Do you use commitlog archiving
6. What method you use to copy data off of the cluster (e.g. NFS, rsync,
rsync+ssh, etc)
7. Do you compress your backups, if so how soon (e.g. compress backups
older than N days)
8. Do you use any Off the Shelf scripts for your backups (e.g. tablesnap,
cassandra_snapshotter, etc)
9. Do you utilise AWS for your backups, or do you keep it local (or offsite
on your own hardware)
10. Anything else you'd like to add, especially if I missed something
important

I'm not asking for the best, perfect method for Cassandra backups. I'd just
like to see what others are doing and hopefully use some ideas to improve
our processes.

Thanks in advance for any responses, and sorry for the wall of text.

-Gene


Re: Backup strategy

2013-11-07 Thread Sridhar Chellappa
Yes. I am taking a snapshot and then offloading the full data into S3. How
will tablesnap help?


On Wed, Nov 6, 2013 at 6:57 AM, Robert Coli rc...@eventbrite.com wrote:

 On Tue, Nov 5, 2013 at 4:36 PM, Sridhar Chellappa
 schellap2...@gmail.com wrote:


1. If not, what is the right backup strategy?

 You didn't specify, but it sounds like you are doing a snapshot and then
 a full offhost backup of the sstables?

 Perhaps instead of point in time full backups, you could continuously back
 up with something like tablesnap [1]? tablesnap watches the datadir for
 newly created SSTables and then writes the file and a manifest containing
 which other files were also in the directory at the time. These two pieces
 of information allow you to restore to the point in time when any given
 (immutable) data file was created.

 =Rob

 [1] https://github.com/synack/tablesnap



Re: Backup strategy

2013-11-07 Thread Robert Coli
On Thu, Nov 7, 2013 at 6:28 AM, Sridhar Chellappa schellap2...@gmail.com wrote:

 Yes. I am taking a Snapshot and then offloading the full data into S3.
  How will Table Snap help?


As I detailed in my previous mail:

1) incremental-style backup, instead of snapshot + full
2) tracks meta information about backup sets
3) comes with tools to expire old sets
4) also comes with a restore tool to restore sets.

More detail at :

https://github.com/synack/tablesnap

=Rob


Re: Backup strategy

2013-11-07 Thread Dan Simpson
Thanks for sharing tablesnap.  It's just what I have been looking for.


On Thu, Nov 7, 2013 at 5:10 PM, Robert Coli rc...@eventbrite.com wrote:

 On Thu, Nov 7, 2013 at 6:28 AM, Sridhar Chellappa 
 schellap2...@gmail.com wrote:

 Yes. I am taking a Snapshot and then offloading the full data into S3.
  How will Table Snap help?


 As I detailed in my previous mail :

 1) incremental style backup, instead of snapshot + full
 2) tracks meta information about backup sets
 3) comes with tools to expire old sets
 4) also comes with restore tool to restore sets.

 More detail at :

 https://github.com/synack/tablesnap

 =Rob




Backup strategy

2013-11-05 Thread Sridhar Chellappa
We are running into problems where backup jobs are taking a huge amount of
bandwidth away from the C* nodes. While we could schedule a particular time
window for backups alone, the request pattern is so random that there is no
particular time when we can schedule backups periodically.

My current thinking is to run backups against a replica that does not serve
requests. Questions:

1. Is it the right strategy?
2. If it is - how do I pull a replica out from serving requests?
3. If not, what is the right backup strategy?


Re: Backup strategy

2013-11-05 Thread Aaron Turner
Why not just have a small DC/ring of nodes which you just do your
snapshots/backups from?

I wouldn't take nodes offline from the ring just to back them up.

The other option is to add sufficient nodes to handle your existing
request I/O + backups. Sounds like you might already be I/O
constrained.
--
Aaron Turner
http://synfin.net/ Twitter: @synfinatic
https://github.com/synfinatic/tcpreplay - Pcap editing and replay
tools for Unix & Windows
Those who would give up essential Liberty, to purchase a little temporary
Safety, deserve neither Liberty nor Safety.
-- Benjamin Franklin


On Tue, Nov 5, 2013 at 4:36 PM, Sridhar Chellappa
schellap2...@gmail.com wrote:
 We are running into problems where Backup jobs are taking away a huge
 bandwidth out of the C* nodes. While we can schedule a particular timing
 window for backups alone, the request pattern is so random; there is no
 particular time where we can schedule backups, periodically.

 My current thinking is to run backups against a replica that does not serve
 requests. Questions:

 Is it the right strategy?
 if it is - how do I pull a replica out from serving requests ?
 If not, what is the right  backup strategy ?


Re: Backup strategy

2013-11-05 Thread Ray Sutton
I don't understand how the creation of a snapshot causes any load
whatsoever. By definition, a snapshot is a hard link to an existing
SSTable. The SSTable is not being physically copied, so there is no disk
I/O; it's just a reference to an inode.
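
To make the hard-link point concrete, a quick illustration (keyspace, table, and
path names are placeholders; the first column of ls -li is the inode number,
which both paths share):

nodetool snapshot -t demo mykeyspace
# Compare inode numbers: the snapshot entry points at the same inode as the
# live SSTable, so no data blocks were copied.
ls -li /var/lib/cassandra/data/mykeyspace/mytable*/*-Data.db
ls -li /var/lib/cassandra/data/mykeyspace/mytable*/snapshots/demo/*-Data.db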

--
Ray  //o-o\\



On Tue, Nov 5, 2013 at 8:09 PM, Aaron Turner synfina...@gmail.com wrote:

 Why not just have a small DC/ring of nodes which you just do your
 snapshots/backups from?

 I wouldn't take nodes offline from the ring just to back them up.

 The other option is to add sufficient nodes to handle your existing
 request I/O + backups.  Sounds like you might be already I/O
 constrained.
 --
 Aaron Turner
 http://synfin.net/ Twitter: @synfinatic
 https://github.com/synfinatic/tcpreplay - Pcap editing and replay
 tools for Unix  Windows
 Those who would give up essential Liberty, to purchase a little temporary
 Safety, deserve neither Liberty nor Safety.
 -- Benjamin Franklin


 On Tue, Nov 5, 2013 at 4:36 PM, Sridhar Chellappa
 schellap2...@gmail.com wrote:
  We are running into problems where Backup jobs are taking away a huge
  bandwidth out of the C* nodes. While we can schedule a particular timing
  window for backups alone, the request pattern is so random; there is no
  particular time where we can schedule backups, periodically.
 
  My current thinking is to run backups against a replica that does not
 serve
  requests. Questions:
 
  Is it the right strategy?
  if it is - how do I pull a replica out from serving requests ?
  If not, what is the right  backup strategy ?



Re: Backup strategy

2013-11-05 Thread Robert Coli
On Tue, Nov 5, 2013 at 4:36 PM, Sridhar Chellappa
schellap2...@gmail.com wrote:


1. If not, what is the right backup strategy?

 You didn't specify, but it sounds like you are doing a snapshot and then a
full offhost backup of the sstables?

Perhaps instead of point in time full backups, you could continuously back
up with something like tablesnap [1]? tablesnap watches the datadir for
newly created SSTables and then writes the file and a manifest containing
which other files were also in the directory at the time. These two pieces
of information allow you to restore to the point in time when any given
(immutable) data file was created.

=Rob

[1] https://github.com/synack/tablesnap


Re: backup strategy

2013-05-09 Thread aaron morton
Assuming you are using SimpleStrategy, or NetworkTopologyStrategy with one rack
per DC: if you backed up every 2nd node you would get one copy *if* all nodes
were consistent on disk. That can be a reasonably large "if" that you need to
monitor.

It's easier to back up all the nodes; it will also make it easier to restore the
cluster.

Cheers

-
Aaron Morton
Freelance Cassandra Consultant
New Zealand

@aaronmorton
http://www.thelastpickle.com

On 8/05/2013, at 8:54 AM, Kanwar Sangha kan...@mavenir.com wrote:

 Hi – If we have RF=2 in a 4-node cluster, how do we ensure that the backup
 taken is only for 1 copy of the data? In other words, is it possible for us
 to take a backup from only 2 nodes and not all 4 and still have at least 1
 copy of the data?
  
 Thanks,
 Kanwar
  
  
  



backup strategy

2013-05-07 Thread Kanwar Sangha
Hi - If we have RF=2 in a 4-node cluster, how do we ensure that the backup
taken is only for 1 copy of the data? In other words, is it possible for us to
take a backup from only 2 nodes and not all 4 and still have at least 1 copy of
the data?

Thanks,
Kanwar





Re: Backup Strategy

2010-11-12 Thread Rob Coli

On 11/9/10 5:15 AM, Wayne wrote:

We are trying to use snapshots etc. to back up the data
but it is slow (hours) and slows down the entire node.


The snapshot process (as I understand it, and with the caveat that this
is the code path without JNA available) first flushes all memtables
(this can take a while, and can trigger minor compaction) and then does
the following per SSTable:

a) fork a process (this can take a while depending on heap size)
b) ln /path/to/SSTable-etc.db /path/to/snapshot

In general this process should not take hours. Are you perhaps in a
case where you have a very large number of SSTable files in a dir and
are not using JNA? I have seen snapshots lag in those circumstances, but
those circumstances were usually pathological.
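
A quick way to check whether a node is in that situation (default data path
shown; adjust to your data_file_directories):

# Total SSTables on the node, then the directories holding the most of them;
# thousands of Data.db files in a single directory is the pathological case.
find /var/lib/cassandra/data -name '*-Data.db' | wc -l
find /var/lib/cassandra/data -name '*-Data.db' \
    | sed 's|/[^/]*$||' | sort | uniq -c | sort -rn | head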


=Rob


Backup Strategy

2010-11-09 Thread Wayne
I got some very good advice on manual compaction so I thought I would throw
out another question on raid/backup strategies for production clusters.

We are debating going with raid 0 vs. raid 10 on our nodes for data storage.
Currently all storage we use is raid 10 as drives always fail and raid 10
basically makes a drive failure a non event. With Cassandra and a
replication factor of 3 we start thinking that maybe raid 0 is good enough.
Also since we are buying a lot more inexpensive servers raid 0 just seems to
hit that price point a lot more.

The problem now becomes how do we deal with the drives that WILL fail in a
raid 0 node? We are trying to use snapshots etc. to back up the data but it
is slow (hours) and slows down the entire node. We assume this will work if
we backup every 2 days at the least in that hinted handoff/reads could help
bring the data back into sync. If we can not backup every 1-2 days then we
are stuck with nodetool repair, decommission, etc. and using some of
Cassandra's build in capabilities but here things become more out of our
control and we are afraid to trust it. Like many in recent posts we have
been less than successful in testing this out in the .6.x branch.

Can anyone share their decisions for the same and how they managed to deal
with these issues? Coming from the relational world raid 10 has been an
assumption for years, and we are not sure whether this assumption should
be dropped or held on to. Our nodes in dev are currently around 500Gb so for
us the question is how can we restore a node with this amount of data and
how long will it take? Drives can and will fail, how can we make recovery a
non event? What is our total recovery time window? We want it to be in hours
after drive replacement (which will be in minutes).

Thanks.

Wayne


Re: Backup Strategy

2010-11-09 Thread Edward Capriolo
On Tue, Nov 9, 2010 at 8:15 AM, Wayne wav...@gmail.com wrote:
 I got some very good advice on manual compaction so I thought I would throw
 out another question on raid/backup strategies for production clusters.

 We are debating going with raid 0 vs. raid 10 on our nodes for data storage.
 Currently all storage we use is raid 10 as drives always fail and raid 10
 basically makes a drive failure a non event. With Cassandra and a
 replication factor of 3 we start thinking that maybe raid 0 is good enough.
 Also since we are buying a lot more inexpensive servers raid 0 just seems to
 hit that price point a lot more.

 The problem now becomes how do we deal with the drives that WILL fail in a
 raid 0 node? We are trying to use snapshots etc. to back up the data but it
 is slow (hours) and slows down the entire node. We assume this will work if
 we backup every 2 days at the least in that hinted handoff/reads could help
 bring the data back into sync. If we can not backup every 1-2 days then we
 are stuck with nodetool repair, decommission, etc. and using some of
 Cassandra's build in capabilities but here things become more out of our
 control and we are afraid to trust it. Like many in recent posts we have
 been less than successful in testing this out in the .6.x branch.

 Can anyone share their decisions for the same and how they managed to deal
 with these issues? Coming from the relational world raid 10 has been an
 assumption for years, and we are not sure whether this assumption should
 be dropped or held on to. Our nodes in dev are currently around 500Gb so for
 us the question is how can we restore a node with this amount of data and
 how long will it take? Drives can and will fail, how can we make recovery a
 non event? What is our total recovery time window? We want it to be in hours
 after drive replacement (which will be in minutes).

 Thanks.

 Wayne


Wayne,

We were more worried about a DR scenario.

Since SSTables are write-once, they make good candidates for
incremental and/or differential backups. One option is to run
Cassandra snapshots and do incremental backups on that directory.

We are doing something somewhat cool that I wanted to share. I hacked
together an application that is something like cassandra/hadoop/rsync.
Essentially, it takes the SSTables from each node that are not in Hadoop
and copies them there, and writes an index file of which SSTables lived on
that node at the time of the snapshot. This gives us a couple of days of
retention as well.

Snapshots X times daily and off cluster once a day. Makes me feel
safer about our RAID-0
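
A rough, hypothetical sketch of that approach, assuming an HDFS cluster is
reachable from the node and using placeholder paths (this is an illustration of
the idea, not Edward's actual tool):

#!/bin/bash
# Copy SSTable components that are not yet in HDFS, then write a manifest of
# every SSTable file present on this node right now.
set -euo pipefail
DEST="/backups/$(hostname)"                     # placeholder HDFS directory
STAMP="$(date +%Y%m%d%H%M)"

hadoop fs -mkdir -p "$DEST"
cd /var/lib/cassandra/data
find . -name '*.db' | sed 's|^\./||' | while read -r f; do
    # SSTables are immutable, so anything already uploaded never changes.
    if ! hadoop fs -test -e "$DEST/$f"; then
        hadoop fs -mkdir -p "$DEST/$(dirname "$f")"
        hadoop fs -put "$f" "$DEST/$f"
    fi
done
find . -name '*.db' | sed 's|^\./||' > "/tmp/manifest-$STAMP.txt"
hadoop fs -put "/tmp/manifest-$STAMP.txt" "$DEST/manifest-$STAMP.txt"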

I have seen you mention in two threads that you are looking to do
500GB/node. You have brought up the point yourself: how long will it
take to recover a 500 GB node? Good question. Neighbour nodes need to
anti-compact and stream data to the new node. (This is being optimized
in 0.7 but still involves some heavy lifting). You may want to look at
more nodes with less storage per node if you are worried about how
long recovering a RAID-0 node will take. These things can take time
(depending on hardware and load) and pretty much need to restart from
0 if they do not complete.


Re: Backup Strategy

2010-11-09 Thread Wayne
Thanks for the details. I think we were slowly starting to realize a similar
pattern, but you definitely helped fill in the gaps: home-brew rsync with
lzop in the middle. We have RAID-1 system/commit-log drives we are copying to
once a day, and off cluster... maybe once a week.
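
For what it's worth, a compress-in-transit copy along those lines can be as
simple as the sketch below (the host and paths are placeholders, and lzop must
be installed on the sending node):

# Hypothetical off-cluster copy: stream the data directory (or a snapshot
# directory) through lzop over ssh without writing a local archive first.
tar -C /var/lib/cassandra/data -cf - . \
  | lzop -c \
  | ssh backup-host "cat > /backups/$(hostname)-$(date +%Y%m%d).tar.lzo"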

Thanks



On Tue, Nov 9, 2010 at 12:04 PM, Edward Capriolo edlinuxg...@gmail.com wrote:

 On Tue, Nov 9, 2010 at 8:15 AM, Wayne wav...@gmail.com wrote:
  I got some very good advice on manual compaction so I thought I would
 throw
  out another question on raid/backup strategies for production clusters.
 
  We are debating going with raid 0 vs. raid 10 on our nodes for data
 storage.
  Currently all storage we use is raid 10 as drives always fail and raid 10
  basically makes a drive failure a non event. With Cassandra and a
  replication factor of 3 we start thinking that maybe raid 0 is good
 enough.
  Also since we are buying a lot more inexpensive servers raid 0 just seems
 to
  hit that price point a lot more.
 
  The problem now becomes how do we deal with the drives that WILL fail in
 a
  raid 0 node? We are trying to use snapshots etc. to back up the data but
 it
  is slow (hours) and slows down the entire node. We assume this will work
 if
  we backup every 2 days at the least in that hinted handoff/reads could
 help
  bring the data back into sync. If we can not backup every 1-2 days then
 we
  are stuck with nodetool repair, decommission, etc. and using some of
  Cassandra's build in capabilities but here things become more out of our
  control and we are afraid to trust it. Like many in recent posts we
 have
  been less than successful in testing this out in the .6.x branch.
 
  Can anyone share their decisions for the same and how they managed to
 deal
  with these issues? Coming from the relational world raid 10 has been an
  assumption for years, and we are not sure whether this assumption
 should
  be dropped or held on to. Our nodes in dev are currently around 500Gb so
 for
  us the question is how can we restore a node with this amount of data and
  how long will it take? Drives can and will fail, how can we make recovery
 a
  non event? What is our total recovery time window? We want it to be in
 hours
  after drive replacement (which will be in minutes).
 
  Thanks.
 
  Wayne
 

 Wayne,

 We were more worried about a DR scenario.

 Since SSTables are write once they make good candidates for
 incremental and/or differential backups. One option is do run
 cassandra snapshots and do incremental backups on that directory.

 We are doing something somewhat cool that I wanted to share. I hacked
 together an application that is something like cassandra/hadoop/rsync.
 Essentially take the SSTables from each node that are not in hadoop
 and copy them there. Write an index file of what SSTables lived on
 that node at time of snapshot. This gives us a couple of days
 retention as well.

 Snapshots X times daily and off cluster once a day. Makes me feel
 safer about our RAID-0

 I have seen you mention in two threads that you are looking to do
 500GB/node. You have brought up the point yourself How long will it
 take to recover a 500 GB Node? Good question. Neighbour nodes need to
 anti-compact and stream data to the new node. (This is being optimized
 in 7.0 but still involves some heavy lifting). You may want to look at
 more nodes with less storage per node if you are worried about how
 long recovering a RAID-0 node will take. These things can take time
 (depending on hardware and load) and pretty much need to restart from
 0 if they do not complete.