RE: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-21 Thread JOHN, BIBIN
CDC from Cassandra works using Oracle GoldenGate for Big Data; we are doing that 
and publishing to Kafka. But one of the downstream systems needs batch files with 
the complete dataset.
I am evaluating some options based on previous responses.

Thanks
Bibin John

Re: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-21 Thread Peter Corless
Question: would daily deltas be a good use of CDC? (Rather than export
entire tables.)

(I can understand that this might make analytics hard if you need to span
multiple resultant daily files.)

Perhaps along with CDC, maybe set up the tables for export via a Kafka
topic?

(https://docs.lenses.io/connectors/source/cassandra.html)

Or maybe some sort of exporter using Apache Spark?

https://github.com/scylladb/scylla-migrator

I'm just trying to throw out a few other ideas on how to solve the export
problem.
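
If the Spark route is interesting, a minimal sketch of the read/write side
might look like the following (untested; it assumes the
spark-cassandra-connector is on the classpath, and the contact point,
keyspace, and table names are placeholders):

from pyspark.sql import SparkSession

# Build a session pointed at the cluster (contact point is a placeholder).
spark = (SparkSession.builder
         .appName("cassandra-daily-export")
         .config("spark.cassandra.connection.host", "10.0.0.1")
         .getOrCreate())

# Read the whole table through the connector's data source.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="my_ks", table="my_table")
      .load())

# Write the day's export as Parquet (or CSV if the downstream wants flat files).
df.write.mode("overwrite").parquet("/exports/my_table/2020-02-21")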

RE: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-21 Thread Durity, Sean R
I would also push for something besides a full refresh, if at all possible. It 
feels like a waste of resources to me – and not predictably scalable. 
Suggestions: use a queue to send writes to both systems. If the downstream 
system doesn’t handle TTL, perhaps set an expiration date and a purge query on 
the downstream target.

If you have to do the full refresh, perhaps a Spark job would be a decent 
solution. I would probably create a separate DC (with a lower replication 
factor and smaller number of nodes) just to handle the analytical/unload kind 
of workload (if the other functions of the cluster might be impacted by the 
unload).
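
For instance (a sketch only; keyspace and DC names are placeholders), adding
the analytics DC would look something like:

ALTER KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'main_dc': 3, 'analytics_dc': 1};

with the unload jobs using a DC-aware load balancing policy pinned to
analytics_dc and LOCAL_ONE/LOCAL_QUORUM reads.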

DSBulk from DataStax is very fast and scriptable, too.
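
A rough example of an unload (host, keyspace, and table are placeholders;
check the flags against your DSBulk version):

dsbulk unload -h 10.0.0.1 -k my_ks -t my_table -url /exports/my_table

It writes CSV by default and parallelizes reads across token ranges.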

Sean Durity – Staff Systems Engineer, Cassandra

RE: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-19 Thread JOHN, BIBIN
Thank you for the suggestion. A full refresh is the current design because, with 
deltas, we cannot identify what got deleted. So the downstream systems prefer full 
data every day.


Thanks
Bibin John

Re: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-19 Thread Reid Pinchback
To the question of ‘best approach’, so far the comments have been about 
alternatives in tools.

Another axis you might want to consider is from the data model viewpoint.  So, 
for example, let’s say you have 600M rows.  You want to do a daily transfer of 
data for some reason.  First question that comes to mind is, do you need all 
the data every day?  Usually that would only be the case if all of the data is 
at risk of changing.

Generally the way I’d cut down the pain on something like this is to figure out 
if the data model currently does, or could be made to, only mutate in a limited 
subset.  Then maybe all you are transferring are the daily changes.  Systems 
based on catching up to daily changes will usually be pulling single-digit 
percentages of data volume compared to the entire storage footprint.  That’s 
not only a lot less data to pull, it’s also a lot less impact on the ongoing 
operations of the cluster while you are pulling that data.
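
As a concrete (entirely hypothetical) sketch of that pattern with the DataStax
Python driver, assuming a made-up changes_by_day table that captures each
day's mutations (bucketed further if one day's changes are large):

import datetime
from cassandra.cluster import Cluster

# Contact point and keyspace are placeholders.
cluster = Cluster(["10.0.0.1"])
session = cluster.connect("my_ks")

# Assumes something along the lines of:
#   CREATE TABLE changes_by_day (day date, id uuid, payload text,
#                                PRIMARY KEY (day, id));
rows = session.execute(
    "SELECT * FROM changes_by_day WHERE day = %s",
    (datetime.date(2020, 2, 21),))

for row in rows:
    pass  # append each changed row to the daily export file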

R

From: "JOHN, BIBIN" 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, February 19, 2020 at 1:13 PM
To: "user@cassandra.apache.org" 
Subject: Mechanism to Bulk Export from Cassandra on daily Basis

Message from External Sender
Team,
We have a requirement to bulk export data from Cassandra on daily basis? Table 
contain close to 600M records and cluster is having 12 nodes. What is the best 
approach to do this?


Thanks
Bibin John


Re: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-19 Thread Aakash Pandhi
John,
COPY is not recommended for more than 2 million rows, so COPY is ruled out in your 
case for those 30 tables you mentioned.

Sincerely,

Aakash Pandhi

Re: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-19 Thread Amanda Moran
Hi there-

DataStax recently released their bulk loader, DSBulk, into the OSS community.

I would take a look and at least try it out:
https://docs.datastax.com/en/dsbulk/doc/dsbulk/dsbulkAbout.html
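
(It can also unload the result of a custom query rather than a whole table,
e.g. dsbulk unload -query "SELECT * FROM my_ks.my_table WHERE day =
'2020-02-21'" -url /exports/deltas, where the day partition column is
hypothetical. That helps if you can export deltas instead of full tables.)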

Good luck!

Amanda

RE: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-19 Thread JOHN, BIBIN
Thanks for the response. We need to export into a flat file and send it to another 
analytical application. There are 137 tables, and 30 of them have 300M+ records, 
so “COPY TO” is taking a lot of time.
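
(For reference, a per-table cqlsh export looks like COPY my_ks.my_table TO
'/exports/my_table.csv' WITH HEADER = TRUE; with placeholder names and paths.
COPY runs from a single client machine, which is part of why it struggles at
this scale.)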

Thank you
Bibin John

Re: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-19 Thread Aakash Pandhi
John,

Greetings,

Is the requirement just to export data from the table and stage it somewhere, or 
to export it and load it into another cluster/table?

sstableloader is a utility that can help you, as it is designed for bulk loading.
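
(A rough outline for the load-into-another-cluster case, with placeholder
hosts and paths: take a snapshot with nodetool snapshot my_ks, then stream the
snapshot's SSTables to the target cluster with sstableloader -d
10.0.1.1,10.0.1.2 /path/to/my_ks/my_table/.)
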
Sincerely,

Aakash Pandhi

Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-19 Thread JOHN, BIBIN
Team,
We have a requirement to bulk export data from Cassandra on a daily basis. The 
table contains close to 600M records and the cluster has 12 nodes. What is the 
best approach to do this?


Thanks
Bibin John