DataStax Enterprise bundles Spark and the Spark connector on the DSE nodes and 
handles much of the plumbing work (and monitoring, etc.). Worth a look.


Sean Durity

From: Avi Levi [mailto:a...@indeni.com]
Sent: Tuesday, August 22, 2017 2:46 AM
To: user@cassandra.apache.org
Subject: Re: Getting all unique keys

Thanks Christophe, we will definitely consider that in the future.

On Mon, Aug 21, 2017 at 3:01 PM, Christophe Schmitz 
<christo...@instaclustr.com> wrote:
Hi Avi,

The Spark project documentation is quite good, as is the 
spark-cassandra-connector GitHub project, which contains some basic examples 
you can easily get inspired from. A few pieces of advice you might find useful:
- You will want one Spark worker on each node, and a Spark master on either 
one of the nodes or on a separate node.
- Pay close attention to your port configuration (firewall), as the Spark 
error log does not always give you the right hint.
- Pay close attention to your heap sizes. Make sure Cassandra heap size + 
Spark heap size < your node's memory (taking into account Cassandra off-heap 
usage if enabled, the OS, etc.).
- If your Cassandra data center is used in production, make sure you throttle 
reads and writes from Spark, pay attention to your latencies, and consider 
using a separate analytics Cassandra data center if you get serious with 
Spark (see the configuration sketch after this list).
- More or less everyone I know finds that writing Spark jobs in Scala is 
natural, while writing them in Java is painful :D

Getting Spark running will be a bit of an investment at the beginning, but 
overall you will find that it allows you to run queries you can't naturally 
express in Cassandra, like the one you described.

Cheers,

Christophe

On 21 August 2017 at 16:16, Avi Levi <a...@indeni.com> 
wrote:
Thanks Christophe,
we didn't want to add too many moving parts, but it sounds like a good 
solution. Do you have any reference/link that I can look at?

Cheers
Avi

On Mon, Aug 21, 2017 at 3:43 AM, Christophe Schmitz 
<christo...@instaclustr.com> wrote:
Hi Avi,

Have you thought of using Spark for that work? If you colocate the Spark 
workers on the Cassandra nodes, the spark-cassandra-connector will 
automatically split the token range for you in such a way that each Spark 
worker only hits its local Cassandra node. This will also be done in 
parallel, so it should be much faster that way.
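As a rough sketch of that approach (the keyspace, table, and host names are 
placeholders for the schema discussed further down), reading the partition 
keys through the connector could look like:

import com.datastax.spark.connector._
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("unique-partition-keys")
  .set("spark.cassandra.connection.host", "10.0.0.1")  // placeholder host
val sc = new SparkContext(conf)

// The connector maps Spark partitions onto slices of the token ring, so
// colocated workers read mostly from their local Cassandra node.
val uniqueIds = sc.cassandraTable("my_keyspace", "my_table")
  .select("id")
  .map(_.getString("id"))
  .distinct()

println(s"unique partition keys: ${uniqueIds.count()}")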

Cheers,
Christophe


On 21 August 2017 at 01:34, Avi Levi <a...@indeni.com> 
wrote:
Thank you very much. One question: you wrote that I do not need DISTINCT here 
since it's part of the primary key, but only the combination is unique 
(PRIMARY KEY (id, timestamp)). Also, if I take the last token and feed it 
back as you showed, wouldn't I get overlapping boundaries?

On Sun, Aug 20, 2017 at 6:18 PM, Eric Stevens 
<migh...@gmail.com> wrote:
You should be able to fairly efficiently iterate all the partition keys like:

select id, token(id) from table where token(id) >= -9204925292781066255 limit 
1000;
 id                                         | system.token(id)
--------------------------------------------+----------------------
...
 0xb90ea1db5c29f2f6d435426dccf77cca6320fac9 | -7821793584824523686

Take the last token you receive and feed it back in, skipping duplicates from 
the previous page (on the unlikely chance that you have two IDs with a token 
collision on the page boundary):

select id, token(id) from table where token(id) >= -7821793584824523686 limit 
1000;
 id                                         | system.token(id)
--------------------------------------------+---------------------
...
 0xc6289d729c9087fb5a1fe624b0b883ab82a9bffe | -434806781044590339

Continue until you have no more results.  You don't really need distinct here: 
it's part of your primary key, it must already be distinct.
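A sketch of that paging loop in Scala, using the DataStax Java driver (3.x 
API; the host, keyspace, and table names are placeholders):

import com.datastax.driver.core.Cluster
import scala.collection.JavaConverters._

val cluster = Cluster.builder().addContactPoint("127.0.0.1").build()
val session = cluster.connect("my_keyspace")

var lastToken = Long.MinValue        // -9223372036854775808, the ring minimum
var lastIds   = Set.empty[String]    // ids that shared the previous boundary token
var done      = false

while (!done) {
  val rows = session.execute(
    s"SELECT id, token(id) AS t FROM my_table WHERE token(id) >= $lastToken LIMIT 1000"
  ).all().asScala

  // Skip ids already emitted, in case several ids share the boundary token.
  rows.filterNot(r => lastIds.contains(r.getString("id")))
      .foreach(r => println(r.getString("id")))

  if (rows.size < 1000) done = true
  else {
    lastToken = rows.last.getLong("t")
    lastIds   = rows.filter(_.getLong("t") == lastToken).map(_.getString("id")).toSet
  }
}
cluster.close()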

If you want to parallelize it, split the ring into n ranges and include each 
range's upper bound in the query for its segment:

select id, token(id) from table where token(id) >= -9204925292781066255 AND 
token(id) < $rangeUpperBound limit 1000;
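One way to compute those n ranges (a sketch, assuming the Murmur3 partitioner 
and its full signed-64-bit token ring):

// Split the full token ring into n contiguous, non-overlapping ranges so the
// paging queries can run in parallel, one worker per range.
def tokenRanges(n: Int): Seq[(Long, Long)] = {
  val min  = BigInt(Long.MinValue)   // -9223372036854775808
  val max  = BigInt(Long.MaxValue)   //  9223372036854775807
  val step = (max - min + 1) / n
  (0 until n).map { i =>
    val lo = min + step * i
    val hi = if (i == n - 1) max else min + step * (i + 1) - 1
    (lo.toLong, hi.toLong)
  }
}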


On Sun, Aug 20, 2017 at 12:33 AM Avi Levi 
<a...@indeni.com> wrote:
I need to get all unique keys (not the complete primary key, just the 
partition key) in order to aggregate all the relevant records of that key and 
apply some calculations to them.


CREATE TABLE my_table (
    id text,
    timestamp bigint,
    value double,
    PRIMARY KEY (id, timestamp)
)

I know that a query like this

SELECT DISTINCT id FROM my_table

is not very efficient, but how about the approach presented here 
(http://www.scylladb.com/2017/02/13/efficient-full-table-scans-with-scylla-1-6/), 
sending queries in parallel and using token ranges:

SELECT DISTINCT id FROM my_table WHERE token(id) >= -9223372036854775808 AND 
token(id) <= -9204925292781066255;

or I can just maintain another table with the unique keys

CREATE TABLE id_only (
    id text,
    PRIMARY KEY (id)
)

but I tend not to, since it is error-prone and would require additional 
procedures to maintain data integrity between those two tables.

Any ideas?

Thanks

Avi




--

Christophe Schmitz
Director of consulting EMEA




