RE: data modeling qu: use a Map datatype, or just simple rows... ?

2020-10-01 Thread Durity, Sean R
I’m a little late on this one, but I would choose approach 1. It is much more 
comprehensible to anyone who comes afterwards, and it should easily scale in 
Cassandra to whatever volume you have. I think I would call the table 
recent_users to make the purpose of the table very clear. It is also 
extensible if you want to add new features like last_page_visited or 
last_purchase_date – just add new columns. Data expiration is automatic with 
the default TTL on the table.
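
For illustration, a minimal sketch of what that could look like - the extra 
feature columns and the concrete 90-day TTL value are assumptions added here 
just to make the example complete:

CREATE TABLE IF NOT EXISTS recent_users (
    user_id            varchar,
    last_seen          int,       -- UNIX timestamp, as in the original proposal
    last_page_visited  varchar,   -- hypothetical feature column added later
    last_purchase_date int,       -- hypothetical feature column added later
    PRIMARY KEY (user_id)
) WITH default_time_to_live = 7776000;   -- ~3 months (90 days) in seconds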

Approach 2 could have tombstone issues within the map, and it would be hard to 
extend for new features. I think clean-up would require a separate process, 
too. I don’t think you can expire entries within a map column using TTL.
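
For what it’s worth, cleaning a single user out of the Approach 2 map would be 
a per-element delete, and each such delete writes a tombstone inside the map. 
A sketch, assuming the Approach 2 table from the original question below (the 
uuid literal and the bucket value 42 are placeholders):

DELETE users_seen['<some-user-uuid>'] FROM users WHERE user_id_hash = 42;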

Sean Durity


Re: data modeling qu: use a Map datatype, or just simple rows... ?

2020-09-19 Thread Rahul Singh
Not necessarily. A deterministic hash randomizes a key that may be susceptible 
to “clustering”, and that key may also need to be used in other, non-Cassandra 
systems.

This way records can be accessed in both systems while leveraging the 
partitioner in Cassandra without pitfalls.

The same can be done with natural string keys like “email.”

Best regards,
Rahul Singh

Re: data modeling qu: use a Map datatype, or just simple rows... ?

2020-09-19 Thread Sagar Jambhulkar
I don't really see a difference between the two options. Won't the partitioner
run on user id and create a hash for you? Unless your hash function is better
than the partitioner's.
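
As a side note, the partitioner's hash is visible via CQL's token() function; 
a quick sketch against the Approach 1 table from the original question (the 
user id literal is a placeholder):

SELECT token(user_id), last_seen FROM users WHERE user_id = '<some-user-uuid>';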



Re: data modeling qu: use a Map datatype, or just simple rows... ?

2020-09-19 Thread onmstester onmstester
I used Cassandra sets (no experience with maps), and one thing that is for 
sure is that with Cassandra collections you are limited to a few thousand 
entries per row (fewer than 10K for better performance).


data modeling qu: use a Map datatype, or just simple rows... ?

2020-09-18 Thread Attila Wind

Hey guys,

I'm curious about your experiences regarding a data modeling question we are 
facing.
At the moment we see 2 major different approaches in terms of how to build 
the tables.
But I have been googling around for days with no luck finding any useful 
material explaining how a Map (as a collection datatype) works on the storage 
engine, and what could surprise us later if we used it. So I decided to ask 
this question... (If someone has some nice pointers here, that is also much 
appreciated!)


So
*To describe the problem* in a simplified form:

 * Imagine you have users (everyone is identified with a UUID),
 * and we want to answer a simple question: "have we seen this guy before?"
 * we "just" want to be able to answer this question for a limited time
   - let's say for 3 months
 * but... there are lots and lots of users we run into... many
   millions each day...
 * and only ~15-20% of them are returning users - so many of them we
   might see just once

We are thinking about something like a big big Map, in the form of
    userId => lastSeenTimestamp

Obviously, if we had something like that, then answering the above question 
would simply be:

    if (map.get(userId) != null)  => TRUE - we have seen the guy before

Regarding the 2 major modelling approaches I mentioned above

*Approach 1*
Just simply use a table, something like this

CREATE TABLE IF NOT EXISTS users (
    user_id      varchar,
    last_seen    int,    -- a UNIX timestamp is enough, that's why int
    PRIMARY KEY (user_id)
)
WITH default_time_to_live = <3 months of seconds>;
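
For example, the write and read paths would presumably look like this (the 
literals are placeholders):

-- write path: a plain upsert, no read required
INSERT INTO users (user_id, last_seen) VALUES ('<some-user-uuid>', 1600000000);

-- read path: "have we seen this guy before?"
SELECT last_seen FROM users WHERE user_id = '<some-user-uuid>';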

*Approach 2*
To avoid producing that many rows, "cluster" the guys a bit together (into 1 
row), so:
introduce a hashing function over the userId, producing a value between [0; 1]

and go with a table like

CREATE TABLE IF NOT EXISTS users (
    user_id_hash    int,
    users_seen      map<varchar, int>,    -- this is a userId => last timestamp map
    PRIMARY KEY (user_id_hash)
)
WITH default_time_to_live = <3 months of seconds>;    -- yes, it's clearly not a good enough way ...
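
Here the paths would presumably look something like this (literals are 
placeholders; note that the read pulls back the whole map for the bucket):

-- write path: per-entry upsert into the map, still no read required
UPDATE users SET users_seen['<some-user-uuid>'] = 1600000000
  WHERE user_id_hash = 42;

-- read path: fetches the entire map; the lookup then happens client-side
SELECT users_seen FROM users WHERE user_id_hash = 42;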



In theory:

 * on the WRITE path both representations give us a way to do the write
   without the need of a read
 * even the READ path is pretty efficient in both cases
 * Approach 2 is definitely worse when we come to the cleanup - "remove
   info if older than 3 months"
 * Approach 2 might affect the balance of the cluster more - that's clear
   (however not that much, due to the "law of large numbers" and enough
   random factors)

And what we are struggling with is: what do you think -
*which approach would be better over time?* That is, which will slow down the 
cluster less, considering compaction etc.?


As far as we can see, the real question is:

which hurts more?

 * many more rows, but very small rows (regarding data size), or
 * far fewer rows, but much bigger rows (regarding data size)

?

Any thoughts, comments, pointers to some related case studies, articles, etc. 
are highly appreciated!! :-)


thanks!

--
Attila Wind

http://www.linkedin.com/in/attilaw
Mobile: +49 176 43556932