UUIDType

2011-11-20 Thread Guy Incognito
Am I correct that neither of Cassandra's UUIDTypes (at least in 0.7)
compares UUIDs according to RFC 4122 (i.e. as two unsigned longs)?
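
For reference, a minimal sketch of the distinction being asked about, in
Python (this only illustrates the two orderings; it makes no claim about
what Cassandra's comparator actually does):

    import uuid

    def rfc4122_cmp(a, b):
        """Compare two uuid.UUIDs as two unsigned 64-bit longs, MSB half first."""
        a_hi, a_lo = a.int >> 64, a.int & 0xFFFFFFFFFFFFFFFF
        b_hi, b_lo = b.int >> 64, b.int & 0xFFFFFFFFFFFFFFFF
        if a_hi != b_hi:
            return -1 if a_hi < b_hi else 1
        if a_lo != b_lo:
            return -1 if a_lo < b_lo else 1
        return 0

    def signed64(x):
        # reinterpret an unsigned 64-bit value as a Java-style signed long
        return x - (1 << 64) if x >> 63 else x

    u1 = uuid.UUID('7fffffff-0000-0000-0000-000000000000')
    u2 = uuid.UUID('80000000-0000-0000-0000-000000000000')
    print(rfc4122_cmp(u1, u2))   # -1: u1 < u2 when compared unsigned
    # a comparator using signed longs flips the order once the high bit is set:
    print(signed64(u2.int >> 64) < signed64(u1.int >> 64))   # True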


Re: Efficiency of Cross Data Center Replication...?

2011-11-20 Thread Boris Yen
A quick question: what if DC2 is down and after a while it comes back up?
How does the data get synced to DC2 in this case? (Assume hints are disabled.)

Thanks in advance.

On Thu, Nov 17, 2011 at 10:46 AM, Jeremiah Jordan 
jeremiah.jor...@morningstar.com wrote:

 Pretty sure data is sent to the coordinating node in DC2 at the same time
 it is sent to replicas in DC1, so I would think 10's of milliseconds after
 the transport time to DC2.

 On Nov 16, 2011, at 3:48 PM, ehers...@gmail.com wrote:

 On a related note - assuming there are available resources across the
 board (cpu and memory on every node, low network latency, non-saturated
 nics/circuits/disks), what's a reasonable expectation for timing on
 replication? Sub-second? Less than five seconds?

 Ernie

 On Wed, Nov 16, 2011 at 4:00 PM, Brian Fleming
 bigbrianflem...@gmail.com wrote:

 Great - thanks Jake

 B.

 On Wed, Nov 16, 2011 at 8:40 PM, Jake Luciani jak...@gmail.com wrote:

 the former


 On Wed, Nov 16, 2011 at 3:33 PM, Brian Fleming 
 bigbrianflem...@gmail.com wrote:


 Hi All,

  I have a question about inter-data centre replication: if you have 2
  Data Centers, each with a local RF of 2 (i.e. total RF of 4), and write to a
  node in DC1, how efficient is the replication to DC2 - i.e. is that data:
   - replicated over to a single node in DC2 once and internally
  replicated
   or
   - replicated explicitly to two separate nodes?

 Obviously from a LAN resource utilisation perspective, the former would
 be preferable.

 Many thanks,

 Brian




 --
 http://twitter.com/tjake







Re: Network traffic patterns

2011-11-20 Thread Boris Yen
I am just curious about which partitioner you are using?

On Thu, Nov 17, 2011 at 4:30 PM, Philippe watche...@gmail.com wrote:

 Hi Todd
 Yes all equal hardware. Nearly no CPU usage and no memory issues.
 Repairs are running in tens of minutes so i don't understand why
 replication would be backed up.

 Any other ideas?
  On 17 Nov 2011 02:33, Todd Burruss bburr...@expedia.com wrote:

 Are all of your machines equal hardware?  Since those machines are sending
 data somewhere, maybe they are behind in replicating and are continuously
 catching up?

 Use a tool like tcpdump to find out where the data is going

 From: Philippe watche...@gmail.com
 Reply-To: user@cassandra.apache.org user@cassandra.apache.org
 Date: Tue, 15 Nov 2011 13:22:38 -0800
 To: user user@cassandra.apache.org
 Subject: Re: Network traffic patterns

 Sorry about the previous message, I've enabled keyboard shortcuts on
 gmail...*sigh*...

 Hello,
 I'm trying to understand the network usage I am seeing in my cluster, can
 anyone shed some light?
  It's an RF=3, 12-node, cassandra 0.8.6 cluster. Repair is performed on
  each node once a week, with a rolling schedule.
 The nodes are p13,p14,p15...p24 and are consecutive in that order on the
 ring. Each node is only a cassandra database. I am hitting the cluster from
 another server (p4).

  p4 is doing this with 20 threads in parallel (a code sketch follows the list):

1. read a lot of data (some columns for hundreds to tens of thousands
of keys, split into 512-key multigets)
2. process the data
3. write back a byte array to cassandra (average size is 400 bytes)
4. go back to 1
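
  A rough sketch of that loop, assuming the pycassa client (keyspace, column
  family and host names are illustrative, and the 20-thread fan-out is omitted):

     import pycassa

     pool = pycassa.ConnectionPool('MyKeyspace', server_list=['p13:9160'])
     source = pycassa.ColumnFamily(pool, 'SourceCF')
     results = pycassa.ColumnFamily(pool, 'ResultCF')

     def chunks(keys, n=512):
         for i in range(0, len(keys), n):
             yield keys[i:i + n]

     def process(rows):
         # stand-in for the real computation; yields ~400-byte blobs
         return dict((key, 'x' * 400) for key in rows)

     def worker(keys):
         for batch in chunks(keys):               # 1. 512-key multigets
             rows = source.multiget(batch)
             out = process(rows)                  # 2. process the data
             for key, blob in out.items():        # 3. write back ~400-byte arrays
                 results.insert(key, {'data': blob})
         # 4. the real workers loop back to step 1 with a fresh key set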

  According to my munin graphs, network usage is about as follows. I am not
  surprised at the bias towards p13-p15, as p4 is getting & storing data
  mainly for keys located on one of those nodes.

- p4 : 1.5Mb/s in and out
- p13-p15 : 15Mb/s in and 80Mb/s out
- p16-p24 : 45Mb/s in and 5Mb/s out

  What I don't understand is why p4 is only seeing 1.5Mb/s while I see
  80Mb/s on p13 & p15.

 The way I understand this:

    - p4 makes a multiget to the cluster, electing to use any node in the
    cluster as coordinator (IN traffic for the query itself)
    - the coordinator relays the query to all 3 replicas (so 3 servers
    each get the IN traffic, mostly p13-p15)
- each server replies to coordinator
- coordinator chooses matching values and sends back data to p4

  So if p13-p15 are outputting 80Mb/s, why am I not seeing 80Mb/s coming
  into p4, which is on the receiving end?

 Thanks






Re: Efficiency of Cross Data Center Replication...?

2011-11-20 Thread Mohit Anchlia
On Sun, Nov 20, 2011 at 4:01 AM, Boris Yen yulin...@gmail.com wrote:
 A quick question: what if DC2 is down and after a while it comes back up?
 How does the data get synced to DC2 in this case? (Assume hints are disabled.)
 Thanks in advance.

Manually: run nodetool repair, in a rolling fashion, on all the nodes of DC2.
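
A sketch of such a rolling repair, assuming SSH access to each node and
nodetool on the PATH (host names are illustrative):

    import subprocess

    DC2_NODES = ['dc2-node1', 'dc2-node2', 'dc2-node3']

    for host in DC2_NODES:
        # one node at a time, so the cluster is not saturated by
        # simultaneous validation compactions and streaming
        subprocess.check_call(['ssh', host, 'nodetool', 'repair'])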



Re: Efficiency of Cross Data Center Replication...?

2011-11-20 Thread Jeremiah Jordan
If hinting is off, read repair and manual repair are the only ways the data
will get there (just as when a single node is down).

On Nov 20, 2011, at 6:01 AM, Boris Yen wrote:

 A quick question: what if DC2 is down and after a while it comes back up?
 How does the data get synced to DC2 in this case? (Assume hints are disabled.)
 
 Thanks in advance.
 



data agility

2011-11-20 Thread Dotan N.
Hi all,
my question may be more philosophical than technical, but please bear with
me.

Given that a young startup may not know its product fully at the early
stages, but that it definitely points to ~200M users,
would Cassandra be the right way to go?

That is, the requirement is for a large data store that can move swiftly
with product changes and requirements.

Given that in Cassandra one thinks hard about the queries first, and then
builds a model to suit them best, this situation strikes me as problematic.

So here are some questions:

- would it be wiser to start with a more agile data store (such as MongoDB)
and then move to Cassandra when the product itself solidifies?
- given that we start with Cassandra from the get-go, what is a common (and
quick, in terms of development) way or practice to change data and schemas
as the product evolves?
- is it even smart to start with Cassandra? would only startups whose core
business is big data start with it from the get-go?
- how would you do map/reduce with Cassandra? how agile is that? (for
example, can you run map/reduce _very_ frequently?)

Thanks!

--
Dotan, @jondot http://twitter.com/jondot


Re: data agility

2011-11-20 Thread David McNelis
Dotan,

I think that if you're in the early stages you have a basic idea of what
your product is going to be, architecturally speaking.  While you may
change your business model, or features at the display layer, I would think
the data model itself would remain relatively similar
throughout...otherwise you'd have another product on your hands, no?

But, even if your requirements radically shift, Cassandra is schemaless, so
you'd be able to make 'structural' changes to your data with less risk than
in a traditional RDBMS such as MySQL.
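
As a concrete sketch of that point, assuming the pycassa client (keyspace,
column family and column names are illustrative): rows in one column family
can carry different columns, so adding a field needs no ALTER-style migration.

    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
    users = pycassa.ColumnFamily(pool, 'Users')

    users.insert('user-1', {'name': 'Alice', 'email': 'a@example.com'})

    # later, the product grows a new attribute; just start writing it
    users.insert('user-2', {'name': 'Bob', 'email': 'b@example.com',
                            'referrer': 'campaign-42'})

    # old rows are untouched; readers simply see the column as absent
    print(users.get('user-1'))   # no 'referrer' column here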

At the end of the day, I don't think you've given enough information about
your proposed data model for anyone to say whether Cassandra would or would
not be the right choice for your startup.  If well administered, and
depending on the services offered, MySQL or Oracle could support a site
with 200M users, and a poorly designed Cassandra data store could work very
poorly for a site supporting 200 users.

I will say that I think it makes a lot of sense to use traditional RDBMS
systems for relational data and a Cassandra-like system when there is a
need for larger data storage, or for something that lends itself well to a
structureless design.  If you are using a framework that supports a good
ORM layer (e.g. Hibernate for Java), you can have your build process
update your database schema as you build out your application.  I haven't
done much work in Rails or Django, but I understand those support
transparent schema updating as well.  That sort of setup can work very
effectively in early development...but that is more a discussion for other
communities.

If you're interested in doing Map/Reduce jobs with Cassandra, look into
Brisk, the system created by DataStax (which is also open source) that
allows you to run Hadoop on top of your Cassandra cluster.  This may not be
exactly what you're looking for when asking this question...but it might
give you the insights you're looking for.

Hope this has been at least somewhat helpful.

David





-- 
*David McNelis*
Lead Software Engineer
Agentis Energy
www.agentisenergy.com
c: 219.384.5143

*A Smart Grid technology company focused on helping consumers of energy
control an often under-managed resource.*


Re: data agility

2011-11-20 Thread Stephen Connolly
if your startup is bootstrapping then cassandra is sometimes too heavy to
start with.

i.e. it needs to be fed ram... you're not going to seriously run it in less
than 1gb per node... that level of ram commitment can be too much while
bootstrapping.

if your startup has enough cash to pay for 3-5 recommended spec (see wiki)
nodes to be up 24/7 then cassandra is a good fit...

a friend of mine is bootstrapping a startup and had to drop back to mysql
while he finds his pain points and customers... he knows he will end up
jumping back to cassandra when he gets enough customers (or a VC) but for
now the running costs are too much to pay from his own pocket... note that
the jdbc driver and cql will make jumping back easy for him (as he still
tests with c*... he just runs against mysql at present... nuts, eh!)

- Stephen

---
Sent from my Android phone, so random spelling mistakes, random nonsense
words and other nonsense are a direct result of using swype to type on the
screen



Re: data agility

2011-11-20 Thread Dotan N.
Thanks David.
Stephen: thanks for the tip; we can run a recommended configuration, so
that wouldn't be an issue. I guess my questions really focus on the
complexity of development.

After digesting David's answer, I guess my follow-up questions would be:
- how would you process data in a cassandra cluster, typically? via one-off
coded offline jobs?
- how easy is map/reduce on existing data? (I just looked at brisk, but it
may be unrelated; in any case, not much is written about it)
- how would you do analytics over a cassandra cluster?
- given the common example of time series, how would you recommend
aggregating (sum, avg, facet) and providing statistics over the collected
data? for example, if it were some kind of logs and you'd like to group all
of certain fields in them, or provide a histogram over them.

Thanks!


--
Dotan, @jondot http://twitter.com/jondot







Re: data agility

2011-11-20 Thread Jahangir Mohammed
IMHO, you should start with a very simple RDBMS and meanwhile get a handle
on Cassandra or another NoSQL technology. Start out simple, but always be
aware and conscious of the next thing you will have in the stack. It's
time-consuming to work with a new technology if you are in the phase of
prototyping something fast and geared towards a VC demo. In most cases,
you won't need NoSQL for a while unless there is a very strong case.

Thanks,
Jahangir



Re: read performance problem

2011-11-20 Thread Jahangir Mohammed
There is something wrong with the system; your benchmark numbers are way
off. How are you benchmarking? Are you using the included stress tool?
On Nov 19, 2011 8:58 PM, Kent Tong freemant2...@yahoo.com wrote:

 Hi,

 On my computer with 2G RAM and a Core 2 Duo CPU E4600 @ 2.40GHz, I am
 testing the performance of Cassandra. The write performance is good: it can
 write a million records in 10 minutes. However, the read performance is
 poor: it takes 10 minutes to read 10K records with sequential keys starting
 from 0 (about 100 QPS). This is far from the 3,xxx QPS figures found on the
 net.

 Cassandra decided to use 1G as the Java heap size, which seems fine, as at
 the end of the benchmark the swap was barely used (only 1M used).

 I understand that my computer may not be as powerful as those used in the
 other benchmarks, but it shouldn't be that far off (1:30), right?

 Any suggestion? Thanks in advance!
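
For reference, a minimal sketch of this kind of read benchmark, assuming
the pycassa client (keyspace/CF names are illustrative). Note that a
single-threaded, one-key-at-a-time loop like this mostly measures
per-request round-trip latency, which is one common reason home-grown
numbers come out far below published QPS figures:

    import time
    import pycassa

    pool = pycassa.ConnectionPool('MyKeyspace', server_list=['localhost:9160'])
    cf = pycassa.ColumnFamily(pool, 'Data')

    N = 10000
    start = time.time()
    for i in range(N):
        cf.get(str(i))        # sequential keys starting from 0
    elapsed = time.time() - start
    print('%d reads in %.1fs = %.0f QPS' % (N, elapsed, N / elapsed))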




Re: data agility

2011-11-20 Thread Dotan N.
Jahangir, thanks! However, I've noted that we may very well need to scale
to 200M users or entities within a short amount of time - say a year or
two, with 10M within a few months.


--
Dotan, @jondot http://twitter.com/jondot






Re: data agility

2011-11-20 Thread Aaron Turner
Sounds like you need to figure out what your product is going to do
and what technology will best fit those requirements.  I know you're
worried about being agile and all that, but scaling requires you to
use the right tool for the job. Worry about new requirements when they
rear their ugly head rather than over a dozen "what if" scenarios.

You can scale MySQL/etc. and Cassandra, MongoDB, etc. to 10-200M users,
depending on what you're asking your datastore to do.  You haven't
really defined that at all, other than some comments about wanting to
do some map/reduce jobs.

Really what you should be doing is figuring out what kind of data you
need to store and your needs like access patterns, availability, ACID
compliance, etc and then figure out what technology is the best fit.
There are tons of Cassandra vs X comparisons for every NoSQL DB in
existence.

Other than that, map/reduce on Cassandra is more job-based than
useful for interactive queries, so if that is important then
Cassandra probably isn't a good fit.  You did mention time series data
too, and that's a sweet spot for Cassandra and not something I
personally would put in a document-based datastore like MongoDB.

Good luck.
-Aaron


Re: What sort of load do the tombstones create on the cluster?

2011-11-20 Thread Jahangir Mohammed
Mostly, they are I/O- and CPU-intensive during major compaction. If Ganglia
doesn't show anything suspicious there, then where is the performance loss?
Reads or writes?
On Nov 17, 2011 1:01 PM, Maxim Potekhin potek...@bnl.gov wrote:

 In view of my unpleasant discovery last week that deletions in Cassandra
 lead to a very real and serious performance loss, I'm working on a strategy
 for moving forward.

 If the tombstones do cause such problem, where should I be looking for
 performance bottlenecks?
 Is it disk, CPU or something else? Thing is, I don't see anything
 outstanding in my Ganglia plots.

 TIA,

 Maxim




Re: data agility

2011-11-20 Thread Dotan N.
Thanks Aaron. I kept this free of specific use cases so as to focus on the
higher-level description; that may not have been a good idea.
But generally I think I got a better intuition from the various answers.
Thanks!


--
Dotan, @jondot http://twitter.com/jondot




Re: data agility

2011-11-20 Thread Milind Parikh
For 99% of current applications requiring a persistent datastore, Oracle,
PgSQL and MySQL variants will suffice.

For the 1% of applications, consider C* if:

 (a) you have given up on distributed transactions (ACID-ly, but
not BASE-ically)
 (b) you are wondering about this new-fangled horizontal scalability
buzzword and wonder why disks cannot spin faster and faster
 (c) you need/want to design optimized query paths for your data, given
(a) and (b)

Rewording a, b and c:
  a.1 Cassandra provides best-in-class low-latency asynchronous
replication, with battle-hardened mechanisms to manage eventual
consistency in an inherently disordered (entropized) world... ACID
versus BASE transactions.
  b.1 Cassandra's write path is completely optimized. It will write
as fast as the disk will allow; but the most important feature is that
if you need to write faster than an individual server will allow, you add
more servers. The locality-of-data principle, inexorably faster
computation, and anti-entropy services enable you to cloud-scale.
  c.1 Writing is easy; but then you actually need to find the
data, and do it at scale, speed-wise.  The columnar nature of Cassandra,
the design of its internals, and support at the API level
(composite indexes) make fast querying possible.
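
As a small illustration of c.1, and of the earlier time-series question: a
sketch of one common pattern, assuming the pycassa client (keyspace/CF and
row names are illustrative). Bucket samples into one row per source per
day, let the comparator keep columns sorted by timestamp, slice a time
window, and aggregate client-side:

    import time
    import pycassa
    from pycassa.system_manager import SystemManager, LONG_TYPE

    # one-time schema setup: column names are longs, kept sorted by the comparator
    sysm = SystemManager('localhost:9160')
    sysm.create_column_family('Metrics', 'EventsByDay', comparator_type=LONG_TYPE)
    sysm.close()

    pool = pycassa.ConnectionPool('Metrics', server_list=['localhost:9160'])
    events = pycassa.ColumnFamily(pool, 'EventsByDay')

    row = 'sensor-7:2011-11-20'                  # one day of samples per row
    events.insert(row, {int(time.time() * 1000): '42.0'})

    # slice a time window and aggregate (sum/avg) on the client
    cols = events.get(row, column_start=0,
                      column_finish=int(time.time() * 1000),
                      column_count=100000)
    values = [float(v) for v in cols.values()]
    print('count=%d sum=%.1f avg=%.2f'
          % (len(values), sum(values), sum(values) / len(values)))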

Milind


On Sun, Nov 20, 2011 at 2:19 PM, Dotan N. dip...@gmail.com wrote:

 Thanks Aaron, I kept this use-case free as to focus on the higher level
 description, it might have been a not a good idea.
 But generally I think I got a better intuition from the various answers,
 thanks!


 --
 Dotan, @jondot http://twitter.com/jondot



 On Sun, Nov 20, 2011 at 11:52 PM, Aaron Turner synfina...@gmail.comwrote:

 Sounds like you need to figure out what your product is going to do
 and what technology will best fit those requirements.  I know you're
 worried about being agile and all that, but scaling requires you to
 use the right tool for the job. Worry about new requirements when they
 rear their ugly head rather then a dozen of what if scenarios.

 You can scale MySQL/etc and Cassandra, MongoDB, etc to 10-200M users
 depending on what you're asking your datastore to do.  You haven't
 defined that really at all other then some comments about wanting to
 do some map/reduce jobs.

 Really what you should be doing is figuring out what kind of data you
 need to store and your needs like access patterns, availability, ACID
 compliance, etc and then figure out what technology is the best fit.
 There are tons of Cassandra vs X comparisons for every NoSQL DB in
 existence.

 Other then that, the map/reduce on Cassandra is more job based rather
 then useful for interactive queries so if that is important then
 Cassandra prolly isn't a good fit.  You did mention time series data
 too, and that's a sweet spot for Cassandra and not something I
 personally would put in a document based datastore like MonogoDB.

 Good luck.
 -Aaron

 On Sun, Nov 20, 2011 at 1:24 PM, Dotan N. dip...@gmail.com wrote:
  Jahangir, thanks! however I've noted that we may very well need
 to scale to
  200M users or entities within a short amount of time - say a year or
 two,
  10M within few months.
 
  --
  Dotan, @jondot
 
 
  On Sun, Nov 20, 2011 at 11:14 PM, Jahangir Mohammed
  md.jahangi...@gmail.com wrote:
 
  IMHO, you should start with something very simple RDBMS and meanwhile
  getting handle over Cassandra or other noSql technology. Start out with
  simple, but always be aware and conscious of the next thing you will
 have in
  stack. It's timetaking to work with new technology if you are in the
 phase
  of prototyping something fast and geared towards a Vc demo. In most of
 the
  cases, you won't need noSql for a while unless there is a very strong
 case.
 
  Thanks,
  Jahangir
 
  On Nov 20, 2011 4:04 PM, Dotan N. dip...@gmail.com wrote:
 
  Thanks David.
  Stephen: thanks for the tip, we can run a recommended configuration,
 so
  that wouldn't be an issue. I guess I can focus that my questions are
 on
  complexity of development.
  After digesting David's answer, I guess my follow up questions would
 be
  - how would you process data in a cassandra cluster, typically? via
  one-off coded offline jobs?
  - how easy is map/reduce on existing data (just looked at brisk but it
  may be unrelated, any case, not too much written about it)
  - how would you do analytics over a cassandra cluster
  - given the common examples of time-series, how would you recommend to
  aggregate (sum, avg, facet) and provide statistics over the collected
 data?
  for example if it were kinds of logs and you'd like to group all of
 certain
  fields in it, or provide a histogram over it.
  Thanks!
 
  --
  Dotan, @jondot
 
 
  On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly
  stephen.alan.conno...@gmail.com wrote:
 
  if your startup is bootstrapping then cassandra is sometimes to
 heavy to
  start with.
 
  i.e. it needs to be 

Re: Network traffic patterns

2011-11-20 Thread Philippe
I'm using BOP.
On 20 Nov 2011 13:09, Boris Yen yulin...@gmail.com wrote:

 I am just curious about which partitioner you are using?





Re: Data Model Design for Login Service

2011-11-20 Thread Maciej Miklas
I will follow exactly this solution - thanks :)

On Fri, Nov 18, 2011 at 9:53 PM, David Jeske dav...@gmail.com wrote:

 On Thu, Nov 17, 2011 at 1:08 PM, Maciej Miklas 
 mac.mik...@googlemail.comwrote:

 A) Skinny rows
  - row key contains the login name - this is the main search criterion
  - login data is replicated - each possible login is stored as a single row
 which contains all user data - 10 logins for a single customer create 10
 rows, where each row has a different key and the same content


 To me this seems reasonable. Remember, because of your replication of the
 data values, you will want a quick way to find all the logins for a given
 ID, so you will also want to store a separate dataset like:

 1122 {
  alfred.tes...@xyz.de = 1   (where the login is a column key)
  alf...@aad.de = 1
 }

 When you do an update, you'll need to fetch the entire row for the
 user ID and then update all copies of the data. This can create problems
 if the data is out of sync (which it will be at certain times because of
 eventual consistency, and might be if something bad happens).

 ...the other option, of course, is to make a login-name indirection. You
 would have only one copy of the user data, stored by ID, and then you would
 store a separate mapping from login name to ID. Of course this would
 require two round trips to get the user information from a login name,
 which is something I know you said you didn't want to do.
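
A sketch of both designs, assuming the pycassa client (keyspace/CF names
and the example logins are illustrative):

    import pycassa

    pool = pycassa.ConnectionPool('Auth', server_list=['localhost:9160'])
    by_login = pycassa.ColumnFamily(pool, 'UserByLogin')    # row key = login name
    id_logins = pycassa.ColumnFamily(pool, 'LoginsById')    # row key = customer ID

    # A) skinny rows: replicate the user data under every login name, plus
    # an index row so all copies can be found and rewritten on update
    user_data = {'customer_id': '1122', 'status': 'active'}
    for login in ('alfred@example.de', 'alf@example.de'):
        by_login.insert(login, user_data)
        id_logins.insert('1122', {login: '1'})

    def update_user(customer_id, new_data):
        # fetch all logins for the ID, then rewrite every replicated copy
        for login in id_logins.get(customer_id):
            by_login.insert(login, new_data)

    # B) indirection: store the data once by ID and map login name -> ID;
    # reads then cost two round trips (login -> ID, then ID -> data)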