UUIDType
Am I correct that neither of Cassandra's UUIDTypes (at least in 0.7) compares UUIDs according to RFC 4122 (i.e. as two unsigned longs)?
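For reference, a minimal sketch (assuming nothing about Cassandra's actual comparators) of the ordering the question describes: treating a UUID as two unsigned 64-bit halves, most significant first.

```java
import java.util.Comparator;
import java.util.UUID;

/** Compares UUIDs as two unsigned 64-bit values (most-significant half first),
 *  i.e. the unsigned ordering the question refers to.
 *  This is only an illustration of that ordering, not Cassandra's comparator. */
public final class UnsignedUuidComparator implements Comparator<UUID> {
    @Override
    public int compare(UUID a, UUID b) {
        // Flipping the sign bit makes signed comparison behave like unsigned comparison.
        long aHi = a.getMostSignificantBits() ^ Long.MIN_VALUE;
        long bHi = b.getMostSignificantBits() ^ Long.MIN_VALUE;
        if (aHi != bHi) return aHi < bHi ? -1 : 1;
        long aLo = a.getLeastSignificantBits() ^ Long.MIN_VALUE;
        long bLo = b.getLeastSignificantBits() ^ Long.MIN_VALUE;
        return aLo < bLo ? -1 : (aLo > bLo ? 1 : 0);
    }
}
```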
Re: Efficiency of Cross Data Center Replication...?
A quick question: what if DC2 is down and comes back up after a while? How does the data get synced to DC2 in this case (assume hinted handoff is disabled)? Thanks in advance.

On Thu, Nov 17, 2011 at 10:46 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: Pretty sure data is sent to the coordinating node in DC2 at the same time it is sent to replicas in DC1, so I would think tens of milliseconds after the transport time to DC2.

On Nov 16, 2011, at 3:48 PM, ehers...@gmail.com wrote: On a related note - assuming there are available resources across the board (CPU and memory on every node, low network latency, non-saturated NICs/circuits/disks), what's a reasonable expectation for timing on replication? Sub-second? Less than five seconds? Ernie

On Wed, Nov 16, 2011 at 4:00 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Great - thanks Jake. B.

On Wed, Nov 16, 2011 at 8:40 PM, Jake Luciani jak...@gmail.com wrote: The former.

On Wed, Nov 16, 2011 at 3:33 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Hi all, I have a question about inter-data-centre replication: if you have 2 data centers, each with a local RF of 2 (i.e. a total RF of 4), and write to a node in DC1, how efficient is the replication to DC2 - i.e. is that data: - replicated over to a single node in DC2 once and then internally replicated, or - replicated explicitly to two separate nodes? Obviously from a LAN resource-utilisation perspective, the former would be preferable. Many thanks, Brian -- http://twitter.com/tjake
Re: Network traffic patterns
I am just curious: which partitioner are you using?

On Thu, Nov 17, 2011 at 4:30 PM, Philippe watche...@gmail.com wrote: Hi Todd. Yes, all equal hardware. Nearly no CPU usage and no memory issues. Repairs are running in tens of minutes, so I don't understand why replication would be backed up. Any other ideas?

On 17 Nov 2011 at 02:33, Todd Burruss bburr...@expedia.com wrote: Are all of your machines equal hardware? Since those machines are sending data somewhere, maybe they are behind in replicating and are continuously catching up? Use a tool like tcpdump to find out where the data is going.

From: Philippe watche...@gmail.com Reply-To: user@cassandra.apache.org Date: Tue, 15 Nov 2011 13:22:38 -0800 To: user@cassandra.apache.org Subject: Re: Network traffic patterns

Sorry about the previous message, I've enabled keyboard shortcuts on Gmail... *sigh*...

Hello, I'm trying to understand the network usage I am seeing in my cluster; can anyone shed some light? It's an RF=3, 12-node, Cassandra 0.8.6 cluster. Repair is performed on each node once a week, on a rolling schedule. The nodes are p13, p14, p15 ... p24 and are consecutive in that order on the ring. Each node is only a Cassandra database. I am hitting the cluster from another server (p4). p4 is doing this with 20 threads in parallel:
1. read a lot of data (some columns for hundreds to tens of thousands of keys, split into 512-key multigets)
2. process the data
3. write back a byte array to Cassandra (average size is 400 bytes)
4. go back to 1

According to my munin graphs, network usage is about as follows. I am not surprised at the bias towards p13-p15, as p4 is getting/storing data mainly for keys located on one of those nodes.
- p4: 1.5 Mb/s in and out
- p13-p15: 15 Mb/s in and 80 Mb/s out
- p16-p24: 45 Mb/s in and 5 Mb/s out

What I don't understand is why p4 is only seeing 1.5 Mb/s while I see 80 Mb/s out of p13-p15. The way I understand this:
- p4 makes a multiget to the cluster, electing to use any node in the cluster (IN traffic describing the query)
- the coordinator node replays the query on all 3 replicas (so 3 servers each get the IN traffic, mostly p13-p15)
- each server replies to the coordinator
- the coordinator chooses the matching values and sends the data back to p4
So if p13-p15 are outputting 80 Mb/s, why am I not seeing 80 Mb/s coming into p4, which is on the receiving end? Thanks

2011/11/15 Philippe watche...@gmail.com: Hello, I'm trying to understand the network usage I am seeing in my cluster; can anyone shed some light? It's an RF=3, 12-node, Cassandra 0.8.6 cluster. The nodes are p13, p14, p15 ... p24 and are consecutive in that order on the ring. Each node is only a Cassandra database. I am hitting the cluster from another server (p4). The pattern on p4 is to:
1. read a lot of data (some columns for hundreds to tens of thousands of keys, split into 512-key multigets)
2. process the data
3. write back a byte array to Cassandra (average size is 400 bytes)
p4 reads as
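As an aside, the "512-key multigets" pattern described above is just client-side batching of a large key set. A minimal, client-agnostic sketch follows; the fetchBatch call is a hypothetical placeholder, not any specific Cassandra client API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

public class MultigetBatcher {
    static final int BATCH_SIZE = 512;

    /** Hypothetical reader standing in for whatever call actually performs the
     *  multiget (Thrift multiget_slice, Hector, etc.). */
    interface BatchReader<K, V> {
        Map<K, V> fetchBatch(List<K> keys);
    }

    /** Splits a large key list into fixed-size batches, as described in the thread. */
    static <K, V> void readInBatches(List<K> keys, BatchReader<K, V> reader) {
        for (int start = 0; start < keys.size(); start += BATCH_SIZE) {
            int end = Math.min(start + BATCH_SIZE, keys.size());
            List<K> batch = new ArrayList<>(keys.subList(start, end));
            Map<K, V> rows = reader.fetchBatch(batch);
            // process(rows) and write back results here
        }
    }
}
```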
Re: Efficiency of Cross Data Center Replication...?
On Sun, Nov 20, 2011 at 4:01 AM, Boris Yen yulin...@gmail.com wrote: A quick question: what if DC2 is down and comes back up after a while? How does the data get synced to DC2 in this case (assume hinted handoff is disabled)? Thanks in advance.

Manually: use nodetool repair in a rolling fashion on all the nodes of DC2.

On Thu, Nov 17, 2011 at 10:46 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: Pretty sure data is sent to the coordinating node in DC2 at the same time it is sent to replicas in DC1, so I would think tens of milliseconds after the transport time to DC2.

On Nov 16, 2011, at 3:48 PM, ehers...@gmail.com wrote: On a related note - assuming there are available resources across the board (CPU and memory on every node, low network latency, non-saturated NICs/circuits/disks), what's a reasonable expectation for timing on replication? Sub-second? Less than five seconds? Ernie

On Wed, Nov 16, 2011 at 4:00 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Great - thanks Jake. B.

On Wed, Nov 16, 2011 at 8:40 PM, Jake Luciani jak...@gmail.com wrote: The former.

On Wed, Nov 16, 2011 at 3:33 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Hi all, I have a question about inter-data-centre replication: if you have 2 data centers, each with a local RF of 2 (i.e. a total RF of 4), and write to a node in DC1, how efficient is the replication to DC2 - i.e. is that data: - replicated over to a single node in DC2 once and then internally replicated, or - replicated explicitly to two separate nodes? Obviously from a LAN resource-utilisation perspective, the former would be preferable. Many thanks, Brian -- http://twitter.com/tjake
Re: Efficiency of Cross Data Center Replication...?
If hinting is off, Read Repair and Manual Repair are the only ways data will get there (just like when a single node is down).

On Nov 20, 2011, at 6:01 AM, Boris Yen wrote: A quick question: what if DC2 is down and comes back up after a while? How does the data get synced to DC2 in this case (assume hinted handoff is disabled)? Thanks in advance.

On Thu, Nov 17, 2011 at 10:46 AM, Jeremiah Jordan jeremiah.jor...@morningstar.com wrote: Pretty sure data is sent to the coordinating node in DC2 at the same time it is sent to replicas in DC1, so I would think tens of milliseconds after the transport time to DC2.

On Nov 16, 2011, at 3:48 PM, ehers...@gmail.com wrote: On a related note - assuming there are available resources across the board (CPU and memory on every node, low network latency, non-saturated NICs/circuits/disks), what's a reasonable expectation for timing on replication? Sub-second? Less than five seconds? Ernie

On Wed, Nov 16, 2011 at 4:00 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Great - thanks Jake. B.

On Wed, Nov 16, 2011 at 8:40 PM, Jake Luciani jak...@gmail.com wrote: The former.

On Wed, Nov 16, 2011 at 3:33 PM, Brian Fleming bigbrianflem...@gmail.com wrote: Hi all, I have a question about inter-data-centre replication: if you have 2 data centers, each with a local RF of 2 (i.e. a total RF of 4), and write to a node in DC1, how efficient is the replication to DC2 - i.e. is that data: - replicated over to a single node in DC2 once and then internally replicated, or - replicated explicitly to two separate nodes? Obviously from a LAN resource-utilisation perspective, the former would be preferable. Many thanks, Brian -- http://twitter.com/tjake
data agility
Hi all, my question may be more philosophical than technically related to Cassandra, but please bear with me. Given that a young startup may not know its product fully at the early stages, but that it is definitely aiming at ~200M users, would Cassandra be the right way to go? That is, the requirement is for a large data store that can move swiftly with product changes and requirements. Given that in Cassandra one thinks hard about the queries and then builds a model to suit them best, I was thinking of this situation as problematic. So here are some questions:
- would it be wiser to start with a more agile data store (such as MongoDB) and then progress to Cassandra when the product itself solidifies?
- given that we start with Cassandra from the get-go, what is a common (and quick in terms of development) way or practice to change data and schemas as the product evolves?
- is it even smart to start with Cassandra? Would only startups whose core business is big data start with it from the get-go?
- how would you do map/reduce with Cassandra? How agile is that? (for example, can you run map/reduce _very_ frequently?)
Thanks! -- Dotan, @jondot http://twitter.com/jondot
Re: data agility
Dotan, I think that if you're in the early stages you have a basic idea of what your product is going to be, architecturally speaking. While you may change your business model, or features at the display layer, I would think the data model itself would remain relatively similar throughout... otherwise you'd have another product on your hands, no? But even if your requirements radically shift, Cassandra is schemaless, so you'd be able to make 'structural' changes to your data with less risk than in a traditional RDBMS such as MySQL.

At the end of the day, I don't think you've given enough information about your proposed data models for anyone to say "Yes, Cassandra would (or would not) be the right choice for your startup." If well administered, and depending on the services offered, MySQL or Oracle could support a site with 200M users, and a poorly designed Cassandra data store could work very poorly for a site supporting 200 users. I will say that I think it makes a lot of sense to use traditional RDBMS systems for relational data, and a Cassandra-like system when there is a need for larger data storage or for something that lends itself well to a structureless design.

If you are using a framework that supports a good ORM layer (e.g. Hibernate for Java), you can have your build process update your database schema as you build out your application. I haven't done much work in Rails or Django, but I understand those support transparent schema updating as well. That sort of setup can work very effectively in early development... but that is more a discussion for other communities.

If you're interested in doing map/reduce jobs with Cassandra, look into Brisk, the system created by DataStax (which is also open source) that allows you to run Hadoop on top of your Cassandra cluster. This may not be exactly what you're looking for when asking this question... but it might give you the insights you're looking for. Hope this has been at least somewhat helpful. David

On Sun, Nov 20, 2011 at 1:06 PM, Dotan N. dip...@gmail.com wrote: Hi all, my question may be more philosophical than technically related to Cassandra, but please bear with me. Given that a young startup may not know its product fully at the early stages, but that it is definitely aiming at ~200M users, would Cassandra be the right way to go? That is, the requirement is for a large data store that can move swiftly with product changes and requirements. Given that in Cassandra one thinks hard about the queries and then builds a model to suit them best, I was thinking of this situation as problematic. So here are some questions: - would it be wiser to start with a more agile data store (such as MongoDB) and then progress to Cassandra when the product itself solidifies? - given that we start with Cassandra from the get-go, what is a common (and quick in terms of development) way or practice to change data and schemas as the product evolves? - is it even smart to start with Cassandra? Would only startups whose core business is big data start with it from the get-go? - how would you do map/reduce with Cassandra? How agile is that? (for example, can you run map/reduce _very_ frequently?) Thanks! -- Dotan, @jondot http://twitter.com/jondot

-- *David McNelis* Lead Software Engineer Agentis Energy www.agentisenergy.com c: 219.384.5143 *A Smart Grid technology company focused on helping consumers of energy control an often under-managed resource.*
Re: data agility
If your startup is bootstrapping, then Cassandra is sometimes too heavy to start with, i.e. it needs to be fed RAM... you're not going to seriously run it in less than 1 GB per node... and that level of RAM commitment can be too much while bootstrapping. If your startup has enough cash to pay for 3-5 recommended-spec (see wiki) nodes to be up 24/7, then Cassandra is a good fit...

A friend of mine is bootstrapping a startup and had to drop back to MySQL while he finds his pain points and customers... he knows he will end up jumping back to Cassandra when he gets enough customers (or a VC), but for now the running costs are too much to pay from his own pocket... note that the JDBC driver and CQL will make jumping back easy for him (as he still tests with C*... he just runs against MySQL at present - nuts, eh!)

- Stephen

--- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 20 Nov 2011 19:07, Dotan N. dip...@gmail.com wrote: Hi all, my question may be more philosophical than technically related to Cassandra, but please bear with me. Given that a young startup may not know its product fully at the early stages, but that it is definitely aiming at ~200M users, would Cassandra be the right way to go? That is, the requirement is for a large data store that can move swiftly with product changes and requirements. Given that in Cassandra one thinks hard about the queries and then builds a model to suit them best, I was thinking of this situation as problematic. So here are some questions: - would it be wiser to start with a more agile data store (such as MongoDB) and then progress to Cassandra when the product itself solidifies? - given that we start with Cassandra from the get-go, what is a common (and quick in terms of development) way or practice to change data and schemas as the product evolves? - is it even smart to start with Cassandra? Would only startups whose core business is big data start with it from the get-go? - how would you do map/reduce with Cassandra? How agile is that? (for example, can you run map/reduce _very_ frequently?) Thanks! -- Dotan, @jondot http://twitter.com/jondot
Re: data agility
Thanks David. Stephen: thanks for the tip; we can run a recommended configuration, so that wouldn't be an issue. I guess the focus of my questions is on complexity of development. After digesting David's answer, my follow-up questions would be:
- how would you process data in a Cassandra cluster, typically? Via one-off coded offline jobs?
- how easy is map/reduce on existing data? (I just looked at Brisk, but it may be unrelated; in any case, not much is written about it)
- how would you do analytics over a Cassandra cluster?
- given the common examples of time series, how would you recommend aggregating (sum, avg, facet) and providing statistics over the collected data? For example, if it were some kind of logs and you'd like to group by certain fields in them, or provide a histogram over them.
Thanks! -- Dotan, @jondot http://twitter.com/jondot

On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: If your startup is bootstrapping, then Cassandra is sometimes too heavy to start with, i.e. it needs to be fed RAM... you're not going to seriously run it in less than 1 GB per node... and that level of RAM commitment can be too much while bootstrapping. If your startup has enough cash to pay for 3-5 recommended-spec (see wiki) nodes to be up 24/7, then Cassandra is a good fit... A friend of mine is bootstrapping a startup and had to drop back to MySQL while he finds his pain points and customers... he knows he will end up jumping back to Cassandra when he gets enough customers (or a VC), but for now the running costs are too much to pay from his own pocket... note that the JDBC driver and CQL will make jumping back easy for him (as he still tests with C*... he just runs against MySQL at present - nuts, eh!) - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 20 Nov 2011 19:07, Dotan N. dip...@gmail.com wrote: Hi all, my question may be more philosophical than technically related to Cassandra, but please bear with me. Given that a young startup may not know its product fully at the early stages, but that it is definitely aiming at ~200M users, would Cassandra be the right way to go? That is, the requirement is for a large data store that can move swiftly with product changes and requirements. Given that in Cassandra one thinks hard about the queries and then builds a model to suit them best, I was thinking of this situation as problematic. So here are some questions: - would it be wiser to start with a more agile data store (such as MongoDB) and then progress to Cassandra when the product itself solidifies? - given that we start with Cassandra from the get-go, what is a common (and quick in terms of development) way or practice to change data and schemas as the product evolves? - is it even smart to start with Cassandra? Would only startups whose core business is big data start with it from the get-go? - how would you do map/reduce with Cassandra? How agile is that? (for example, can you run map/reduce _very_ frequently?) Thanks! -- Dotan, @jondot http://twitter.com/jondot
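For the aggregation question above, a minimal client-side sketch of sum/average/histogram over values that have already been read for one time window; at scale this is the kind of work a Hadoop/Brisk job or an offline task would do, and all names and parameters here are illustrative only.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TimeSeriesStats {
    /** values: numeric column values already fetched for one time window.
     *  histogramBucketWidth: width of each histogram bucket (hypothetical choice). */
    static void summarize(List<Long> values, long histogramBucketWidth) {
        long sum = 0;
        Map<Long, Integer> histogram = new TreeMap<>();
        for (long v : values) {
            sum += v;
            long bucket = (v / histogramBucketWidth) * histogramBucketWidth;
            histogram.merge(bucket, 1, Integer::sum);
        }
        double avg = values.isEmpty() ? 0.0 : (double) sum / values.size();
        System.out.printf("count=%d sum=%d avg=%.2f%n", values.size(), sum, avg);
        histogram.forEach((bucket, count) ->
                System.out.printf("  [%d..%d): %d%n", bucket, bucket + histogramBucketWidth, count));
    }
}
```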
Re: data agility
IMHO, you should start with something very simple - an RDBMS - and meanwhile get a handle on Cassandra or another NoSQL technology. Start out simple, but always be aware and conscious of the next thing you will have in your stack. Working with a new technology takes time if you are in the phase of prototyping something fast and geared towards a VC demo. In most cases, you won't need NoSQL for a while unless there is a very strong case. Thanks, Jahangir

On Nov 20, 2011 4:04 PM, Dotan N. dip...@gmail.com wrote: Thanks David. Stephen: thanks for the tip; we can run a recommended configuration, so that wouldn't be an issue. I guess the focus of my questions is on complexity of development. After digesting David's answer, my follow-up questions would be: - how would you process data in a Cassandra cluster, typically? Via one-off coded offline jobs? - how easy is map/reduce on existing data? (I just looked at Brisk, but it may be unrelated; in any case, not much is written about it) - how would you do analytics over a Cassandra cluster? - given the common examples of time series, how would you recommend aggregating (sum, avg, facet) and providing statistics over the collected data? For example, if it were some kind of logs and you'd like to group by certain fields in them, or provide a histogram over them. Thanks! -- Dotan, @jondot http://twitter.com/jondot

On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: If your startup is bootstrapping, then Cassandra is sometimes too heavy to start with, i.e. it needs to be fed RAM... you're not going to seriously run it in less than 1 GB per node... and that level of RAM commitment can be too much while bootstrapping. If your startup has enough cash to pay for 3-5 recommended-spec (see wiki) nodes to be up 24/7, then Cassandra is a good fit... A friend of mine is bootstrapping a startup and had to drop back to MySQL while he finds his pain points and customers... he knows he will end up jumping back to Cassandra when he gets enough customers (or a VC), but for now the running costs are too much to pay from his own pocket... note that the JDBC driver and CQL will make jumping back easy for him (as he still tests with C*... he just runs against MySQL at present - nuts, eh!) - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 20 Nov 2011 19:07, Dotan N. dip...@gmail.com wrote: Hi all, my question may be more philosophical than technically related to Cassandra, but please bear with me. Given that a young startup may not know its product fully at the early stages, but that it is definitely aiming at ~200M users, would Cassandra be the right way to go? That is, the requirement is for a large data store that can move swiftly with product changes and requirements. Given that in Cassandra one thinks hard about the queries and then builds a model to suit them best, I was thinking of this situation as problematic. So here are some questions: - would it be wiser to start with a more agile data store (such as MongoDB) and then progress to Cassandra when the product itself solidifies? - given that we start with Cassandra from the get-go, what is a common (and quick in terms of development) way or practice to change data and schemas as the product evolves? - is it even smart to start with Cassandra? Would only startups whose core business is big data start with it from the get-go? - how would you do map/reduce with Cassandra? How agile is that? (for example, can you run map/reduce _very_ frequently?) Thanks! -- Dotan, @jondot http://twitter.com/jondot
Re: read performance problem
There is something wrong with the system; your benchmark numbers are way off. How are you benchmarking? Are you using the included stress tool?

On Nov 19, 2011 8:58 PM, Kent Tong freemant2...@yahoo.com wrote: Hi, on my computer with 2 GB RAM and a Core 2 Duo E4600 CPU @ 2.40 GHz, I am testing the performance of Cassandra. The write performance is good: it can write a million records in 10 minutes. However, the query performance is poor, and it takes 10 minutes to read 10K records with sequential keys from 0 (about 100 QPS). This is far from the 3,xxx QPS figures found on the net. Cassandra decided to use 1 GB as the Java heap size, which seems to be fine, as at the end of the benchmark the swap was barely used (only 1 MB used). I understand that my computer may not be as powerful as those used in the other benchmarks, but it shouldn't be that far off (1:30), right? Any suggestion? Thanks in advance!
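One common cause of numbers like these is benchmarking reads one at a time from a single thread, so throughput is capped at roughly 1/latency; the stress tool mentioned above drives reads from many threads. A hedged sketch of doing the same in a hand-rolled benchmark, where readRow is a hypothetical placeholder for whatever client call actually performs the read:

```java
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

public class ParallelReadBench {
    /** Hypothetical client; readRow stands in for the real read call
     *  (Thrift get_slice, Hector, etc.). */
    interface Client { void readRow(String key); }

    static void bench(Client client, List<String> keys, int threads) throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        AtomicLong done = new AtomicLong();
        long start = System.nanoTime();
        for (String key : keys) {
            // Reads are issued concurrently instead of one sequential loop.
            pool.execute(() -> { client.readRow(key); done.incrementAndGet(); });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("%d reads in %.1f s = %.0f QPS%n", done.get(), seconds, done.get() / seconds);
    }
}
```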
Re: data agility
Jahangir, thanks! However, I've noted that we may very well need to scale to 200M users or entities within a short amount of time - say a year or two, and 10M within a few months. -- Dotan, @jondot http://twitter.com/jondot

On Sun, Nov 20, 2011 at 11:14 PM, Jahangir Mohammed md.jahangi...@gmail.com wrote: IMHO, you should start with something very simple - an RDBMS - and meanwhile get a handle on Cassandra or another NoSQL technology. Start out simple, but always be aware and conscious of the next thing you will have in your stack. Working with a new technology takes time if you are in the phase of prototyping something fast and geared towards a VC demo. In most cases, you won't need NoSQL for a while unless there is a very strong case. Thanks, Jahangir

On Nov 20, 2011 4:04 PM, Dotan N. dip...@gmail.com wrote: Thanks David. Stephen: thanks for the tip; we can run a recommended configuration, so that wouldn't be an issue. I guess the focus of my questions is on complexity of development. After digesting David's answer, my follow-up questions would be: - how would you process data in a Cassandra cluster, typically? Via one-off coded offline jobs? - how easy is map/reduce on existing data? (I just looked at Brisk, but it may be unrelated; in any case, not much is written about it) - how would you do analytics over a Cassandra cluster? - given the common examples of time series, how would you recommend aggregating (sum, avg, facet) and providing statistics over the collected data? For example, if it were some kind of logs and you'd like to group by certain fields in them, or provide a histogram over them. Thanks! -- Dotan, @jondot http://twitter.com/jondot

On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: If your startup is bootstrapping, then Cassandra is sometimes too heavy to start with, i.e. it needs to be fed RAM... you're not going to seriously run it in less than 1 GB per node... and that level of RAM commitment can be too much while bootstrapping. If your startup has enough cash to pay for 3-5 recommended-spec (see wiki) nodes to be up 24/7, then Cassandra is a good fit... A friend of mine is bootstrapping a startup and had to drop back to MySQL while he finds his pain points and customers... he knows he will end up jumping back to Cassandra when he gets enough customers (or a VC), but for now the running costs are too much to pay from his own pocket... note that the JDBC driver and CQL will make jumping back easy for him (as he still tests with C*... he just runs against MySQL at present - nuts, eh!) - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 20 Nov 2011 19:07, Dotan N. dip...@gmail.com wrote: Hi all, my question may be more philosophical than technically related to Cassandra, but please bear with me. Given that a young startup may not know its product fully at the early stages, but that it is definitely aiming at ~200M users, would Cassandra be the right way to go? That is, the requirement is for a large data store that can move swiftly with product changes and requirements. Given that in Cassandra one thinks hard about the queries and then builds a model to suit them best, I was thinking of this situation as problematic. So here are some questions: - would it be wiser to start with a more agile data store (such as MongoDB) and then progress to Cassandra when the product itself solidifies? - given that we start with Cassandra from the get-go, what is a common (and quick in terms of development) way or practice to change data and schemas as the product evolves? - is it even smart to start with Cassandra? Would only startups whose core business is big data start with it from the get-go? - how would you do map/reduce with Cassandra? How agile is that? (for example, can you run map/reduce _very_ frequently?) Thanks! -- Dotan, @jondot http://twitter.com/jondot
Re: data agility
Sounds like you need to figure out what your product is going to do and what technology will best fit those requirements. I know you're worried about being agile and all that, but scaling requires you to use the right tool for the job. Worry about new requirements when they rear their ugly head rather than about a dozen "what if" scenarios. You can scale MySQL and the like, as well as Cassandra, MongoDB, etc., to 10-200M users depending on what you're asking your datastore to do. You haven't really defined that at all, other than some comments about wanting to do some map/reduce jobs.

Really, what you should be doing is figuring out what kind of data you need to store and your needs - access patterns, availability, ACID compliance, etc. - and then figure out which technology is the best fit. There are tons of "Cassandra vs X" comparisons for every NoSQL DB in existence. Other than that, map/reduce on Cassandra is more job-based than useful for interactive queries, so if that is important then Cassandra probably isn't a good fit. You did mention time-series data too, and that's a sweet spot for Cassandra and not something I personally would put in a document-based datastore like MongoDB. Good luck. -Aaron

On Sun, Nov 20, 2011 at 1:24 PM, Dotan N. dip...@gmail.com wrote: Jahangir, thanks! However, I've noted that we may very well need to scale to 200M users or entities within a short amount of time - say a year or two, and 10M within a few months. -- Dotan, @jondot

On Sun, Nov 20, 2011 at 11:14 PM, Jahangir Mohammed md.jahangi...@gmail.com wrote: IMHO, you should start with something very simple - an RDBMS - and meanwhile get a handle on Cassandra or another NoSQL technology. Start out simple, but always be aware and conscious of the next thing you will have in your stack. Working with a new technology takes time if you are in the phase of prototyping something fast and geared towards a VC demo. In most cases, you won't need NoSQL for a while unless there is a very strong case. Thanks, Jahangir

On Nov 20, 2011 4:04 PM, Dotan N. dip...@gmail.com wrote: Thanks David. Stephen: thanks for the tip; we can run a recommended configuration, so that wouldn't be an issue. I guess the focus of my questions is on complexity of development. After digesting David's answer, my follow-up questions would be: - how would you process data in a Cassandra cluster, typically? Via one-off coded offline jobs? - how easy is map/reduce on existing data? (I just looked at Brisk, but it may be unrelated; in any case, not much is written about it) - how would you do analytics over a Cassandra cluster? - given the common examples of time series, how would you recommend aggregating (sum, avg, facet) and providing statistics over the collected data? For example, if it were some kind of logs and you'd like to group by certain fields in them, or provide a histogram over them. Thanks! -- Dotan, @jondot

On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: If your startup is bootstrapping, then Cassandra is sometimes too heavy to start with, i.e. it needs to be fed RAM... you're not going to seriously run it in less than 1 GB per node... and that level of RAM commitment can be too much while bootstrapping. If your startup has enough cash to pay for 3-5 recommended-spec (see wiki) nodes to be up 24/7, then Cassandra is a good fit... A friend of mine is bootstrapping a startup and had to drop back to MySQL while he finds his pain points and customers... he knows he will end up jumping back to Cassandra when he gets enough customers (or a VC), but for now the running costs are too much to pay from his own pocket... note that the JDBC driver and CQL will make jumping back easy for him (as he still tests with C*... he just runs against MySQL at present - nuts, eh!) - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 20 Nov 2011 19:07, Dotan N. dip...@gmail.com wrote: Hi all, my question may be more philosophical than technically related to Cassandra, but please bear with me. Given that a young startup may not know its product fully at the early stages, but that it is definitely aiming at ~200M users, would Cassandra be the right way to go? That is, the requirement is for a large data store that can move swiftly with product changes and requirements. Given that in Cassandra one thinks hard about the queries and then builds a model to suit them best, I was thinking of this situation as problematic. So here are some questions: - would it be wiser to start with a more agile data store (such as MongoDB) and then progress to Cassandra when the product itself solidifies? - given that we start with Cassandra from the get-go, what is a common (and quick in terms of development) way or practice to change data and schemas as the
Re: What sort of load do the tombstones create on the cluster?
Mostly, they are I/O- and CPU-intensive during major compaction. If Ganglia doesn't show anything suspicious there, then where is the performance loss - reads or writes?

On Nov 17, 2011 1:01 PM, Maxim Potekhin potek...@bnl.gov wrote: In view of my unpleasant discovery last week that deletions in Cassandra lead to a very real and serious performance loss, I'm working on a strategy for moving forward. If the tombstones do cause such a problem, where should I be looking for performance bottlenecks? Is it disk, CPU or something else? Thing is, I don't see anything outstanding in my Ganglia plots. TIA, Maxim
Re: data agility
Thanks Aaron. I kept this use-case-free so as to focus on the higher-level description; that might not have been a good idea. But generally I think I got a better intuition from the various answers, thanks! -- Dotan, @jondot http://twitter.com/jondot

On Sun, Nov 20, 2011 at 11:52 PM, Aaron Turner synfina...@gmail.com wrote: Sounds like you need to figure out what your product is going to do and what technology will best fit those requirements. I know you're worried about being agile and all that, but scaling requires you to use the right tool for the job. Worry about new requirements when they rear their ugly head rather than about a dozen "what if" scenarios. You can scale MySQL and the like, as well as Cassandra, MongoDB, etc., to 10-200M users depending on what you're asking your datastore to do. You haven't really defined that at all, other than some comments about wanting to do some map/reduce jobs. Really, what you should be doing is figuring out what kind of data you need to store and your needs - access patterns, availability, ACID compliance, etc. - and then figure out which technology is the best fit. There are tons of "Cassandra vs X" comparisons for every NoSQL DB in existence. Other than that, map/reduce on Cassandra is more job-based than useful for interactive queries, so if that is important then Cassandra probably isn't a good fit. You did mention time-series data too, and that's a sweet spot for Cassandra and not something I personally would put in a document-based datastore like MongoDB. Good luck. -Aaron

On Sun, Nov 20, 2011 at 1:24 PM, Dotan N. dip...@gmail.com wrote: Jahangir, thanks! However, I've noted that we may very well need to scale to 200M users or entities within a short amount of time - say a year or two, and 10M within a few months. -- Dotan, @jondot

On Sun, Nov 20, 2011 at 11:14 PM, Jahangir Mohammed md.jahangi...@gmail.com wrote: IMHO, you should start with something very simple - an RDBMS - and meanwhile get a handle on Cassandra or another NoSQL technology. Start out simple, but always be aware and conscious of the next thing you will have in your stack. Working with a new technology takes time if you are in the phase of prototyping something fast and geared towards a VC demo. In most cases, you won't need NoSQL for a while unless there is a very strong case. Thanks, Jahangir

On Nov 20, 2011 4:04 PM, Dotan N. dip...@gmail.com wrote: Thanks David. Stephen: thanks for the tip; we can run a recommended configuration, so that wouldn't be an issue. I guess the focus of my questions is on complexity of development. After digesting David's answer, my follow-up questions would be: - how would you process data in a Cassandra cluster, typically? Via one-off coded offline jobs? - how easy is map/reduce on existing data? (I just looked at Brisk, but it may be unrelated; in any case, not much is written about it) - how would you do analytics over a Cassandra cluster? - given the common examples of time series, how would you recommend aggregating (sum, avg, facet) and providing statistics over the collected data? For example, if it were some kind of logs and you'd like to group by certain fields in them, or provide a histogram over them. Thanks! -- Dotan, @jondot

On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: If your startup is bootstrapping, then Cassandra is sometimes too heavy to start with, i.e. it needs to be fed RAM... you're not going to seriously run it in less than 1 GB per node... and that level of RAM commitment can be too much while bootstrapping. If your startup has enough cash to pay for 3-5 recommended-spec (see wiki) nodes to be up 24/7, then Cassandra is a good fit... A friend of mine is bootstrapping a startup and had to drop back to MySQL while he finds his pain points and customers... he knows he will end up jumping back to Cassandra when he gets enough customers (or a VC), but for now the running costs are too much to pay from his own pocket... note that the JDBC driver and CQL will make jumping back easy for him (as he still tests with C*... he just runs against MySQL at present - nuts, eh!) - Stephen --- Sent from my Android phone, so random spelling mistakes, random nonsense words and other nonsense are a direct result of using Swype to type on the screen.

On 20 Nov 2011 19:07, Dotan N. dip...@gmail.com wrote: Hi all, my question may be more philosophical than technically related to Cassandra, but please bear with me. Given that a young startup may not know its product fully at the early stages, but that it is definitely aiming at ~200M users, would Cassandra be the right way to go? That is, the requirement is for a large data store that can move swiftly with product changes and requirements. Given that in Cassandra one thinks hard about the
Re: data agility
For 99% of current applications requiring a persistent datastore, Oracle, PostgreSQL and MySQL variants will suffice. For the other 1% of applications, consider C* if: (a) you have given up on distributed transactions (ACID-ly, but not BASE-ically); (b) you are wondering about this new-fangled horizontal-scalability buzzword and wonder why disks cannot spin faster and faster; (c) you need/want to design optimized query paths for your data, given (a) and (b).

Rewording a, b and c:

a.1 Cassandra provides best-in-class, low-latency asynchronous replication, with battle-hardened mechanisms to manage eventual consistency in an inherently disordered (entropy-filled) world... ACID versus BASE transactions.

b.1 Cassandra's write path is completely optimized. It will write as fast as the disk will allow; but the most important feature is that if you need to write faster than an individual server will allow, you add more servers. The data-locality principle, inexorably faster computations and anti-entropy services enable you to scale in the cloud.

c.1 Writing is easy, but then you actually need to find the data, and do it at scale, speed-wise. The columnar nature of Cassandra, the design of its internals and support at the API level (composite indexes) make fast querying possible.

Milind

On Sun, Nov 20, 2011 at 2:19 PM, Dotan N. dip...@gmail.com wrote: Thanks Aaron. I kept this use-case-free so as to focus on the higher-level description; that might not have been a good idea. But generally I think I got a better intuition from the various answers, thanks! -- Dotan, @jondot http://twitter.com/jondot

On Sun, Nov 20, 2011 at 11:52 PM, Aaron Turner synfina...@gmail.com wrote: Sounds like you need to figure out what your product is going to do and what technology will best fit those requirements. I know you're worried about being agile and all that, but scaling requires you to use the right tool for the job. Worry about new requirements when they rear their ugly head rather than about a dozen "what if" scenarios. You can scale MySQL and the like, as well as Cassandra, MongoDB, etc., to 10-200M users depending on what you're asking your datastore to do. You haven't really defined that at all, other than some comments about wanting to do some map/reduce jobs. Really, what you should be doing is figuring out what kind of data you need to store and your needs - access patterns, availability, ACID compliance, etc. - and then figure out which technology is the best fit. There are tons of "Cassandra vs X" comparisons for every NoSQL DB in existence. Other than that, map/reduce on Cassandra is more job-based than useful for interactive queries, so if that is important then Cassandra probably isn't a good fit. You did mention time-series data too, and that's a sweet spot for Cassandra and not something I personally would put in a document-based datastore like MongoDB. Good luck. -Aaron

On Sun, Nov 20, 2011 at 1:24 PM, Dotan N. dip...@gmail.com wrote: Jahangir, thanks! However, I've noted that we may very well need to scale to 200M users or entities within a short amount of time - say a year or two, and 10M within a few months. -- Dotan, @jondot

On Sun, Nov 20, 2011 at 11:14 PM, Jahangir Mohammed md.jahangi...@gmail.com wrote: IMHO, you should start with something very simple - an RDBMS - and meanwhile get a handle on Cassandra or another NoSQL technology. Start out simple, but always be aware and conscious of the next thing you will have in your stack. Working with a new technology takes time if you are in the phase of prototyping something fast and geared towards a VC demo. In most cases, you won't need NoSQL for a while unless there is a very strong case. Thanks, Jahangir

On Nov 20, 2011 4:04 PM, Dotan N. dip...@gmail.com wrote: Thanks David. Stephen: thanks for the tip; we can run a recommended configuration, so that wouldn't be an issue. I guess the focus of my questions is on complexity of development. After digesting David's answer, my follow-up questions would be: - how would you process data in a Cassandra cluster, typically? Via one-off coded offline jobs? - how easy is map/reduce on existing data? (I just looked at Brisk, but it may be unrelated; in any case, not much is written about it) - how would you do analytics over a Cassandra cluster? - given the common examples of time series, how would you recommend aggregating (sum, avg, facet) and providing statistics over the collected data? For example, if it were some kind of logs and you'd like to group by certain fields in them, or provide a histogram over them. Thanks! -- Dotan, @jondot

On Sun, Nov 20, 2011 at 10:32 PM, Stephen Connolly stephen.alan.conno...@gmail.com wrote: If your startup is bootstrapping, then Cassandra is sometimes too heavy to start with, i.e. it needs to be
Re: Network traffic patterns
I'm using BOP (the ByteOrderedPartitioner).

On 20 Nov 2011 13:09, Boris Yen yulin...@gmail.com wrote: I am just curious: which partitioner are you using?

On Thu, Nov 17, 2011 at 4:30 PM, Philippe watche...@gmail.com wrote: Hi Todd. Yes, all equal hardware. Nearly no CPU usage and no memory issues. Repairs are running in tens of minutes, so I don't understand why replication would be backed up. Any other ideas?

On 17 Nov 2011 at 02:33, Todd Burruss bburr...@expedia.com wrote: Are all of your machines equal hardware? Since those machines are sending data somewhere, maybe they are behind in replicating and are continuously catching up? Use a tool like tcpdump to find out where the data is going.

From: Philippe watche...@gmail.com Reply-To: user@cassandra.apache.org Date: Tue, 15 Nov 2011 13:22:38 -0800 To: user@cassandra.apache.org Subject: Re: Network traffic patterns

Sorry about the previous message, I've enabled keyboard shortcuts on Gmail... *sigh*...

Hello, I'm trying to understand the network usage I am seeing in my cluster; can anyone shed some light? It's an RF=3, 12-node, Cassandra 0.8.6 cluster. Repair is performed on each node once a week, on a rolling schedule. The nodes are p13, p14, p15 ... p24 and are consecutive in that order on the ring. Each node is only a Cassandra database. I am hitting the cluster from another server (p4). p4 is doing this with 20 threads in parallel:
1. read a lot of data (some columns for hundreds to tens of thousands of keys, split into 512-key multigets)
2. process the data
3. write back a byte array to Cassandra (average size is 400 bytes)
4. go back to 1

According to my munin graphs, network usage is about as follows. I am not surprised at the bias towards p13-p15, as p4 is getting/storing data mainly for keys located on one of those nodes.
- p4: 1.5 Mb/s in and out
- p13-p15: 15 Mb/s in and 80 Mb/s out
- p16-p24: 45 Mb/s in and 5 Mb/s out

What I don't understand is why p4 is only seeing 1.5 Mb/s while I see 80 Mb/s out of p13-p15. The way I understand this:
- p4 makes a multiget to the cluster, electing to use any node in the cluster (IN traffic describing the query)
- the coordinator node replays the query on all 3 replicas (so 3 servers each get the IN traffic, mostly p13-p15)
- each server replies to the coordinator
- the coordinator chooses the matching values and sends the data back to p4
So if p13-p15 are outputting 80 Mb/s, why am I not seeing 80 Mb/s coming into p4, which is on the receiving end? Thanks

2011/11/15 Philippe watche...@gmail.com: Hello, I'm trying to understand the network usage I am seeing in my cluster; can anyone shed some light? It's an RF=3, 12-node, Cassandra 0.8.6 cluster. The nodes are p13, p14, p15 ... p24 and are consecutive in that order on the ring. Each node is only a Cassandra database. I am hitting the cluster from another server (p4). The pattern on p4 is to:
1. read a lot of data (some columns for hundreds to tens of thousands of keys, split into 512-key multigets)
2. process the data
3. write back a byte array to Cassandra (average size is 400 bytes)
p4 reads as
Re: Data Model Design for Login Service
I will follow exactly this solution - thanks :)

On Fri, Nov 18, 2011 at 9:53 PM, David Jeske dav...@gmail.com wrote: On Thu, Nov 17, 2011 at 1:08 PM, Maciej Miklas mac.mik...@googlemail.com wrote: A) Skinny rows
- row key contains the login name - this is the main search criterion
- login data is replicated - each possible login is stored as a single row which contains all user data
- 10 logins for a single customer create 10 rows, where each row has a different key and the same content

To me this seems reasonable. Remember, because of your replication of the data values you will want a quick way to find all the logins for a given ID, so you will also want to store a separate dataset like:

1122 { alfred.tes...@xyz.de = 1, alf...@aad.de = 1 } (where the login is a column key)

When you do an update, you'll need to fetch the entire row for the user ID, and then update all copies of the data. This can create problems if the data is out of sync (which it will be at certain times because of eventual consistency, and might be if something bad happens).

...The other option, of course, is to make a login-name indirection. You would have only one copy of the user data, stored by ID, and then you would store a separate mapping from login name to ID. Of course this would require two round trips to get the user information from the login, which is something I know you said you didn't want to do.
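For illustration, a rough sketch of the indirection variant described at the end - one canonical user row keyed by ID plus a separate login-name-to-ID mapping - with the two round trips made explicit. The store interface and method names are hypothetical, not any particular client API.

```java
import java.util.Optional;

public class LoginLookup {
    /** Hypothetical accessors standing in for two column families:
     *  login_to_id (login name -> user id) and users (user id -> user data). */
    interface UserStore {
        Optional<String> findUserIdByLogin(String login);  // round trip 1
        Optional<UserData> findUserById(String userId);    // round trip 2
    }

    /** Two reads per lookup, but only one canonical copy of the user data to keep in sync. */
    static Optional<UserData> lookup(UserStore store, String login) {
        return store.findUserIdByLogin(login)
                    .flatMap(store::findUserById);
    }

    static final class UserData { /* user attributes */ }
}
```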