Re: Why isn't there a separate JVM per table?
I agree with Jon. The actor-based model would be the logical approach to getting more "efficient." Until then, fault tolerance has to be built into the driver: contact another node if a request fails mid-operation, then reconcile the commitlog later. I've seen many people combine an external queue to deal with the GC issues, adding yet another layer of asynchronicity. (If it's not a word, it is now.) Even in systems like SQL Server there are internal queues that get locked up under memory, storage, or CPU pressure. It's not a GC pause, but it may as well be. Even with all the tweaking, the only way to get beyond this is distributed, asynchronous systems that are self-healing.

--
Rahul Singh
rahul.si...@anant.us
Anant Corporation

On Feb 23, 2018, 4:34 AM -0500, Brian Hess wrote:
> Something folks haven't raised, but would be another impediment here, is that in Cassandra if you submit a batch (logged or unlogged) for two tables in the same keyspace with the same partition, then Cassandra collapses them into the same Mutation and the two INSERTs are processed atomically. There are a few (maybe more than a few) things that take advantage of this fact.
>
> If you move each table to its own JVM then you cannot really achieve this atomicity. So, at most, you would want to consider a JVM per keyspace (or consider touching a lot of code, or changing a pretty fundamental/deep contract in Cassandra).
>
> Brian

-
To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
For additional commands, e-mail: dev-h...@cassandra.apache.org
Re: Why isn't there a separate JVM per table?
Something folks haven't raised, but would be another impediment here, is that in Cassandra if you submit a batch (logged or unlogged) for two tables in the same keyspace with the same partition, then Cassandra collapses them into the same Mutation and the two INSERTs are processed atomically. There are a few (maybe more than a few) things that take advantage of this fact.

If you move each table to its own JVM then you cannot really achieve this atomicity. So, at most, you would want to consider a JVM per keyspace (or consider touching a lot of code, or changing a pretty fundamental/deep contract in Cassandra).

Brian

Sent from my iPhone

> On Feb 22, 2018, at 7:10 PM, J. D. Jordan wrote:
>
> I would be careful with anything per-table for memory sizing. We used to have many caches and things that could be tuned per table, but they have all since changed to being per node, as it was a real PITA to get them right. Having to do per-table heap/GC/memtable/cache tuning just sounds like a usability nightmare.
>
> -Jeremiah
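Brian's same-partition collapse behavior can be illustrated with a toy model in plain Java. This is not Cassandra's actual Mutation code; `BatchCollapseSketch`, `RowUpdate`, and the string partition keys are invented for illustration. The point is that two INSERTs targeting different tables but the same (keyspace, partition key) group under one key, so they can be applied as a single unit:

```java
import java.util.*;

// Toy sketch: batch entries keyed by (keyspace, partition key) collapse
// into one "mutation", regardless of which table in the keyspace they hit.
public class BatchCollapseSketch {

    record RowUpdate(String table, String column, String value) {}

    // Group batch entries by keyspace + partition key, mimicking how a
    // coordinator would merge them into one mutation per partition.
    static Map<String, List<RowUpdate>> collapse(
            List<Map.Entry<String, RowUpdate>> batch) {
        Map<String, List<RowUpdate>> mutations = new LinkedHashMap<>();
        for (var entry : batch) {
            mutations.computeIfAbsent(entry.getKey(), k -> new ArrayList<>())
                     .add(entry.getValue());
        }
        return mutations;
    }

    public static void main(String[] args) {
        // Two INSERTs to two different tables, same keyspace ("ks") and
        // same partition key ("user42"): they collapse into ONE mutation.
        var batch = List.of(
            Map.entry("ks/user42", new RowUpdate("users", "name", "Ada")),
            Map.entry("ks/user42", new RowUpdate("logins", "last", "2018-02-22")));
        System.out.println(collapse(batch).size()); // prints 1
    }
}
```

With one JVM per table, the two updates would live in separate heaps and the single-mutation grouping (and its atomicity) would be lost, which is Brian's objection.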
Re: Why isn't there a separate JVM per table?
There's an incredible amount of work that would need to be done in order to make any of this happen. Basically a full rewrite of the entire codebase. Years of effort. The codebase would have to move to a shared-nothing, actor- and message-based communication mechanism before any of this is possible. Fun in theory, but considering that removing singletons has been a multi-year, many-failure effort, I suspect we might need 10 years to refactor Cassandra to use multiple JVMs. By then maybe we'll have a pauseless / low-pause collector and it won't matter.

On Thu, Feb 22, 2018 at 3:59 PM kurt greaves wrote:
> Compaction in its own JVM makes sense. At the table level I'm not so sure. There have to be some serious overheads from running that many JVMs. [...]
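The shared-nothing, message-based structure described above can be sketched in a few lines of plain Java. This is an illustrative toy, not a proposal for Cassandra's internals; `CounterActor` is a made-up name. Each "actor" owns its state and serializes all access through a single-threaded mailbox, so callers communicate only by messages and never touch shared state:

```java
import java.util.concurrent.*;

// Minimal shared-nothing actor sketch: state is owned by one actor and
// mutated only on its own single thread, so no locks are needed.
public class ActorSketch {

    static class CounterActor {
        private long count = 0;                       // owned, never shared
        private final ExecutorService mailbox =
            Executors.newSingleThreadExecutor();      // one thread = serialized access

        void send(long delta) {                       // fire-and-forget message
            mailbox.execute(() -> count += delta);
        }

        long ask() throws Exception {                 // request/response message
            return mailbox.submit(() -> count).get();
        }

        void shutdown() { mailbox.shutdown(); }
    }

    public static void main(String[] args) throws Exception {
        CounterActor actor = new CounterActor();
        for (int i = 0; i < 1000; i++) actor.send(1); // any thread may send safely
        System.out.println(actor.ask());              // prints 1000
        actor.shutdown();
    }
}
```

Restructuring compaction, flush, and gossip around mailboxes like this is what would let them move into separate JVMs later, since the messages could cross a process boundary instead of a queue.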
Re: Why isn't there a separate JVM per table?
I would be careful with anything per-table for memory sizing. We used to have many caches and things that could be tuned per table, but they have all since changed to being per node, as it was a real PITA to get them right. Having to do per-table heap/GC/memtable/cache tuning just sounds like a usability nightmare.

-Jeremiah

> On Feb 22, 2018, at 6:59 PM, kurt greaves wrote:
>
> If we did it at the table level we would inevitably have to make each individual table incredibly tunable, which would be a bit tedious IMO. There's no way for us to smartly decide how much heap/memtable space/etc. each table should use (not without some decent AI, anyway).
Re: Why isn't there a separate JVM per table?
> ... compaction on its own jvm was also something I was thinking about, but
> then I realized even more JVM sharding could be done at the table level.

Compaction in its own JVM makes sense. At the table level I'm not so sure. There have to be some serious overheads from running that many JVMs. Keyspace might be reasonable, purely to isolate bad tables, but for the most part I'd think isolating every table isn't that beneficial and is pretty complicated. In most cases people just fix their modelling so that they don't generate large amounts of GC, and hopefully test enough so they know how it will behave in production.

If we did it at the table level we would inevitably have to make each individual table incredibly tunable, which would be a bit tedious IMO. There's no way for us to smartly decide how much heap/memtable space/etc. each table should use (not without some decent AI, anyway).
Re: Why isn't there a separate JVM per table?
Agree that any first efforts around compaction should go to profiling. Probably some low-hanging fruit there.

On Fri, Feb 23, 2018 at 11:55 AM, Jeff Jirsa wrote:
> Bloom filters are off-heap.
>
> To be honest, there may come a time when it makes sense to move compaction into its own JVM, but it would be FAR less effort to just profile what exists now and fix the problems.
Re: Why isn't there a separate JVM per table?
Alternative: JVM per vnode.

On Thu, Feb 22, 2018 at 4:52 PM, Carl Mueller wrote:
> Bloom filters... nevermind
Re: Why isn't there a separate JVM per table?
Bloom filters are off-heap.

To be honest, there may come a time when it makes sense to move compaction into its own JVM, but it would be FAR less effort to just profile what exists now and fix the problems.

On Thu, Feb 22, 2018 at 2:52 PM, Carl Mueller wrote:
> Bloom filters... nevermind
>
> On Thu, Feb 22, 2018 at 4:48 PM, Carl Mueller wrote:
>> Is the current reason for a large starting heap due to the memtable?
Re: Why isn't there a separate JVM per table?
Bloom filters... nevermind

On Thu, Feb 22, 2018 at 4:48 PM, Carl Mueller wrote:
> Is the current reason for a large starting heap due to the memtable?
Re: Why isn't there a separate JVM per table?
Is the current reason for a large starting heap due to the memtable?

On Thu, Feb 22, 2018 at 4:44 PM, Carl Mueller wrote:
> ... compaction on its own jvm was also something I was thinking about,
> but then I realized even more JVM sharding could be done at the table level.
Re: Why isn't there a separate JVM per table?
... compaction on its own jvm was also something I was thinking about, but then I realized even more JVM sharding could be done at the table level.

On Thu, Feb 22, 2018 at 4:09 PM, Jon Haddad wrote:
> Yeah, I'm in the compaction-in-its-own-JVM camp, in an ideal world where we're isolating crazy GC-churning parts of the DB. It would mean reworking how tasks are created and removal of all shared state in favor of messaging + a smarter manager, which IMO would be a good idea regardless. [...]
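The liveness check raised earlier in the thread (a separate gossip process recognizing when the storage process has died, versus when it is merely stuck in a long stop-the-world GC) could look roughly like this. A hedged sketch: the class, the enum, and the threshold are hypothetical, and the two inputs would come from something like `ProcessHandle.isAlive()` plus the age of a heartbeat the storage JVM periodically writes:

```java
// Sketch: classify storage-JVM health from process liveness + heartbeat age.
// A live process that has gone silent is likely in a STW GC, not dead.
public class StorageLiveness {

    enum State { HEALTHY, SUSPECT_GC, DEAD }

    static State classify(boolean processAlive, long msSinceHeartbeat,
                          long heartbeatTimeoutMs) {
        if (!processAlive) return State.DEAD;          // process gone: truly dead
        if (msSinceHeartbeat > heartbeatTimeoutMs)
            return State.SUSPECT_GC;                   // alive but silent: likely STW GC
        return State.HEALTHY;
    }

    public static void main(String[] args) {
        System.out.println(classify(true, 50, 1000));    // prints HEALTHY
        System.out.println(classify(true, 5000, 1000));  // prints SUSPECT_GC
        System.out.println(classify(false, 50, 1000));   // prints DEAD
    }
}
```

The subtlety is the SUSPECT_GC case: the gossip process should keep advertising the node as up there, which is exactly the false-positive "node marked dead during GC" problem this split is meant to avoid.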
Re: Why isn't there a separate JVM per table?
Yeah, I'm in the compaction-in-its-own-JVM camp, in an ideal world where we're
isolating the crazy GC-churning parts of the DB. It would mean reworking how
tasks are created and removing all shared state in favor of messaging + a
smarter manager, which imo would be a good idea regardless.

It might be a better use of time (especially for 4.0) to do some GC
performance profiling and cut down on the allocations, since that doesn't
involve a massive effort.

I've been meaning to do a little benchmarking and profiling for a while now,
and it seems like a few others have the same inclination as well; maybe now is
a good time to coordinate that. A nice perf bump for 4.0 would be very
rewarding.

Jon

> On Feb 22, 2018, at 2:00 PM, Nate McCall wrote:
>
> I've heard a couple of folks pontificate on compaction in its own
> process as well, given it has such a high impact on GC. Not sure about
> the value of individual tables. Interesting idea though.
>
> On Fri, Feb 23, 2018 at 10:45 AM, Gary Dusbabek wrote:
>> I've given it some thought in the past. In the end, I usually talk myself
>> out of it because I think it increases the surface area for failure. That
>> is, managing N processes is more difficult than managing one process. But
>> if the additional failure modes are addressed, there are some interesting
>> possibilities.
>>
>> For example, having gossip in its own process would decrease the odds that
>> a node is marked dead because STW GC is happening in the storage JVM. On
>> the flip side, you'd need checks to make sure that the gossip process can
>> recognize when the storage process has died vs. just running a long GC.
>>
>> I don't know that I'd go so far as to have separate processes for
>> keyspaces, etc.
>>
>> There is probably some interesting work that could be done to support the
>> orgs who run multiple cassandra instances on the same node (multiple
>> gossipers in that case is at least a little wasteful).
>>
>> I've also played around with using domain sockets for IPC inside of
>> cassandra. I never ran a proper benchmark, but there were some throughput
>> advantages to this approach.
>>
>> Cheers,
>>
>> Gary.
>>
>> On Thu, Feb 22, 2018 at 8:39 PM, Carl Mueller wrote:
>>
>>> GC pauses may have been improved in newer releases, since we are on
>>> 2.1.x, but I was wondering why cassandra uses one jvm for all tables
>>> and keyspaces, intermingling the heap for on-JVM objects.
>>>
>>> ... so why doesn't cassandra spin off a jvm per table, so each jvm can
>>> be tuned per table, GC tuned per table, and GC impacts in one table
>>> don't affect other tables? It would probably increase the number of
>>> endpoints if we avoid having an overarching query router.
Re: Why isn't there a separate JVM per table?
I've heard a couple of folks pontificate on compaction in its own process as
well, given it has such a high impact on GC. Not sure about the value of
individual tables. Interesting idea though.

On Fri, Feb 23, 2018 at 10:45 AM, Gary Dusbabek wrote:
> I've given it some thought in the past. In the end, I usually talk myself
> out of it because I think it increases the surface area for failure. That
> is, managing N processes is more difficult than managing one process. But
> if the additional failure modes are addressed, there are some interesting
> possibilities.
>
> For example, having gossip in its own process would decrease the odds that
> a node is marked dead because STW GC is happening in the storage JVM. On
> the flip side, you'd need checks to make sure that the gossip process can
> recognize when the storage process has died vs. just running a long GC.
>
> I don't know that I'd go so far as to have separate processes for
> keyspaces, etc.
>
> There is probably some interesting work that could be done to support the
> orgs who run multiple cassandra instances on the same node (multiple
> gossipers in that case is at least a little wasteful).
>
> I've also played around with using domain sockets for IPC inside of
> cassandra. I never ran a proper benchmark, but there were some throughput
> advantages to this approach.
>
> Cheers,
>
> Gary.
>
> On Thu, Feb 22, 2018 at 8:39 PM, Carl Mueller wrote:
>
>> GC pauses may have been improved in newer releases, since we are on
>> 2.1.x, but I was wondering why cassandra uses one jvm for all tables
>> and keyspaces, intermingling the heap for on-JVM objects.
>>
>> ... so why doesn't cassandra spin off a jvm per table, so each jvm can
>> be tuned per table, GC tuned per table, and GC impacts in one table
>> don't affect other tables? It would probably increase the number of
>> endpoints if we avoid having an overarching query router.
Re: Why isn't there a separate JVM per table?
I've given it some thought in the past. In the end, I usually talk myself out
of it because I think it increases the surface area for failure. That is,
managing N processes is more difficult than managing one process. But if the
additional failure modes are addressed, there are some interesting
possibilities.

For example, having gossip in its own process would decrease the odds that a
node is marked dead because STW GC is happening in the storage JVM. On the
flip side, you'd need checks to make sure that the gossip process can
recognize when the storage process has died vs. just running a long GC.

I don't know that I'd go so far as to have separate processes for keyspaces,
etc.

There is probably some interesting work that could be done to support the orgs
who run multiple cassandra instances on the same node (multiple gossipers in
that case is at least a little wasteful).

I've also played around with using domain sockets for IPC inside of cassandra.
I never ran a proper benchmark, but there were some throughput advantages to
this approach.

Cheers,

Gary.

On Thu, Feb 22, 2018 at 8:39 PM, Carl Mueller wrote:

> GC pauses may have been improved in newer releases, since we are on 2.1.x,
> but I was wondering why cassandra uses one jvm for all tables and
> keyspaces, intermingling the heap for on-JVM objects.
>
> ... so why doesn't cassandra spin off a jvm per table, so each jvm can be
> tuned per table, GC tuned per table, and GC impacts in one table don't
> affect other tables? It would probably increase the number of endpoints
> if we avoid having an overarching query router.
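To make Gary's domain-socket remark concrete: since Java 16, `SocketChannel`/`ServerSocketChannel` support `StandardProtocolFamily.UNIX` with `UnixDomainSocketAddress`, so two co-located processes can exchange messages without going through the TCP stack. The echo sketch below is illustrative only (a single process playing both sides over a temp-file socket), not anything Cassandra actually ships, and assumes a Unix-like OS and JDK 16+:

```java
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.ServerSocketChannel;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class DomainSocketEcho {

    /** Send one small message over a Unix domain socket and return the echoed reply. */
    static String roundTrip(String msg) throws Exception {
        Path sock = Files.createTempDirectory("ipc-demo").resolve("demo.sock");
        UnixDomainSocketAddress addr = UnixDomainSocketAddress.of(sock);
        try (ServerSocketChannel server = ServerSocketChannel.open(StandardProtocolFamily.UNIX)) {
            server.bind(addr);   // creates the socket file on disk

            // "Server" side: stands in for what a storage process might expose.
            Thread echo = new Thread(() -> {
                try (SocketChannel peer = server.accept()) {
                    ByteBuffer buf = ByteBuffer.allocate(256);
                    peer.read(buf);
                    buf.flip();
                    peer.write(buf);  // echo the request back
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            });
            echo.start();

            // "Client" side: stands in for a sibling process (e.g. gossip) asking for status.
            try (SocketChannel client = SocketChannel.open(StandardProtocolFamily.UNIX)) {
                client.connect(addr);
                client.write(ByteBuffer.wrap(msg.getBytes(StandardCharsets.UTF_8)));
                ByteBuffer reply = ByteBuffer.allocate(256);
                client.read(reply);
                reply.flip();
                echo.join();
                return StandardCharsets.UTF_8.decode(reply).toString();
            }
        } finally {
            Files.deleteIfExists(sock);
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip("PING"));
    }
}
```

For a one-shot 4-byte message the single read/write pair is enough; a real IPC protocol would of course need framing and partial-read handling on top of this.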
Re: Why isn't there a separate JVM per table?
It's an interesting idea. I'd wonder how much overhead you'd end up with from
message parsing, and whether it would negate any potential GC wins. Rick
Branson had played around a bunch with running storage nodes and doubling
down on the old "fat client" model. If you had 1k tables (yes, it barely
works, but we don't explicitly prevent it) you can't really run that many
JVM processes on a single box.

> On Feb 22, 2018, at 12:39 PM, Carl Mueller wrote:
>
> GC pauses may have been improved in newer releases, since we are on 2.1.x,
> but I was wondering why cassandra uses one jvm for all tables and
> keyspaces, intermingling the heap for on-JVM objects.
>
> ... so why doesn't cassandra spin off a jvm per table, so each jvm can be
> tuned per table, GC tuned per table, and GC impacts in one table don't
> affect other tables? It would probably increase the number of endpoints
> if we avoid having an overarching query router.
Why isn't there a separate JVM per table?
GC pauses may have been improved in newer releases, since we are on 2.1.x, but
I was wondering why cassandra uses one jvm for all tables and keyspaces,
intermingling the heap for on-JVM objects.

... so why doesn't cassandra spin off a jvm per table, so each jvm can be
tuned per table, GC tuned per table, and GC impacts in one table don't affect
other tables? It would probably increase the number of endpoints if we avoid
having an overarching query router.
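To show what the per-table tuning in the original question could look like if it existed: each table's process would get its own heap size and collector flags on its command line. This launcher is entirely hypothetical; the `cassandra.table` system property and the `org.example.TableStorageDaemon` main class are invented for illustration, and nothing in Cassandra works this way:

```java
import java.util.List;
import java.util.Map;

/** Hypothetical launcher sketching per-table JVM tuning via separate processes. */
public class PerTableJvmLauncher {

    /** Build a java command line with table-specific heap and GC flags. */
    static List<String> command(String table, Map<String, String> tuning) {
        return List.of(
            "java",
            "-Xms" + tuning.getOrDefault("heap", "512m"),
            "-Xmx" + tuning.getOrDefault("heap", "512m"),
            "-XX:+Use" + tuning.getOrDefault("gc", "G1") + "GC",
            "-Dcassandra.table=" + table,       // hypothetical system property
            "org.example.TableStorageDaemon"    // hypothetical per-table daemon class
        );
    }

    public static void main(String[] args) {
        // A write-heavy table gets a bigger heap than the default:
        List<String> cmd = command("ks1.events", Map.of("heap", "4g", "gc", "G1"));
        System.out.println(String.join(" ", cmd));
        // Actually launching would just be:
        // new ProcessBuilder(cmd).inheritIO().start();  // not run in this sketch
    }
}
```

The sketch also makes the thread's cost argument visible: every entry in this list is per-process state the operator now has to get right N times, which is exactly the tuning burden the replies above object to.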