Re: Cassandra 2.1.0 Crashes the JVM with OOM with heaps of memory free
What's your vm.max_map_count setting?

Best Regards, Liang

From: Leon Oosterwijk leon.oosterw...@macquarie.com
Sent: 2014-12-19 11:55
To: user@cassandra.apache.org
Subject: Cassandra 2.1.0 Crashes the JVM with OOM with heaps of memory free

All, We have a Cassandra cluster which seems to be struggling a bit. I have one node which crashes continually, and others which crash sporadically. When they crash, it's with a "JVM couldn't allocate memory" error, even though there's heaps of memory available. I suspect it's because of one table which is very big (500 GB) and has on the order of 500K-700K files in its directory. When I deleted the directory contents on the crashing node and ran a repair, the nodes around this node crashed while streaming the data. Here are the relevant bits from the crash file and environment. Any help would be appreciated.

# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 12288 bytes for committing reserved memory.
# Possible reasons:
#   The system is out of physical RAM or swap space
#   In 32 bit mode, the process size limit was hit
# Possible solutions:
#   Reduce memory load on the system
#   Increase physical memory or swap space
#   Check if swap backing store is full
#   Use 64 bit Java on a 64 bit OS
#   Decrease Java heap size (-Xmx/-Xms)
#   Decrease number of Java threads
#   Decrease Java thread stack sizes (-Xss)
#   Set larger code cache with -XX:ReservedCodeCacheSize=
# This output file may be truncated or incomplete.
#
# Out of Memory Error (os_linux.cpp:2671), pid=1104, tid=139950342317824
#
# JRE version: Java(TM) SE Runtime Environment (8.0_20-b26) (build 1.8.0_20-b26)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.20-b23 mixed mode linux-amd64 compressed oops)
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again

--- T H R E A D ---
Current thread (0x7f4acabb1800): JavaThread Thread-13 [_thread_new, id=19171, stack(0x7f48ba6ca000,0x7f48ba70b000)]
Stack: [0x7f48ba6ca000,0x7f48ba70b000], sp=0x7f48ba709a50, free space=254k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V [libjvm.so+0xa76cea] VMError::report_and_die()+0x2ca
V [libjvm.so+0x4e52fb] report_vm_out_of_memory(char const*, int, unsigned long, VMErrorType, char const*)+0x8b
V [libjvm.so+0x8e4ec3] os::Linux::commit_memory_impl(char*, unsigned long, bool)+0x103
V [libjvm.so+0x8e4f8c] os::pd_commit_memory(char*, unsigned long, bool)+0xc
V [libjvm.so+0x8dce4a] os::commit_memory(char*, unsigned long, bool)+0x2a
V [libjvm.so+0x8e33af] os::pd_create_stack_guard_pages(char*, unsigned long)+0x7f
V [libjvm.so+0xa21bde] JavaThread::create_stack_guard_pages()+0x5e
V [libjvm.so+0xa29954] JavaThread::run()+0x34
V [libjvm.so+0x8e75f8] java_start(Thread*)+0x108
C [libpthread.so.0+0x79d1]

Memory: 4k page, physical 131988232k(694332k free), swap 37748728k(37748728k free)
vm_info: Java HotSpot(TM) 64-Bit Server VM (25.20-b23) for linux-amd64 JRE (1.8.0_20-b26), built on Jul 30 2014 13:13:52 by java_re with gcc 4.3.0 20080428 (Red Hat 4.3.0-8)
time: Fri Dec 19 14:37:29 2014
elapsed time: 2303 seconds (0d 0h 38m 23s)
OS: Red Hat Enterprise Linux Server release 6.5 (Santiago)
uname: Linux 2.6.32-431.5.1.el6.x86_64 #1 SMP Fri Jan 10 14:46:43 EST 2014 x86_64
libc: glibc 2.12 NPTL 2.12
rlimit: STACK 10240k, CORE 0k, NPROC 8192, NOFILE 65536, AS infinity
load average: 4.18 4.79 4.54

/proc/meminfo:
MemTotal: 131988232 kB   MemFree: 694332 kB   Buffers: 837584 kB   Cached: 51002896 kB   SwapCached: 0 kB
Active: 93953028 kB   Inactive: 32850628 kB   Active(anon): 70851112 kB   Inactive(anon): 4713848 kB
Active(file): 23101916 kB   Inactive(file): 28136780 kB   Unevictable: 0 kB   Mlocked: 0 kB
SwapTotal: 37748728 kB   SwapFree: 37748728 kB   Dirty: 75752 kB   Writeback: 0 kB
AnonPages: 74963768 kB   Mapped: 739884 kB   Shmem: 601592 kB   Slab: 3460252 kB
SReclaimable: 3170124 kB   SUnreclaim: 290128 kB   KernelStack: 36224 kB   PageTables: 189772 kB
NFS_Unstable: 0 kB   Bounce: 0 kB   WritebackTmp: 0 kB   CommitLimit: 169736960 kB   Committed_AS: 92208740 kB
VmallocTotal: 34359738367 kB   VmallocUsed: 492032 kB   VmallocChunk: 34291733296 kB
HardwareCorrupted: 0 kB   AnonHugePages: 67717120 kB
HugePages_Total: 0   HugePages_Free: 0   HugePages_Rsvd: 0   HugePages_Surp: 0   Hugepagesize: 2048 kB
DirectMap4k: 5056 kB   DirectMap2M: 2045952 kB   DirectMap1G: 132120576 kB

Before you say "it's a ulimit issue": [501] ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority
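A quick way to check Liang's suspicion: each SSTable component is mmapped, so 500K-700K files in one table directory can exhaust the kernel's per-process memory-map limit, at which point the JVM fails native allocations even with free RAM. A minimal sketch, assuming a typical Linux setup (1048575 is the value commonly recommended for Cassandra; the Linux default is often 65530):

    sysctl vm.max_map_count                   # current per-process mmap limit
    sudo sysctl -w vm.max_map_count=1048575   # raise it on the running kernel
    echo 'vm.max_map_count = 1048575' | sudo tee -a /etc/sysctl.conf   # persist across reboots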
Re: In place vnode conversion possible?
On 18/12/14 21:45, Robert Coli wrote:

On Tue, Dec 16, 2014 at 12:38 AM, Jonas Borgström jo...@borgstrom.se wrote: "That said, I've done some testing and it appears to be possible to perform an in-place conversion as long as all nodes contain all data (3 nodes and replication factor 3, for example), like this:"

"I would expect this to work, but to stream up to RF x the data around."

Why would any streaming take place? Simply changing the tokens and restarting a node does not seem to trigger any streaming. And if I manually trigger a nodetool repair, I notice almost no streaming, since all nodes were already responsible for 100% of the data (RF = NUM_NODES).

/ Jonas
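A minimal sketch of the in-place switch Jonas describes, assuming every node already owns all data (RF = number of nodes) and that the config lives at /etc/cassandra/cassandra.yaml; paths, service names, and the num_tokens value are placeholders that vary by install:

    sudo service cassandra stop
    # disable the node's single token and enable vnodes in cassandra.yaml:
    sudo sed -i 's/^initial_token:/# initial_token:/' /etc/cassandra/cassandra.yaml
    sudo sed -i 's/^#* *num_tokens:.*/num_tokens: 256/' /etc/cassandra/cassandra.yaml
    sudo service cassandra start
    nodetool repair    # per the thread, streams almost nothing when RF = cluster size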
Reset cfhistograms
Hi, I am using Cassandra 2.1.2 with a 5-node cluster in a single DC. I've read that histograms are reset after a node restart or a rerun of the command, but in my case they are not reset no matter how many times I run it. Could someone point out what the issue could be, or how I could reset them without restarting the node? Thanks in advance! -Nitin
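For reference, the command in question takes a keyspace and a table (placeholder names below). One hedged explanation for the behavior: 2.1 sources these histograms from the metrics library rather than the older resetting "recent" counters, so rerunning the command no longer clears them, and a node restart is the only sure reset:

    nodetool cfhistograms mykeyspace mytable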
Multi DC informations (sync)
Hi guys, We expanded our cluster to a multiple-DC configuration. Now I am wondering if there is any way to know: 1 - The replication lag between these 2 DCs (OpsCenter, nodetool, other?) 2 - That sync is OK at any given time. I guess big companies running Cassandra are interested in this kind of info, so I think something exists, but I am not aware of it. Any other important information or advice you can give me about best practices or tricks while running a multi-DC setup (cross-region US - EU) is welcome, of course! cheers, Alain
Re: 2014 nosql benchmark
Today I've also seen this benchmark on Chinese websites. SequoiaDB seems to come from a Chinese startup company, and in the db-engines ranking http://db-engines.com/en/ranking its score is 0.00. So IMO I have to say I think this benchmark is a soft sell. They compare three databases, two written in C++ and one in Java, and use a very tricky test case so that Cassandra cannot hold all data in memtables. After all, Java needs more memory than C++. For an on-disk database, the data size of one node is generally much larger than RAM, and its in-memory query performance is less important than its disk query performance. So I think this benchmark has no value at all.

2014-12-19 14:47 GMT+08:00 Wilm Schumacher wilm.schumac...@gmail.com: Hi, I'm always interested in such benchmark experiments, because the databases evolve so fast that the race is always open and there is a lot of motion in there. And of course I asked myself the same question. And I think that this publication is unreliable, for 4 reasons (from reading very quickly; perhaps there is more): 1.) It is unclear what this is all about. The title is "NoSQL Performance Testing". The subtitle is "In-Memory Performance Comparison of SequoiaDB, Cassandra, and MongoDB". However, in the introduction there is not one word about in-memory performance. The introduction could be a general introduction for a general on-disk NoSQL benchmark. So ... only the subtitle (and a short sentence in the "Result Summary") says what this is actually about. 2.) There are very important databases missing. For in-memory, e.g. Redis. If Redis is not a valid candidate in this race, why is that so? MySQL is capable of in-memory distributed operation, too. 3.) The methodology is unclear. Perhaps I'm the only one, but what does "Run workload for 30 minutes (workload file workload[1-5])" mean for mixed read/write ops? Why 30 min? Okay, I can imagine that the authors estimated the throughput, preset the number of 100 million rows, and designed it to be larger than the estimated throughput in x minutes. However, all this information is missing. And why 45% and 22% of RAM? My first idea would be a VERY low ratio, like 2% or so, and a VERY large ratio, like 80-90%, and then everything in between. Is 22% or 45% somehow a magic number? Furthermore, in the Result Summary 1/2 and 1/4 of RAM are discussed. Okay, 22% is near 1/4 ... but where does the difference come from? And btw ... 22% of what? Stuff to insert? Stuff already inserted? It's all deducible, but it's strange that the description is so sloppy. 4.) There is no repetition of the loads (as I understand it). It's one run, one result ... and it's done. I don't know a lot about Cassandra in in-memory use, but either the experiment should be repeated over quite a few runs OR it should be explained why this is not necessary. Okay, perhaps 1 is a little picky, and 4 is a little fussy. But 3 is strange and 2 stinks. Well, just my first impression. And that is: Cassandra is very fast ;). Best regards Wilm

On 19.12.2014 at 06:41, diwayou wrote: I just read this benchmark PDF. Does anyone have an opinion about it? I think it's not fair to Cassandra. URL: http://www.bankmark.de/wp-content/uploads/2014/12/bankmark-20141201-WP-NoSQLBenchmark.pdf http://msrg.utoronto.ca/papers/NoSQLBenchmark
Re: Multi DC informations (sync)
Alain,

AFAIK, the DC replication is not linearizable. That is, writes are not replicated according to a binlog or similar, as in MySQL; they are replicated concurrently. To answer your questions:

1 - Replication lag in Cassandra terms is probably "hinted handoff". You'd want to check the status of that.
2 - `nodetool status` is your friend. It will tell you whether the cluster considers other nodes reachable or not. Run it on a node in the datacenter that you'd like to test connectivity from.

Cheers, Jens ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se

On Fri, Dec 19, 2014 at 11:16 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi guys, We expanded our cluster to a multiple-DC configuration. Now I am wondering if there is any way to know: 1 - The replication lag between these 2 DCs (OpsCenter, nodetool, other?) 2 - That sync is OK at any given time. I guess big companies running Cassandra are interested in this kind of info, so I think something exists, but I am not aware of it. Any other important information or advice you can give me about best practices or tricks while running a multi-DC setup (cross-region US - EU) is welcome, of course! cheers, Alain
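Some hedged starting points for Alain's question 1, using standard nodetool commands (what thresholds count as "lagging" is up to you):

    nodetool tpstats    # check the HintedHandoff pool for active/pending hint replays
    nodetool netstats   # in-flight streams and pending commands/responses
    nodetool status     # per-DC view of which nodes this node considers up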
Re: Understanding tombstone WARN log output
Hi again,

A follow-up question (to my as-yet-unanswered question): how come the first localDeletion is Integer.MAX_VALUE above? Should it be?

Cheers, Jens ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se

On Thu, Dec 18, 2014 at 2:48 PM, Jens Rantil jens.ran...@tink.se wrote: Hi, I am occasionally seeing:

WARN [ReadStage:9576] 2014-12-18 11:16:19,042 SliceQueryFilter.java (line 225) Read 756 live and 17027 tombstoned cells in mykeyspace.mytable (see tombstone_warn_threshold). 5001 columns was requested, slices=[73c31274-f45c-4ba5-884a-6d08d20597e7:myfield-], delInfo={deletedAt=-9223372036854775808, localDeletion=2147483647, ranges=[73f0b59e-7525-4a18-a84f-d2a2f0505503-73f0b59e-7525-4a18-a84f-d2a2f0505503:!, deletedAt=141872018676, localDeletion=1418720186][74374d72-2688-4e64-bb0b-f51a956b0529-74374d72-2688-4e64-bb0b-f51a956b0529:!, deletedAt=1418720184675000, localDeletion=1418720184] ...

in system.log. My primary key is ((userid uuid), id uuid). Is it possible for me to see from this output which partition key and/or ranges have all of these tombstones? Thanks, Jens -- Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se
Drivers performance
Hello, I am in the middle of evaluating whether we should switch from Astyanax to the DataStax driver, and I did a simple benchmark that loads the same row by key 10,000 times. I was surprised by the slowness of the DataStax driver. I uploaded it to GitHub: https://github.com/michalsvec/astyanax-datastax-benchmark It was tested against Cassandra 1.2 and 2.1. Testing conditions were naive (localhost, single node, ...) but still the difference is huge. 10,000 iterations:
* Astyanax: 2734 ms
* Astyanax prepared: 1997 ms
* Datastax: 10230 ms
Is it really so slow or am I missing something? Thank you for any advice. Michal
Re: Multi DC informations (sync)
Hi Jens, thanks for your insight.

"Replication lag in Cassandra terms is probably 'hinted handoff'" -- Well, I think hinted handoffs are only used when a node is down, and are not even necessarily enabled. I guess that cross-DC async replication is something else that has nothing to do with hinted handoff, am I wrong?

"`nodetool status` is your friend. It will tell you whether the cluster considers other nodes reachable or not. Run it on a node in the datacenter that you'd like to test connectivity from." -- Connectivity ≠ write success.

Basically the two questions can be rephrased this way: 1 - How to monitor the async cross-DC write latency? 2 - What error should I look for when an async write fails (if any)? Or is there any other way to see that network throughput (for example) is too small for a given traffic?

Hope this is clearer. C*heers, Alain

2014-12-19 11:44 GMT+01:00 Jens Rantil jens.ran...@tink.se: Alain, AFAIK, the DC replication is not linearizable. That is, writes are not replicated according to a binlog or similar, as in MySQL; they are replicated concurrently. To answer your questions: 1 - Replication lag in Cassandra terms is probably "hinted handoff". You'd want to check the status of that. 2 - `nodetool status` is your friend. It will tell you whether the cluster considers other nodes reachable or not. Run it on a node in the datacenter that you'd like to test connectivity from. Cheers, Jens ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se

On Fri, Dec 19, 2014 at 11:16 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: [original question trimmed; see the first message in this thread]
Re: Drivers performance
Better question for the Java driver mailing list, but I see a number of problems in your DataStax Java driver code, and without knowing the way Astyanax handles caching of prepared statements I can tell you:

1. You're re-preparing a statement on _every_ iteration, and these are not cached by the driver. This is not only expensive, it is slower than just using non-prepared statements. This is a substantial slowdown. Drivers do not necessarily implement this the same way, so the code is not apples to apples. Change your code to prepare _once_ and I bet your numbers improve drastically.
2. Your pooling options are CRAZY high, and I'm guessing you're running out of resources with the DataStax driver; again, the code is different, with different trade-offs from Astyanax. A connection in Thrift is not remotely the same as a connection in the modern native protocol. Just use the default pooling options and I bet your numbers improve greatly (if not, there is something deeply off about your cluster and/or app servers).
3. A lot of the speed-up in the Java driver is in the async support and how the native protocol handles async; since you're doing synchronous requests, this is the best case for Thrift performance. However, that still does not explain your gap (in most synchronous cases Thrift is comparable at best, but usually not faster).
4. I haven't been able to figure out which version of the DataStax driver you're on from looking at the code; this can change performance drastically, as there have been many improvements, especially for Cassandra 2.1.

I suggest you post to the Java driver mailing list for a more in-depth discussion: https://groups.google.com/a/lists.datastax.com/forum/#!forum/java-driver-user

On Fri, Dec 19, 2014 at 7:26 AM, Svec, Michal ms...@netsuite.com wrote: Hello, I am in the middle of evaluating whether we should switch from Astyanax to the DataStax driver, and I did a simple benchmark that loads the same row by key 10,000 times. I was surprised by the slowness of the DataStax driver. I uploaded it to GitHub: https://github.com/michalsvec/astyanax-datastax-benchmark It was tested against Cassandra 1.2 and 2.1. Testing conditions were naive (localhost, single node, ...) but still the difference is huge. 10,000 iterations: Astyanax: 2734 ms; Astyanax prepared: 1997 ms; Datastax: 10230 ms. Is it really so slow or am I missing something? Thank you for any advice. Michal

-- Ryan Svihla Solution Architect, DataStax
Re: Cassandra 2.1.0 Crashes the JVM with OOM with heaps of memory free
It does appear to be a ulimit issue to some degree, as some settings are lower than recommended by a few factors (namely nproc). http://www.datastax.com/documentation/cassandra/2.0/cassandra/install/installRecommendSettings.html

* memlock unlimited
* nofile 100000
* nproc 32768
* as unlimited

However, I'm also confident you have other issues as well that are going to be problematic. Namely, what is your heap set to? Can you grep for ERROR, WARN, dropped, and GCInspector in the system.log for Cassandra and share the results?

On Fri, Dec 19, 2014 at 2:23 AM, 谢良 xieli...@xiaomi.com wrote: What's your vm.max_map_count setting? Best Regards, Liang

From: Leon Oosterwijk leon.oosterw...@macquarie.com
Sent: 2014-12-19 11:55
To: user@cassandra.apache.org
Subject: Cassandra 2.1.0 Crashes the JVM with OOM with heaps of memory free

All, We have a Cassandra cluster which seems to be struggling a bit. [quoted message and JVM crash report trimmed; see the original message at the top of this thread]
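A sketch of the log check requested above, assuming a package install's default log location (adjust the path to your environment):

    grep -E 'ERROR|WARN|dropped|GCInspector' /var/log/cassandra/system.log | tail -n 200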
Key Cache Questions
Hello all, I just read that the default size of the key cache is 100 MB. Is it stored in memory or on disk?
Re: Multi DC informations (sync)
More accurately, the write path of Cassandra in a multi-DC sense is roughly the following:

1. The write goes to a node which acts as coordinator.
2. Writes go out to all replicas in that DC, and then one write per remote DC goes out to another node which takes responsibility for writing to all replicas in its data center. The request blocks, however, until the CL is satisfied.
3. If any of these writes fail, by default a hinted handoff is generated.

So as you can see, there is effectively no lag beyond raw network latency plus node speed, and/or just failed writes and waiting on hint replay to occur. Likewise, repairs can be used to bring the data centers back in sync, and in the case of substantial outages you will need repairs to bring you back in sync; you're running repairs already, right? Think of Cassandra as a global write, and not a message queue, and you've got the basic idea.

On Fri, Dec 19, 2014 at 7:54 AM, Alain RODRIGUEZ arodr...@gmail.com wrote: Hi Jens, thanks for your insight. "Replication lag in Cassandra terms is probably 'hinted handoff'" -- Well, I think hinted handoffs are only used when a node is down, and are not even necessarily enabled. I guess that cross-DC async replication is something else that has nothing to do with hinted handoff, am I wrong? "`nodetool status` is your friend..." -- Connectivity ≠ write success. Basically the two questions can be rephrased this way: 1 - How to monitor the async cross-DC write latency? 2 - What error should I look for when an async write fails (if any)? Or is there any other way to see that network throughput (for example) is too small for a given traffic? Hope this is clearer. C*heers, Alain

2014-12-19 11:44 GMT+01:00 Jens Rantil jens.ran...@tink.se: [remainder of quoted thread trimmed; see the earlier messages in this thread]

-- Ryan Svihla Solution Architect, DataStax
Re: simple data movement ?
Thanks, this looks uglier. I double-checked my production cluster (I have a staging and development cluster as well) and production is on 1.2.8. A copy of the data resulted in this message:

Exception encountered during startup: Incompatible SSTable found. Current version ka is unable to read file: /cassandra/apache-cassandra-2.1.2/bin/../data/data/system/schema_keyspaces/system-schema_keyspaces-ic-150. Please run upgradesstables.

Is the move going to be 1.2.8 -> 1.2.9 -> 2.0.x -> 2.1.2? Can I just dump the data and import it into 2.1.2?

Jim

From: Ryan Svihla rsvi...@datastax.com
Reply-To: user@cassandra.apache.org
Date: Thu, 18 Dec 2014 06:00:09 -0600
To: user@cassandra.apache.org
Subject: Re: simple data movement ?

I'm not sure that'll work with that many version moves in the middle; upgrades are, to my knowledge, only tested between specific steps, namely from 1.2.9 to the latest 2.0.x. http://www.datastax.com/documentation/upgrade/doc/upgrade/cassandra/upgradeC_c.html Specifically:

Cassandra 2.0.x restrictions: After downloading DataStax Community http://planetcassandra.org/cassandra/, upgrade to Cassandra directly from Cassandra 1.2.9 or later. Cassandra 2.0 is not network- or SSTable-compatible with versions older than 1.2.9. If your version of Cassandra is earlier than 1.2.9 and you want to perform a rolling restart, first upgrade the entire cluster to 1.2.9, and then to Cassandra 2.0.

Cassandra 2.1.x restrictions: Upgrade to Cassandra 2.1 from Cassandra 2.0.7 or later. Cassandra 2.1 is not compatible with Cassandra 1.x SSTables. First upgrade the nodes to Cassandra 2.0.7 or later, start the cluster, upgrade the SSTables, stop the cluster, and then upgrade to Cassandra 2.1.

On Wed, Dec 17, 2014 at 10:55 PM, Ben Bromhead b...@instaclustr.com wrote: Just copy the data directory from each prod node to your test node (and relevant configuration files etc). If your IP addresses are different between test and prod, follow https://engineering.eventbrite.com/changing-the-ip-address-of-a-cassandra-node-with-auto_bootstrapfalse/

On 18 December 2014 at 09:10, Langston, Jim jim.langs...@dynatrace.com wrote: Hi all, I have set up a test environment with C* 2.1.2, wanting to test our applications against it. I currently have C* 1.2.9 in production and want to use that data for testing. What would be a good approach for simply taking a copy of the production data and moving it into the test env and having the test env C* use that data? The test env is identical in size, with the difference being the versions of C*. Thanks, Jim

-- Ben Bromhead Instaclustr | www.instaclustr.com | @instaclustr http://twitter.com/instaclustr | +61 415 936 359

-- Ryan Svihla Solution Architect, DataStax
Re: High Bloom Filter FP Ratio
We're seeing similar behavior except our FP ratio is closer to 1.0 (100%). We're using Cassandra 2.1.2.

Schema
---
CREATE TABLE contacts.contact (
    id bigint,
    property_id int,
    created_at bigint,
    updated_at bigint,
    value blob,
    PRIMARY KEY (id, property_id)
) WITH CLUSTERING ORDER BY (property_id ASC)
    *AND bloom_filter_fp_chance = 0.001*
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CF Stats Output:
Keyspace: contacts
    Read Count: 2458375
    Read Latency: 0.852844076675 ms
    Write Count: 10357
    Write Latency: 0.1816912233272183 ms
    Pending Flushes: 0
Table: contact
    SSTable count: 61
    SSTables in each level: [1, 10, 50, 0, 0, 0, 0, 0, 0]
    Space used (live): 9047112471
    Space used (total): 9047112471
    Space used by snapshots (total): 0
    SSTable Compression Ratio: 0.34119240020241487
    Memtable cell count: 24570
    Memtable data size: 1299614
    Memtable switch count: 2
    Local read count: 2458290
    Local read latency: 0.853 ms
    Local write count: 10044
    Local write latency: 0.186 ms
    Pending flushes: 0
    Bloom filter false positives: 11096
    *Bloom filter false ratio: 0.99197*
    Bloom filter space used: 3923784
    Compacted partition minimum bytes: 373
    Compacted partition maximum bytes: 152321
    Compacted partition mean bytes: 9938
    Average live cells per slice (last five minutes): 37.57851240677983
    Maximum live cells per slice (last five minutes): 63.0
    Average tombstones per slice (last five minutes): 0.0
    Maximum tombstones per slice (last five minutes): 0.0

-- about.me http://about.me/markgreene

On Wed, Dec 17, 2014 at 1:32 PM, Chris Hart ch...@remilon.com wrote: Hi, I have created the following table with bloom_filter_fp_chance=0.01:

CREATE TABLE logged_event (
    time_key bigint,
    partition_key_randomizer int,
    resource_uuid timeuuid,
    event_json text,
    event_type text,
    field_error_list map<text, text>,
    javascript_timestamp timestamp,
    javascript_uuid uuid,
    page_impression_guid uuid,
    page_request_guid uuid,
    server_received_timestamp timestamp,
    session_id bigint,
    PRIMARY KEY ((time_key, partition_key_randomizer), resource_uuid)
) WITH bloom_filter_fp_chance=0.01
    AND caching='KEYS_ONLY'
    AND comment=''
    AND dclocal_read_repair_chance=0.00
    AND gc_grace_seconds=864000
    AND index_interval=128
    AND read_repair_chance=0.00
    AND replicate_on_write='true'
    AND populate_io_cache_on_flush='false'
    AND default_time_to_live=0
    AND speculative_retry='99.0PERCENTILE'
    AND memtable_flush_period_in_ms=0
    AND compaction={'class': 'SizeTieredCompactionStrategy'}
    AND compression={'sstable_compression': 'LZ4Compressor'};

When I run cfstats, I see a much higher false positive ratio:

Table: logged_event
    SSTable count: 15
    Space used (live), bytes: 104128214227
    Space used (total), bytes: 104129482871
    SSTable Compression Ratio: 0.3295840184239226
    Number of keys (estimate): 199293952
    Memtable cell count: 56364
    Memtable data size, bytes: 20903960
    Memtable switch count: 148
    Local read count: 1396402
    Local read latency: 0.362 ms
    Local write count: 2345306
    Local write latency: 0.062 ms
    Pending tasks: 0
    Bloom filter false positives: 147705
    Bloom filter false ratio: 0.49020
    Bloom filter space used, bytes: 249129040
    Compacted partition minimum bytes: 447
    Compacted partition maximum bytes: 315852
    Compacted partition mean bytes: 1636
    Average live cells per slice (last five minutes): 0.0
    Average tombstones per slice (last five minutes): 0.0

Any idea what could be causing this? This is time-series data. Every time we read from this table, we read a single row key with 1000 partition_key_randomizer values. I'm running Cassandra 2.0.11. I tried running an upgradesstables to rewrite
Re: Multi DC informations (sync)
All that you said matches the idea I had of how it works, except this part:

"The request blocks however until the CL is satisfied" -- Does this mean that the client will see an error if the local DC writes the data correctly (i.e. CL reached) but the remote DC fails? This is not the idea I had of something asynchronous...

If it doesn't fail on the client side (truly asynchronous), is there a way to make sure the remote DC has indeed received the information? I mean, if the cross-region throughput is too small, the write will fail and so will the HH, potentially. How to detect that we are lacking cross-DC throughput, for example?

Repairs are indeed a good thing (we run them as a weekly routine, GC grace period 10 sec), but having inconsistency for a week without knowing it is quite an issue.

Thanks for this detailed information Ryan, I hope I am clear enough while expressing my doubts.

C*heers, Alain

2014-12-19 15:43 GMT+01:00 Ryan Svihla rsvi...@datastax.com: More accurately, the write path of Cassandra in a multi-DC sense is roughly the following... [remainder of quoted thread trimmed; see the earlier messages in this thread]
Re: simple data movement ?
It may be more valuable to set up your test cluster on the same version and make sure your tokens are the same; then copy over your sstables. You'll have an exact replica of prod on which you can test your upgrade process.

On Fri Dec 19 2014 at 11:04:58 AM Ryan Svihla rsvi...@datastax.com wrote: In theory, you could always do a data dump (sstable to JSON and back, for example), but you'd have to have your schema set up, and I've not actually done this myself, so YMMV. I've helped a bunch of folks with that upgrade path, and while it's time consuming, it does work.

On Fri, Dec 19, 2014 at 8:49 AM, Langston, Jim jim.langs...@dynatrace.com wrote: Thanks, this looks uglier. I double-checked my production cluster (I have a staging and development cluster as well) and production is on 1.2.8. ... Is the move going to be 1.2.8 -> 1.2.9 -> 2.0.x -> 2.1.2? Can I just dump the data and import it into 2.1.2? Jim [remainder of quoted thread trimmed; see the earlier messages in this thread]

-- Ryan Svihla Solution Architect, DataStax
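A hedged per-node sketch of that stepped path (install/start commands depend on your packaging and are placeholders here):

    nodetool drain            # flush memtables and stop accepting writes before swapping binaries
    # stop Cassandra, install the next version (1.2.9, then 2.0.x, then 2.1.2), start it back up
    nodetool version          # confirm the running version before moving on
    nodetool upgradesstables  # after each major step, rewrite SSTables into the new on-disk format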
Node down during move
Hi list, we added a new node to an existing 8-node cluster with C* 1.2.9 without vnodes, and because we are almost totally out of space, we are shuffling the token of one node after another (not in parallel). During one of these move operations, the receiving node died and thus the streaming failed:

WARN [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,227 StorageService.java (line 3703) Streaming to /X.Y.Z.18 failed
INFO [RMI TCP Connection(12940)-X.Y.Z.17] 2014-12-19 19:25:56,233 ColumnFamilyStore.java (line 629) Enqueuing flush of Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
INFO [FlushWriter:3772] 2014-12-19 19:25:56,238 Memtable.java (line 461) Writing Memtable-local@433096244(70/70 serialized/live bytes, 2 ops)
ERROR [Streaming to /X.Y.Z.18:2] 2014-12-19 19:25:56,246 CassandraDaemon.java (line 192) Exception in thread Thread[Streaming to /X.Y.Z.18:2,5,RMI Runtime] java.lang.RuntimeException: java.io.IOException: Broken pipe
at com.google.common.base.Throwables.propagate(Throwables.java:160)
at org.apache.cassandra.utils.WrappedRunnable.run(WrappedRunnable.java:32)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: Broken pipe
at sun.nio.ch.FileDispatcherImpl.write0(Native Method)

After a restart of the receiving node, we tried to perform the move again, but it failed with:

Exception in thread main java.io.IOException: target token 113427455640312821154458202477256070486 is already owned by another node.
at org.apache.cassandra.service.StorageService.move(StorageService.java:2930)

So we tried to move it to a token just 1 higher, to trigger the movement. This didn't move anything, but it finished successfully:

INFO [Thread-5520] 2014-12-19 20:00:24,689 StreamInSession.java (line 199) Finished streaming session 4974f3c0-87b1-11e4-bf1b-97d9ac6bd256 from /X.Y.Z.18

Now, it is quite improbable that the first streaming completed and the node died just after copying everything, as that ERROR was the last message about streaming in the logs. Is there any way to make sure the data really moved, so that running nodetool cleanup is safe? Thank you. Jiri Hoky
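A few hedged sanity checks before cleanup; none is a hard guarantee on its own, but repair is the conservative backstop, since cleanup permanently drops data outside a node's ranges while repair only adds data:

    nodetool netstats   # confirm no streams are still pending or hung
    nodetool ring       # confirm the moved token now shows the expected owner
    nodetool repair     # re-streams anything the interrupted move missed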
Re: Multi DC informations (sync)
Replies inline.

On Fri, Dec 19, 2014 at 10:30 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

"All that you said matches the idea I had of how it works, except this part: 'The request blocks however until the CL is satisfied' -- Does this mean that the client will see an error if the local DC writes the data correctly (i.e. CL reached) but the remote DC fails? This is not the idea I had of something asynchronous..."

Asynchronous just means all requests are sent out at once; the client response is blocked until the CL is satisfied or a timeout occurs. If CL is ONE, for example, the first response back will be a success on the client, regardless of what has happened in the background. If it's, say, ALL, then yes, it would wait for all responses to come back.

"If it doesn't fail on the client side (truly asynchronous), is there a way to make sure the remote DC has indeed received the information? I mean, if the cross-region throughput is too small, the write will fail and so will the HH, potentially. How to detect that we are lacking cross-DC throughput, for example?"

Monitoring, logging, etc. If an application needs EACH_QUORUM consistency across all data centers and the performance penalty is worthwhile, then that's probably what you're asking for. If LOCAL_QUORUM + regular repairs is fine, then do that; if CL ONE is fine, then do that. You SHOULD BE monitoring dropped mutations and hints via JMX or something like OpsCenter. Outages of substantial length should probably involve a repair; if an outage is over your HH timeout, it DEFINITELY should involve a repair. If you ever have a doubt, it should involve repair.

"Repairs are indeed a good thing (we run them as a weekly routine, GC grace period 10 sec), but having inconsistency for a week without knowing it is quite an issue."

Then use a higher consistency level, so that the client is not surprised, knows the state of things, and doesn't consider a write successful until it's consistent across the data centers (I'd argue this is probably not what you really want, but different applications have different needs). If you need only local-data-center-level awareness, LOCAL_QUORUM reads and writes will get you where you want to be; but complete multi-datacenter, nearly immediate consistency that you know about on the client is not free, and it isn't with any system.

"Thanks for this detailed information Ryan, I hope I am clear enough while expressing my doubts."

I think it's a bit of a misunderstanding of the tools available. If you have a need for full, nearly immediate data center consistency, my suggestion is sizing (from a network-pipe and application-design SLA perspective) for a higher CL on writes and potentially reads; the tools are there.

2014-12-19 15:43 GMT+01:00 Ryan Svihla rsvi...@datastax.com: More accurately, the write path of Cassandra in a multi-DC sense is roughly the following... [remainder of quoted thread trimmed; see the earlier messages in this thread]
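An illustrative cqlsh sketch of the trade-off Ryan describes: EACH_QUORUM blocks the client on every data center, LOCAL_QUORUM only on the local one (the keyspace and table are made-up names):

    cqlsh -e "CONSISTENCY EACH_QUORUM; INSERT INTO myks.mytable (id, val) VALUES (1, 'acknowledged by every DC');"
    cqlsh -e "CONSISTENCY LOCAL_QUORUM; INSERT INTO myks.mytable (id, val) VALUES (2, 'local DC only');"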
Re: Key Cache Questions
If you have JNA installed, it's stored off-heap in RAM; without JNA, it's stored on-heap, also in RAM. The following should help explain in more depth: http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra

On Fri, Dec 19, 2014 at 8:35 AM, Batranut Bogdan batra...@yahoo.com wrote: Hello all, I just read that the default size of the key cache is 100 MB. Is it stored in memory or on disk?

-- Ryan Svihla Solution Architect, DataStax
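To see the cache in action on a running node (the output format varies a bit by version):

    nodetool info | grep -i 'key cache'   # entries, size, capacity, hit rate

Capacity comes from key_cache_size_in_mb in cassandra.yaml; left empty, it defaults to the smaller of 5% of the heap and 100 MB, which is where the 100 MB figure comes from.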
Re: Multi DC informations (sync)
Your gc grace should be longer than your repair schedule; otherwise you're likely going to have deleted data resurface.

On Fri Dec 19 2014 at 8:31:13 AM Alain RODRIGUEZ arodr...@gmail.com wrote: All that you said matches the idea I had of how it works, except this part... [remainder of quoted thread trimmed; see the earlier messages in this thread]
Cheers, Jens ——— Jens Rantil Backend engineer Tink AB Email: jens.ran...@tink.se Phone: +46 708 84 18 32 Web: www.tink.se

On Fri, Dec 19, 2014 at 11:16 AM, Alain RODRIGUEZ arodr...@gmail.com wrote:

Hi guys, We expanded our cluster to a multi-DC configuration. Now I am wondering if there is any way to know: 1 - The replication lag between these 2 DCs (OpsCenter, nodetool, other?) 2 - How to make sure that the sync is OK at any time. I guess big companies running Cassandra are interested in this kind of info, so I think something exists, but I am not aware of it. Any other important information or advice you can give me about best practices or tricks while running a multi-DC setup (cross-region US - EU) is welcome, of course! Cheers, Alain

-- Ryan Svihla, Solution Architect, DataStax (http://www.datastax.com/)
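[Editor's note: one concrete answer to Alain's "is there a way to make sure the remote DC has indeed received the information?" is to raise the consistency level on the writes that matter. With LOCAL_QUORUM the coordinator acknowledges once the local DC has a quorum and remote-DC failures only surface later as hints and repair; with EACH_QUORUM the request blocks until a quorum in every DC has acknowledged, so a cross-DC problem becomes a client-side error. A minimal sketch against the Java driver (2.x era); the contact point, keyspace, and events table are invented for the example:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.ConsistencyLevel;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;
    import com.datastax.driver.core.exceptions.UnavailableException;
    import com.datastax.driver.core.exceptions.WriteTimeoutException;

    public class CrossDcWriteCheck {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
            Session session = cluster.connect("my_keyspace"); // hypothetical keyspace

            // EACH_QUORUM blocks until a quorum of replicas in *every* DC has
            // acknowledged, so a cross-DC problem fails the write right here.
            Statement write = new SimpleStatement(
                    "INSERT INTO events (id, payload) VALUES (1, 'x')") // hypothetical table
                    .setConsistencyLevel(ConsistencyLevel.EACH_QUORUM);
            try {
                session.execute(write);
                System.out.println("Acknowledged by a quorum in every DC");
            } catch (UnavailableException | WriteTimeoutException e) {
                // The client-side error Alain asked about: a DC (or the link
                // to it) that cannot satisfy the CL surfaces here instead of
                // silently turning into a hint.
                System.err.println("Cross-DC write not confirmed: " + e);
            } finally {
                cluster.close();
            }
        }
    }

The trade-off is that every such write now pays the cross-DC round trip. For the monitoring side of the question, the hint backlog per node is one rough signal that cross-DC delivery is falling behind (for example, the HintedHandoff pool in nodetool tpstats).]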
Re: High Bloom Filter FP Ratio
I took a look at the code where the bloom filter true/false positive counters are updated and noticed that the true-positive count isn't being updated on key cache hits: https://issues.apache.org/jira/browse/CASSANDRA-8525. That may explain your ratios. Can you try querying for a few non-existent partition keys in cqlsh with tracing enabled (just run TRACING ON) and see if you really do get that high of a false-positive ratio?

On Fri, Dec 19, 2014 at 9:59 AM, Mark Greene green...@gmail.com wrote:

We're seeing similar behavior, except our FP ratio is closer to 1.0 (100%). We're using Cassandra 2.1.2.

Schema
---
CREATE TABLE contacts.contact (
    id bigint,
    property_id int,
    created_at bigint,
    updated_at bigint,
    value blob,
    PRIMARY KEY (id, property_id)
) WITH CLUSTERING ORDER BY (property_id ASC)
    *AND bloom_filter_fp_chance = 0.001*
    AND caching = '{"keys":"ALL", "rows_per_partition":"NONE"}'
    AND comment = ''
    AND compaction = {'min_threshold': '4', 'class': 'org.apache.cassandra.db.compaction.LeveledCompactionStrategy', 'max_threshold': '32'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND dclocal_read_repair_chance = 0.1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND read_repair_chance = 0.0
    AND speculative_retry = '99.0PERCENTILE';

CF Stats Output:
-
Keyspace: contacts
    Read Count: 2458375
    Read Latency: 0.852844076675 ms.
    Write Count: 10357
    Write Latency: 0.1816912233272183 ms.
    Pending Flushes: 0
    Table: contact
    SSTable count: 61
    SSTables in each level: [1, 10, 50, 0, 0, 0, 0, 0, 0]
    Space used (live): 9047112471
    Space used (total): 9047112471
    Space used by snapshots (total): 0
    SSTable Compression Ratio: 0.34119240020241487
    Memtable cell count: 24570
    Memtable data size: 1299614
    Memtable switch count: 2
    Local read count: 2458290
    Local read latency: 0.853 ms
    Local write count: 10044
    Local write latency: 0.186 ms
    Pending flushes: 0
    Bloom filter false positives: 11096
    *Bloom filter false ratio: 0.99197*
    Bloom filter space used: 3923784
    Compacted partition minimum bytes: 373
    Compacted partition maximum bytes: 152321
    Compacted partition mean bytes: 9938
    Average live cells per slice (last five minutes): 37.57851240677983
    Maximum live cells per slice (last five minutes): 63.0
    Average tombstones per slice (last five minutes): 0.0
    Maximum tombstones per slice (last five minutes): 0.0

-- about.me http://about.me/markgreene

On Wed, Dec 17, 2014 at 1:32 PM, Chris Hart ch...@remilon.com wrote:

Hi, I have created the following table with bloom_filter_fp_chance=0.01:

CREATE TABLE logged_event (
    time_key bigint,
    partition_key_randomizer int,
    resource_uuid timeuuid,
    event_json text,
    event_type text,
    field_error_list map<text, text>,
    javascript_timestamp timestamp,
    javascript_uuid uuid,
    page_impression_guid uuid,
    page_request_guid uuid,
    server_received_timestamp timestamp,
    session_id bigint,
    PRIMARY KEY ((time_key, partition_key_randomizer), resource_uuid)
) WITH bloom_filter_fp_chance=0.01
    AND caching='KEYS_ONLY'
    AND comment=''
    AND dclocal_read_repair_chance=0.00
    AND gc_grace_seconds=864000
    AND index_interval=128
    AND read_repair_chance=0.00
    AND replicate_on_write='true'
    AND populate_io_cache_on_flush='false'
    AND default_time_to_live=0
    AND speculative_retry='99.0PERCENTILE'
    AND memtable_flush_period_in_ms=0
    AND compaction={'class': 'SizeTieredCompactionStrategy'}
    AND compression={'sstable_compression': 'LZ4Compressor'};

When I run cfstats, I see a much higher false positive ratio:

Table: logged_event
    SSTable count: 15
    Space used (live), bytes: 104128214227
    Space used (total), bytes: 104129482871
    SSTable Compression Ratio: 0.3295840184239226
    Number of keys (estimate): 199293952
    Memtable cell count: 56364
    Memtable data size, bytes: 20903960
    Memtable switch count: 148
    Local read count: 1396402
    Local read latency: 0.362 ms
    Local write count: 2345306
    Local write latency: 0.062 ms
    Pending tasks: 0
    Bloom filter false positives: 147705
    Bloom filter false ratio: 0.49020
    Bloom filter space used, bytes:
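[Editor's note: for context on Tyler's point, the reported ratio is derived roughly as falsePositives / (falsePositives + truePositives). If key cache hits skip the true-positive increment (CASSANDRA-8525), the denominator is undercounted and the ratio drifts toward 1.0 even when the filter is behaving, which fits Mark's 0.99197. Tracing a read for a known-absent key sidesteps the broken counter. A minimal sketch of Tyler's TRACING ON suggestion done through the Java driver instead of cqlsh; the contact point is illustrative, it assumes Mark's contacts.contact table, and the probe key id = -1 is assumed not to exist:

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.QueryTrace;
    import com.datastax.driver.core.ResultSet;
    import com.datastax.driver.core.Session;
    import com.datastax.driver.core.SimpleStatement;
    import com.datastax.driver.core.Statement;

    public class BloomFilterProbe {
        public static void main(String[] args) {
            Cluster cluster = Cluster.builder().addContactPoint("10.0.0.1").build();
            Session session = cluster.connect("contacts");

            // Query a partition key assumed NOT to exist, with tracing on
            // (the programmatic equivalent of cqlsh's TRACING ON).
            Statement probe = new SimpleStatement(
                    "SELECT * FROM contact WHERE id = -1")
                    .enableTracing();
            ResultSet rs = session.execute(probe);

            QueryTrace trace = rs.getExecutionInfo().getQueryTrace();
            for (QueryTrace.Event event : trace.getEvents()) {
                // Bloom-filter decisions appear as individual trace events.
                System.out.println(event.getDescription());
            }
            cluster.close();
        }
    }

In the trace output, each sstable the filter rules out shows up as a line like "Bloom filter allows skipping sstable ..."; comparing those against the sstables actually read gives a sanity check on the real false-positive rate, independent of the broken cfstats counter.]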
Re: High Bloom Filter FP Ratio
Hi Tyler, I tried what you said and false positives look much more reasonable there. Thanks for looking into this.

-Chris

- Original Message -
From: Tyler Hobbs ty...@datastax.com
To: user@cassandra.apache.org
Sent: Friday, December 19, 2014 1:25:29 PM
Subject: Re: High Bloom Filter FP Ratio
Re: In place vnode conversion possible?
On Fri, Dec 19, 2014 at 12:25 AM, Jonas Borgström jo...@borgstrom.se wrote:

Why would any streaming take place? Simply changing the tokens and restarting a node does not seem to trigger any streaming.

Oh, sorry for not reading the whole mail -- I figured you were going to do something less low-level hacky. :) That method seems like it would work. Basically, in this case (RF=N) shotgun range movements are safe, because nothing's actually moving.

=Rob
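[Editor's note: the "changing the tokens" step amounts to giving each node a set of random tokens before restart. A minimal sketch in the spirit of the thread, assuming Murmur3Partitioner (whose tokens are signed 64-bit values) and 256 vnodes; it just prints a comma-separated list that could go into an initial_token line, as an illustration of the approach discussed rather than a vetted migration procedure:

    import java.util.SortedSet;
    import java.util.TreeSet;
    import java.util.concurrent.ThreadLocalRandom;

    public class VnodeTokens {
        public static void main(String[] args) {
            int numTokens = 256; // should match num_tokens in cassandra.yaml

            // Murmur3Partitioner tokens are signed 64-bit values, so a uniform
            // random long is a uniform random token. Long.MIN_VALUE is skipped
            // because the partitioner reserves it as the minimum token.
            SortedSet<Long> tokens = new TreeSet<Long>();
            while (tokens.size() < numTokens) {
                long t = ThreadLocalRandom.current().nextLong();
                if (t != Long.MIN_VALUE) {
                    tokens.add(t);
                }
            }

            // Print a comma-separated list suitable for an initial_token line.
            StringBuilder sb = new StringBuilder();
            for (Long t : tokens) {
                if (sb.length() > 0) {
                    sb.append(',');
                }
                sb.append(t);
            }
            System.out.println(sb);
        }
    }

The RF=N observation is what makes this safe: every node already owns a full copy of the data, so reshuffling token ownership doesn't change which data a node must hold.]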
Re: Practical use of counters in the industry
On Thu, Dec 18, 2014 at 7:19 PM, Rajath Subramanyam rajat...@gmail.com wrote:

Thanks Ken. Any other use cases where counters are used apart from Rainbird?

Disqus use(d? s?) them behind an in-memory accumulator which batches and periodically flushes. This is the best way to use old counters. New counters should be usable in more cases without something in front of them.

=Rob
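[Editor's note: Rob's "in-memory accumulator" pattern, sketched below: increments are buffered in process memory and flushed on a timer as counter updates, so Cassandra sees one counter write per key per interval instead of one per event. A rough illustration, assuming the Java driver (2.x era) and a hypothetical counter table hits (key text PRIMARY KEY, count counter); the one-second flush interval and schema are invented for the example:

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;
    import java.util.concurrent.atomic.AtomicLong;

    import com.datastax.driver.core.PreparedStatement;
    import com.datastax.driver.core.Session;

    public class CounterAccumulator {
        private final Map<String, AtomicLong> pending =
                new ConcurrentHashMap<String, AtomicLong>();
        private final Session session;
        private final PreparedStatement flushStmt;

        public CounterAccumulator(Session session) {
            this.session = session;
            // Hypothetical counter table:
            // CREATE TABLE hits (key text PRIMARY KEY, count counter)
            this.flushStmt = session.prepare(
                    "UPDATE hits SET count = count + ? WHERE key = ?");
            ScheduledExecutorService ses =
                    Executors.newSingleThreadScheduledExecutor();
            ses.scheduleAtFixedRate(new Runnable() {
                public void run() { flush(); }
            }, 1, 1, TimeUnit.SECONDS);
        }

        // Cheap in-memory increment; no Cassandra write happens here.
        public void increment(String key, long delta) {
            AtomicLong c = pending.get(key);
            if (c == null) {
                AtomicLong fresh = new AtomicLong();
                c = pending.putIfAbsent(key, fresh);
                if (c == null) {
                    c = fresh;
                }
            }
            c.addAndGet(delta);
        }

        private void flush() {
            for (Map.Entry<String, AtomicLong> e : pending.entrySet()) {
                // getAndSet(0) drains the bucket without losing increments
                // that race in during the flush.
                long delta = e.getValue().getAndSet(0);
                if (delta != 0) {
                    // One counter update per key per interval instead of
                    // one per increment.
                    session.execute(flushStmt.bind(delta, e.getKey()));
                }
            }
        }
    }

The trade-off is the usual one for pre-aggregation: increments buffered since the last flush are lost if the process dies, which is acceptable for approximate analytics-style counts -- the workload old counters were best suited to.]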