Re: Cassandra, vnodes, and spark
Look into the source code of the Spark connector. CassandraRDD tries to find all token ranges (even when using vnodes) for each node (endpoint) and creates RDD partitions to match this distribution of token ranges. Thus data locality is guaranteed.

On Tue, Sep 16, 2014 at 4:39 AM, Eric Plowe eric.pl...@gmail.com wrote: Interesting. The way I understand the Spark connector is that it's basically a client executing a CQL query and filling a Spark RDD. Spark will then handle the partitioning of data. Again, this is my understanding, and it may be incorrect.

On Monday, September 15, 2014, Robert Coli rc...@eventbrite.com wrote: On Mon, Sep 15, 2014 at 4:57 PM, Eric Plowe eric.pl...@gmail.com wrote: Based on this stackoverflow question, vnodes affect the number of mappers Hadoop needs to spawn, which in turn affects performance. With the Spark connector for Cassandra, would the same situation happen? Would vnodes affect performance in a similar way to Hadoop?

I don't know what specifically Spark does here, but if it has the same locality expectations as Hadoop generally, my belief would be: yes. =Rob
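For anyone curious what the connector is partitioning against: nodetool can print the token-range-to-endpoint mapping directly. A small illustration (my_keyspace is a placeholder; any existing keyspace works):

    # One TokenRange(...) line per range, with the endpoints that own it.
    nodetool describering my_keyspace | head

    # Total range count; with vnodes, expect roughly num_tokens per node.
    nodetool describering my_keyspace | grep -c TokenRange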
Direct IO with Spark and Hadoop over Cassandra
Hi. As I see it, massive data processing tools (map/reduce) for C* data include connectors:

- Calliope http://tuplejump.github.io/calliope/
- Datastax spark-cassandra-connector https://github.com/datastax/spark-cassandra-connector
- Stratio Deep https://github.com/Stratio/stratio-deep
- other free/commercial

and runtimes (job management and infrastructure):

- Spark
- Hadoop

But if I'm not mistaken, all these solutions use the network for data loading. In the best case the logic instance (some job) runs on the same node where the corresponding range was found. Why can't this logic use direct C* IO (sstable reading from disk)? Any cons? Some time ago I read an article (I still can't find it) about academic research in which Hadoop was modified to support this direct IO mode. According to those benchmarks, direct IO gave a significant performance increase.
Re: Direct IO with Spark and Hadoop over Cassandra
If you access the C* SSTables directly from those frameworks, you will: 1) miss live data which is in memory and not yet flushed to disk, and 2) skip the Dynamo layer of C*, which is responsible for data consistency.

On 16 Sept 2014 10:58, platon.tema platon.t...@yandex.ru wrote: [...]
Document of WRITETIME function needs update
Hi, I found that the WRITETIME function on a counter column returns the date/time in milliseconds instead of microseconds, which is not mentioned in the document http://www.datastax.com/documentation/cql/3.1/cql/cql_using/use_writetime.html. It would be helpful to clarify the difference in the document.

One side question: I denormalize the counter column value to regular tables, using read-after-write at QUORUM consistency from the counter table and updating the regular tables with the counter column's write time to resolve write conflicts. Is this a valid use case? Thanks, Ziju.
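A quick way to see the units for yourself, assuming a hypothetical keyspace and tables (the heredoc form works on 2.0-era cqlsh, which lacks an -e option). A regular column's WRITETIME is microseconds since the epoch; the counter case is the milliseconds value Ziju reports:

    cqlsh <<'CQL'
    -- regular column: expect a ~16-digit microsecond timestamp
    SELECT WRITETIME(body) FROM my_ks.messages WHERE id = 1;
    -- counter column: a ~13-digit millisecond timestamp, per the report above
    SELECT WRITETIME(hits) FROM my_ks.page_counters WHERE url = '/';
    CQL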
Re: Cassandra, vnodes, and spark
Ran into this performance report: https://github.com/datastax/spark-cassandra-connector/issues/200 Does the Spark connector in its current state issue one CQL query per vnode, or one task per vnode? Regards.

On Tue, Sep 16, 2014 at 2:05 AM, DuyHai Doan doanduy...@gmail.com wrote: [...]
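Whichever way the scheduling works out, the per-range read a connector performs is roughly of this shape; the table name, columns, and token bounds below are made up:

    cqlsh <<'CQL'
    -- scan a single (vnode) token range, as a connector would per RDD partition
    SELECT id, payload FROM my_ks.my_table
     WHERE token(id) > -9223372036854775807 AND token(id) <= -9000000000000000000;
    CQL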
Re: Direct IO with Spark and Hadoop over Cassandra
Thanks. But 1) can be overcome with the C* API for the commitlog and memtables, or with mixed access (direct IO plus traditional connectors, or pure CQL if the data model allows; we have experimented with this). 2) is more complex for a universal solution. In our case C* runs without replication (RF=1) because of the huge data size (replication is too expensive).

On 09/16/2014 03:40 PM, DuyHai Doan wrote: [...]
RE: Direct IO with Spark and Hadoop over Cassandra
You will also have to read/resolve multiple row instances (if you update records) and tombstones (if you delete records) yourself.

From: platon.tema [mailto:platon.t...@yandex.ru] Sent: Tuesday, September 16, 2014 1:51 PM To: user@cassandra.apache.org Subject: Re: Direct IO with Spark and Hadoop over Cassandra [...]
Re: Direct IO with Spark and Hadoop over Cassandra
Yes, updates and deletes are trouble. At the moment, for updated collections we refresh the result data with a query to C* (Java driver) before reporting to the user. For deletes we can skip them during scanning, by TTL for example (not tested yet).

On 09/16/2014 04:53 PM, moshe.kr...@barclays.com wrote: [...]
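For anyone weighing the direct-IO route, sstable2json (which ships with Cassandra 2.0) shows what a raw SSTable reader has to cope with; the data file path below is a made-up example:

    # Deleted cells appear with a "d" tombstone marker, and the same row key
    # can appear in several SSTables, which a direct reader must merge itself.
    sstable2json /var/lib/cassandra/data/my_ks/my_table/my_ks-my_table-jb-1-Data.db | head -20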
Re: hs_err_pid3013.log, out of memory?
How much memory does your system have? How much memory is the system utilizing before starting Cassandra (use the command free)? What are the heap settings it tries to use? Chris

On Sep 15, 2014, at 8:16 PM, Yatong Zhang bluefl...@gmail.com wrote:

It's during the startup. I tried to upgrade Cassandra from 2.0.7 to 2.0.10, but it looks like Cassandra could not start again. Also I found the following log at '/var/log/messages':

Sep 16 09:06:59 storage6 kernel: INFO: task java:4971 blocked for more than 120 seconds.
Sep 16 09:06:59 storage6 kernel: Tainted: G --- H 2.6.32-431.el6.x86_64 #1
Sep 16 09:06:59 storage6 kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
Sep 16 09:06:59 storage6 kernel: java D 0003 0 4971 1 0x0080
Sep 16 09:06:59 storage6 kernel: 88042b591c98 0082 81ed4ff0 8803b4f01540
Sep 16 09:06:59 storage6 kernel: 88042b591c68 810af370 88042b591ca0 8803b4f01540
Sep 16 09:06:59 storage6 kernel: 8803b4f01af8 88042b591fd8 fbc8 8803b4f01af8
Sep 16 09:06:59 storage6 kernel: Call Trace:
Sep 16 09:06:59 storage6 kernel: [810af370] ? exit_robust_list+0x90/0x160
Sep 16 09:06:59 storage6 kernel: [81076ad5] exit_mm+0x95/0x180
Sep 16 09:06:59 storage6 kernel: [81076f1f] do_exit+0x15f/0x870
Sep 16 09:06:59 storage6 kernel: [81077688] do_group_exit+0x58/0xd0
Sep 16 09:06:59 storage6 kernel: [8108d046] get_signal_to_deliver+0x1f6/0x460
Sep 16 09:06:59 storage6 kernel: [8100a265] do_signal+0x75/0x800
Sep 16 09:06:59 storage6 kernel: [81066629] ? wake_up_new_task+0xd9/0x130
Sep 16 09:06:59 storage6 kernel: [81070ead] ? do_fork+0x13d/0x480
Sep 16 09:06:59 storage6 kernel: [810b1c0b] ? sys_futex+0x7b/0x170
Sep 16 09:06:59 storage6 kernel: [8100aa80] do_notify_resume+0x90/0xc0
Sep 16 09:06:59 storage6 kernel: [8100b341] int_signal+0x12/0x17
Sep 16 09:06:59 storage6 kernel: INFO: task java:4972 blocked for more than 120 seconds.
Sep 16 09:06:59 storage6 kernel: Tainted: G --- H 2.6.32-431.el6.x86_64 #1
Sep 16 09:06:59 storage6 kernel: echo 0 > /proc/sys/kernel/hung_task_timeout_secs disables this message.
Sep 16 09:06:59 storage6 kernel: java D 0 4972 1 0x0080
Sep 16 09:06:59 storage6 kernel: 8803b4d7fc98 0082 81ed6d78 8803b4cf1500
Sep 16 09:06:59 storage6 kernel: 8803b4d7fc68 810af370 8803b4d7fca0 8803b4cf1500
Sep 16 09:06:59 storage6 kernel: 8803b4cf1ab8 8803b4d7ffd8 fbc8 8803b4cf1ab8
Sep 16 09:06:59 storage6 kernel: Call Trace:
Sep 16 09:06:59 storage6 kernel: [810af370] ? exit_robust_list+0x90/0x160
Sep 16 09:06:59 storage6 kernel: [81076ad5] exit_mm+0x95/0x180
Sep 16 09:06:59 storage6 kernel: [81076f1f] do_exit+0x15f/0x870
Sep 16 09:06:59 storage6 kernel: [81065e20] ? wake_up_state+0x10/0x20
Sep 16 09:06:59 storage6 kernel: [81077688] do_group_exit+0x58/0xd0
Sep 16 09:06:59 storage6 kernel: [8108d046] get_signal_to_deliver+0x1f6/0x460
Sep 16 09:06:59 storage6 kernel: [8100a265] do_signal+0x75/0x800
Sep 16 09:06:59 storage6 kernel: [810097cc] ? __switch_to+0x1ac/0x320
Sep 16 09:06:59 storage6 kernel: [81527910] ? thread_return+0x4e/0x76e
Sep 16 09:06:59 storage6 kernel: [810b1c0b] ? sys_futex+0x7b/0x170
Sep 16 09:06:59 storage6 kernel: [8100aa80] do_notify_resume+0x90/0xc0
Sep 16 09:06:59 storage6 kernel: [8100b341] int_signal+0x12/0x17
Sep 16 09:06:59 storage6 kernel: INFO: task java:4973 blocked for more than 120 seconds.

On Tue, Sep 16, 2014 at 9:00 AM, Robert Coli rc...@eventbrite.com wrote:

On Mon, Sep 15, 2014 at 5:55 PM, Yatong Zhang bluefl...@gmail.com wrote: I just encountered an error which left a log '/hs_err_pid3013.log'. So is there a way to solve this?
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (malloc) failed to allocate 12288 bytes for committing reserved memory.

Use less heap memory? You haven't specified under which circumstances this occurred, so I can only conjecture that it is likely being caused by writing too fast. Write more slowly. =Rob
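A native malloc failure like this is usually about OS-level memory rather than the JVM heap alone. A few hedged checks (file paths and config location are assumptions for a typical package install):

    # How much memory is actually free before Cassandra starts?
    free -m

    # Strict overcommit (mode 2) can fail small native allocations
    # even when the heap itself fits.
    cat /proc/sys/vm/overcommit_memory

    # Cassandra docs recommend raising this kernel limit.
    sysctl vm.max_map_count

    # What heap is cassandra-env.sh actually computing?
    grep -E 'MAX_HEAP_SIZE|HEAP_NEWSIZE' /etc/cassandra/cassandra-env.sh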
Re: Trying to understand cassandra gc logs
Check out: https://blogs.oracle.com/poonam/entry/understanding_cms_gc_logs

The young gen collection is stop-the-world, so it pauses application threads, and a couple of parts of CMS can as well. I would also recommend disabling the JVM_OPTS="$JVM_OPTS -XX:PrintFLSStatistics=1" line in your cassandra-env.sh, to simplify things a little and make the log parsable by GC log visualization tools.

---
Chris Lohfink

On Sep 15, 2014, at 9:40 PM, Donald Smith donald.sm...@audiencescience.com wrote:

I understand that cassandra uses ParNew GC for New Gen and CMS for Old Gen (tenured). I'm trying to interpret from the logs when a Full GC happens and what kind of Full GC is used. It never says "Full GC" or anything like that. But I see that whenever there's a line like

2014-09-15T18:04:17.197-0700: 117485.192: [CMS-concurrent-mark-start]

the count of full GCs increases from a line like

{Heap after GC invocations=158459 (full 931):

to a line like:

{Heap before GC invocations=158459 (full 932):

See the highlighted lines in the gclog output below. So, apparently there was a full GC between those two lines. Between those lines it also has two lines, such as:

2014-09-15T18:04:17.197-0700: 117485.192: Total time for which application threads were stopped: 0.0362080 seconds
2014-09-15T18:04:17.882-0700: 117485.877: Total time for which application threads were stopped: 0.0129660 seconds

Also, the full count (932 above) is always exactly half the number (1864) FGC returned by jstat, as in:

dc1-cassandra01.dc01 /var/log/cassandra> sudo jstat -gcutil 28511
  S0     S1     E      O      P     YGC      YGCT     FGC    FGCT      GCT
 55.82   0.00  82.45  45.02  59.76  165772  5129.728  1864  320.247  5449.975

So, I am apparently correct that "(full 932)" is the count of Full GCs. I'm perplexed by the log output, though. I also see lines mentioning "concurrent mark-sweep" that do not appear to correspond to full GCs. So, my questions are:

1. Is CMS also used for full GCs? If not, what kind of GC is done? The logs don't say.
2. Lines saying "Total time for which application threads were stopped" appear twice per full GC; why?

Apparently, even our Full GCs are fast. 99% of them finish within 0.18 seconds; 99.9% finish within 0.5 seconds (which may be too slow for some of our clients). Here below is some log output, with the interesting parts highlighted in grey or yellow.

Thanks, Don

{Heap before GC invocations=158458 (full 931):
 par new generation total 1290240K, used 1213281K [0x0005bae0, 0x00061260, 0x00061260)
  eden space 1146880K, 100% used [0x0005bae0, 0x000600e0, 0x000600e0)
  from space 143360K, 46% used [0x000600e0, 0x000604ed87c0, 0x000609a0)
  to space 143360K, 0% used [0x000609a0, 0x000609a0, 0x00061260)
 concurrent mark-sweep generation total 8003584K, used 5983572K [0x00061260, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 44820K, used 26890K [0x0007fae0, 0x0007fd9c5000, 0x0008)
2014-09-15T18:04:17.131-0700: 117485.127: [GC
Before GC: Statistics for BinaryTreeDictionary: Total Free Space: 197474318 Max Chunk Size: 160662270 Number of Blocks: 3095 Av. Block Size: 63804 Tree Height: 32
Before GC: Statistics for BinaryTreeDictionary: Total Free Space: 2285026 Max Chunk Size: 2279936 Number of Blocks: 8 Av. Block Size: 285628 Tree Height: 5
2014-09-15T18:04:17.133-0700: 117485.128: [ParNew
Desired survivor size 73400320 bytes, new threshold 1 (max 1)
- age 1: 44548776 bytes, 44548776 total
: 1213281K->49867K(1290240K), 0.0264540 secs] 7196854K->6059170K(9293824K)
After GC: Statistics for BinaryTreeDictionary: Total Free Space: 195160244 Max Chunk Size: 160662270 Number of Blocks: 3093 Av. Block Size: 63097 Tree Height: 32
After GC: Statistics for BinaryTreeDictionary: Total Free Space: 2285026 Max Chunk Size: 2279936 Number of Blocks: 8 Av. Block Size: 285628 Tree Height: 5
, 0.0286700 secs] [Times: user=0.37 sys=0.01, real=0.03 secs]
Heap after GC invocations=158459 (full 931):
 par new generation total 1290240K, used 49867K [0x0005bae0, 0x00061260, 0x00061260)
  eden space 1146880K, 0% used [0x0005bae0, 0x0005bae0, 0x000600e0)
  from space 143360K, 34% used [0x000609a0, 0x00060cab2e18, 0x00061260)
  to space 143360K, 0% used [0x000600e0, 0x000600e0, 0x000609a0)
 concurrent mark-sweep generation total 8003584K, used 6009302K [0x00061260, 0x0007fae0, 0x0007fae0)
 concurrent-mark-sweep perm gen total 44820K, used
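For what it's worth, the two numbers Don is correlating can be pulled straight out of the log; a small sketch (the log path is an assumption; match it to your -Xloggc setting):

    # CMS cycle count -- the "(full N)" counter ticks once per cycle.
    grep -c 'CMS-concurrent-mark-start' /var/log/cassandra/gc.log

    # The longest application pauses, in seconds.
    grep 'Total time for which application threads were stopped' /var/log/cassandra/gc.log \
      | awk '{print $(NF-1)}' | sort -rn | head -5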
Consistency Level for Atomic Batches
Is consistency level honored for batch statements? If I have 100 insert/update statements in my batch and use LOCAL_QUORUM consistency, will control return from the coordinator only after a local quorum update has been done for all 100 statements? Or is it different? Thanks, Vish
Re: Consistency Level for Atomic Batches
A follow-up on the earlier question: I meant to ask whether control returns to the client after the batch log is written on the coordinator, irrespective of the consistency level specified. Also: will the coordinator attempt all statements one after the other, or in parallel? Thanks

On Tue, Sep 16, 2014 at 8:00 AM, Viswanathan Ramachandran vish.ramachand...@gmail.com wrote: [...]
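For reference, consistency level is set per session or per request, not inside the batch itself; a minimal cqlsh sketch with hypothetical names:

    cqlsh <<'CQL'
    CONSISTENCY LOCAL_QUORUM;
    BEGIN BATCH
      INSERT INTO my_ks.t (id, v) VALUES (1, 'a');
      UPDATE my_ks.t SET v = 'b' WHERE id = 2;
    APPLY BATCH;
    CQL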
Blocking while a node finishes joining the cluster after restart.
Say I want to do a rolling restart of Cassandra… I can't just restart all of them because they need some time to gossip and for that gossip to get to all nodes. What is the best strategy for this? It would be something like:

/etc/init.d/cassandra restart
wait-for-cassandra.sh

… or something along those lines.

--
Founder/CEO Spinn3r.com
Location: San Francisco, CA
blog: http://burtonator.wordpress.com … or check out my Google+ profile https://plus.google.com/102718274791889610666/posts
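A minimal sketch of the wait-for-cassandra.sh idea, assuming nodetool is on the PATH and that the client-facing signal you care about is Thrift coming up; the 60-attempt cap is arbitrary:

    #!/usr/bin/env bash
    # Block until the local node reports Thrift as running (~5 min max).
    for i in $(seq 1 60); do
      if nodetool statusthrift 2>/dev/null | grep -q running; then
        echo "Cassandra is up"; exit 0
      fi
      sleep 5
    done
    echo "timed out waiting for Cassandra" >&2; exit 1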
Re: Blocking while a node finishes joining the cluster after restart.
Hi Kevin, if you are using the latest version of OpsCenter, then even the community (= free) edition can do a rolling restart of your cluster. It's pretty convenient. Ciao, Duncan.

On 16/09/14 19:44, Kevin Burton wrote: [...]
Re: Blocking while a node finishes joining the cluster after restart.
FYI: OpsCenter has a default of sleeping 60 seconds after each node restart, and an option to drain before stopping. I haven't noticed if they do anything special with seeds. (At least one seed needs to be running before you restart other nodes.) I wondered the same thing as Kevin and came to these conclusions. Fixing the startup script is non-trivial as far as startup scripts go. For a start, it would have to:

- parse cassandra.yaml for seeds
- if it is not itself a seed, wait for a seed to start first (could take minutes, or never happen)
- continue the start.

For a no-downtime cluster restart script, it would have to:

- verify cluster health (i.e. quorum/CL is met, or you lose writes)
- parse cassandra.yaml for seeds and see if a seed is up
- stop gossip and thrift
- maybe do compaction before drain
- drain the node
- stop/start or restart the cassandra process (see the sketch after this message).

http://comments.gmane.org/gmane.comp.db.cassandra.user/20144

Both of those scripts would be nice to have. :) OpsCenter is flaky at doing rolling restarts in my test cluster, so an alternative is needed. Also, the free OpsCenter doesn't have the rolling repair option enabled. ccm has options to do drain, stop and start, but a bash script would be needed to make it rolling. https://github.com/pcmanus/ccm Thanks, James. -- Cassandra/MySQL DBA. Available in San Jose area or remote.

From: Duncan Sands duncan.sa...@gmail.com To: user@cassandra.apache.org Sent: Tuesday, September 16, 2014 11:09 AM Subject: Re: Blocking while a node finishes joining the cluster after restart. [...]
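Stitching James's second list into a script might look roughly like this; the host list (IPs as shown by nodetool status), ssh access, and init script path are all assumptions:

    #!/usr/bin/env bash
    # Rolling restart sketch: quiesce, drain, restart, wait for Up/Normal.
    HOSTS="10.0.0.1 10.0.0.2 10.0.0.3"
    for h in $HOSTS; do
      ssh "$h" 'nodetool disablethrift; nodetool disablegossip; nodetool drain'
      ssh "$h" 'sudo /etc/init.d/cassandra restart'
      # Wait until the ring reports this node Up/Normal (UN) again.
      until nodetool status 2>/dev/null | grep "$h" | grep -q '^UN'; do
        sleep 5
      done
    done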
Re: Blocking while a node finishes joining the cluster after restart.
On Tue, Sep 16, 2014 at 12:21 PM, James Briggs james.bri...@yahoo.com wrote: I haven't noticed if they do anything special with seeds. (At least one seed needs to be running before you restart other nodes.) If the nodes have all seen each other before (the cluster has coalesced once) then AFAIK this statement is not true. The ring state is persisted, nodes don't need to talk to a seed to start. I wondered the same thing as Kevin and came to these conclusions. As I don't think the seed node wrinkle exists, I'm pretty sure all you really have to do is make sure the node is answering on the Thrift and Gossip ports and that other nodes all see it as UP. =Rob
Re: Blocking while a node finishes joining the cluster after restart.
Hi Robert. I just did a test (shut down all nodes, started one non-seed node). You're correct that an old non-seed node can start by itself. So startup scripts don't have to be intelligent, but apps need to wait until there are enough nodes up to serve the whole keyspace:

    cqlsh:my_keyspace> consistency
    Current consistency level is ONE.
    cqlsh:my_keyspace> select * from numbers where v=1;

     v
    ---
     1

    (1 rows)

    cqlsh:my_keyspace> select * from numbers where v=2;
    Unable to complete request: one or more nodes were unavailable.

Thanks, James. -- Cassandra/MySQL DBA. Available in San Jose area or remote.
backport of CASSANDRA-6916
Hello, Has anyone backported incremental replacement of compacted SSTables (CASSANDRA-6916) to 2.0? Is it doable, or are there too many dependencies introduced in 2.1? I haven't checked the ticket details yet, but just in case anyone has interesting info to share. Cheers, -- Paulo Motta Chaordic | Platform www.chaordic.com.br +55 48 3232.3200
Re: backport of CASSANDRA-6916
On Tue, Sep 16, 2014 at 2:56 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: [...]

Are you looking to patch for public consumption, or for your own purposes? I just took the temperature of #cassandra-dev and they were cold on the idea as a public patch, because of potential impact on stability. =Rob
Re: backport of CASSANDRA-6916
For my own purposes, but I wouldn't mind making it public so people could patch it themselves if they want to (if nobody has already done so) :)

On Tue, Sep 16, 2014 at 8:13 PM, Robert Coli rc...@eventbrite.com wrote: [...]

--
Paulo Motta Chaordic | Platform www.chaordic.com.br +55 48 3232.3200
Re: backport of CASSANDRA-6916
Paulo: Out of curiosity, why not just upgrade to 2.1 if you want the new features? You know you want to! :) Thanks, James Briggs -- Cassandra/MySQL DBA. Available in San Jose area or remote.

From: Robert Coli rc...@eventbrite.com To: user@cassandra.apache.org Sent: Tuesday, September 16, 2014 4:13 PM Subject: Re: backport of CASSANDRA-6916 [...]
Re: backport of CASSANDRA-6916
Because I want this specific feature and not all of the 2.1 features, even though this is probably one of the most significant changes in 2.1. Upgrading would be nice, but I want to wait a little longer before fully jumping into 2.1 :) We're seeing sudden peaks in read latency some time after a massive batch write, which is most likely caused by the cold page cache of newly compacted sstables, and which will hopefully be solved by this.

On Tue, Sep 16, 2014 at 8:25 PM, James Briggs james.bri...@yahoo.com wrote: [...]

--
Paulo Motta Chaordic | Platform www.chaordic.com.br +55 48 3232.3200
Re: backport of CASSANDRA-6916
On Tue, Sep 16, 2014 at 4:38 PM, Paulo Ricardo Motta Gomes paulo.mo...@chaordicsystems.com wrote: We're seeing sudden peaks in read latency some time after a massive batch write, which is most likely caused by the cold page cache of newly compacted sstables, and which will hopefully be solved by this.

populate_io_cache_on_flush? Note that this feature is sorta badly named; it includes flushing of SSTables as part of compaction. https://issues.apache.org/jira/browse/CASSANDRA-4694?focusedCommentId=13723129&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-13723129 =Rob
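For reference, in 2.0 this is a per-table property; a hedged sketch with hypothetical names (check the CQL docs for your exact version, as the property literal may differ slightly):

    cqlsh <<'CQL'
    ALTER TABLE my_ks.my_table WITH populate_io_cache_on_flush = true;
    CQL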
Re: backport of CASSANDRA-6916
On Tue, Sep 16, 2014 at 4:50 PM, Robert Coli rc...@eventbrite.com wrote: [...]

Also note that it is removed as of 2.1. https://issues.apache.org/jira/browse/CASSANDRA-7495 =Rob
no change observed in read latency after switching from EBS to SSD storage
Hi - We are running Cassandra 2.0.5 on AWS on m3.large instances. These instances were using EBS for storage (I know it is not recommended). We replaced the EBS storage with SSDs. However, we didn't see any change in read latency. A query that took 10 seconds when data was stored on EBS still takes 10 seconds even after we moved the data directory to SSD. It is a large query returning 200,000 CQL rows from a single partition. We are reading 3 columns from each row and the combined data in these three columns for each row is around 100 bytes. In other words, the raw data returned by the query is approximately 20MB. I was expecting at least 5-10 times reduction in read latency going from EBS to SSD, so I am puzzled why we are not seeing any change in performance. Does anyone have insight as to why we don't see any performance impact on the reads going from EBS to SSD? Thanks, Mohammed
Re: no change observed in read latency after switching from EBS to SSD storage
On Tue, Sep 16, 2014 at 5:35 PM, Mohammed Guller moham...@glassbeam.com wrote: Does anyone have insight as to why we don't see any performance impact on the reads going from EBS to SSD? What does it say when you enable tracing on this CQL query? 10 seconds is a really long time to access anything in Cassandra. There is, generally speaking, a reason why the default timeouts are lower than this. My conjecture is that the data in question was previously being served from the page cache and is now being served from SSD. You have, in switching from EBS-plus-page-cache to SSD successfully proved that SSD and RAM are both very fast. There is also a strong suggestion that whatever access pattern you are using is not bounded by disk performance. =Rob
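Enabling tracing is a one-liner in cqlsh; the keyspace, table, and column names below are hypothetical:

    cqlsh <<'CQL'
    TRACING ON;
    SELECT col1, col2, col3 FROM my_ks.my_table WHERE pk = 'big-partition' LIMIT 100;
    CQL

The trace breaks the elapsed time down step by step, which is what settles the disk-versus-elsewhere question.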
Announce: top for Cassandra - cass_top
I wrote cass_top, a poor man's version of OpsCenter, in bash (no dependencies):

http://www.jebriggs.com/blog/2014/09/top-utility-for-cassandra-clusters-cass_top/

Actually, if it had node or cluster restart, it would do most of what the OpsCenter free version does. :) The features of cass_top are:

- colorizes nodetool status output: UN nodes green, DN nodes red, other statuses blue
- no extra firewall holes needed (agent-less and server-less), unlike OpsCenter
- fast initial startup time (under 2 seconds), unlike OpsCenter
- uses bash, so no programming environment needed - run it anywhere nodetool works
- uses minimal screen real estate, so several rings can fit on one monitor
- free (Apache 2).

Please send me your comments and suggestions. The top-like infinite loop is actually a read loop, so adding a few more features like cfstats or flush would be easy; a minimal sketch of the loop follows. Enjoy, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote.
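Not cass_top itself, but a minimal sketch of the read-loop-plus-colorize idea James describes:

    #!/usr/bin/env bash
    # Refresh nodetool status every few seconds; UN green, DN red.
    while true; do
      clear
      nodetool status | sed \
        -e $'s/^UN/\033[32mUN\033[0m/' \
        -e $'s/^DN/\033[31mDN\033[0m/'
      sleep 5
    done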
Re: no change observed in read latency after switching from EBS to SSD storage
To expand on what Robert said, Cassandra is a log-structured database:

- writes are append operations, so both correctly configured disk volumes and SSDs are fast at that
- reads could be helped by SSD if they're not in cache (i.e. they're on disk)
- but compaction is definitely helped by SSD with large data loads (compaction is the trade-off for fast writes)

Thanks, James Briggs. -- Cassandra/MySQL DBA. Available in San Jose area or remote. Mailbox dimensions: 10x12x14

From: Robert Coli rc...@eventbrite.com To: user@cassandra.apache.org Sent: Tuesday, September 16, 2014 5:42 PM Subject: Re: no change observed in read latency after switching from EBS to SSD storage [...]
Re: hs_err_pid3013.log, out of memory?
Are you using JNA? Did you adjust your memlock limit?

On Tue, Sep 16, 2014 at 9:46 AM, Chris Lohfink clohf...@blackbirdit.com wrote: [...]
Re: no change observed in read latency after switching from EBS to SSD storage
Mohammed, to add to the previous answers: EBS is network-attached. With SSD or without it, you access your disk via the network, constrained by network bandwidth and latency. If you really need to improve IO performance, try switching to ephemeral storage (also called instance storage), which is physically attached to the EC2 instance and is as good as native disk IO gets.

On Tue, Sep 16, 2014 at 11:39 PM, James Briggs james.bri...@yahoo.com wrote: [...]
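A crude way to sanity-check which storage you are actually measuring (mount points are assumptions; dropping the page cache first avoids the RAM-masquerading-as-disk trap discussed above):

    # Drop the page cache, then time sequential reads from each volume.
    sync && echo 3 | sudo tee /proc/sys/vm/drop_caches
    dd if=/mnt/ebs/testfile of=/dev/null bs=4k count=100000   # EBS-backed path
    dd if=/mnt/ssd/testfile of=/dev/null bs=4k count=100000   # instance-store SSD path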
Re: no change observed in read latency after switching from EBS to SSD storage
If you cached your tables or the database, you may not see any difference at all. Regards, -Tony

On Tuesday, September 16, 2014 6:36 PM, Mohammed Guller moham...@glassbeam.com wrote: [...]
Re: no change observed in read latency after switching from EBS to SSD storage
EBS vs. local SSD: in terms of latency, you are measuring in ms. If your query runs for 10s you will not notice anything; what is a few less ms over the life of a 10-second query? To reiterate what Rob said: the query is probably slow because of your use case / data model, not the underlying disk.

On 17 September 2014 14:21, Tony Anecito adanec...@yahoo.com wrote: [...]

--
Ben Bromhead Instaclustr | www.instaclustr.com | @instaclustr | +61 415 936 359
RE: no change observed in read latency after switching from EBS to SSD storage
Rob, The 10-second latency that I gave earlier is from CQL tracing. Almost 5 seconds of that was taken up by the "merge memtable and sstables" step. The remaining 5 seconds are from "read live and tombstoned cells." I too first thought that maybe disk is not the bottleneck and Cassandra is serving everything from cache, but in that case it should not take 10 seconds to read just 20MB of data. Also, I narrowed the query down to a single-partition read, and I ran the query in cqlsh on the same node. I turned on tracing, which shows that all the steps got executed on the same node. htop shows that CPU and memory are not the bottlenecks. Network should not come into play since cqlsh is running on the same node. Is there any performance tuning parameter in cassandra.yaml for large reads? Mohammed

From: Robert Coli [mailto:rc...@eventbrite.com] Sent: Tuesday, September 16, 2014 5:42 PM To: user@cassandra.apache.org Subject: Re: no change observed in read latency after switching from EBS to SSD storage [...]
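If "merge memtable and sstables" dominates the trace, it is worth checking how many SSTables each read touches; a hedged sketch with hypothetical names:

    # Per-table latency and "SSTables per read" distributions.
    nodetool cfhistograms my_ks my_table

    # Coordinator-level read latency percentiles.
    nodetool proxyhistograms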
Re: C* 2.1
DSE/Solr is tightly integrated, so there is no "external" system to manage: insert data with CQL and within a few seconds it is available for query from Solr, running in the same JVM as Cassandra. DSE/Solr indexes the data on each Cassandra node and uses Cassandra's cluster management for distributing queries across the cluster. And... Lucene (underneath Solr) is optimal for queries that span multiple fields. DSE/Solr supports CQL3 wide rows (clustering columns). -- Jack Krupansky

From: Ram N Sent: Monday, September 15, 2014 4:34 PM To: user Subject: Re: C* 2.1

Jack, Using Solr or an external search/indexing service is an option, but it increases the complexity of managing different systems. I am curious to understand the impact of having wide rows in a separate CF for inverted-index purposes, which, if I understand correctly, is what Rob's response suggests: a separate CF for the index is better than using the default secondary index option. It would be great to understand the design decision behind the present secondary index implementation if the alternative is better; looking at the JIRAs, it is still confusing to work out the why :) --R

On Mon, Sep 15, 2014 at 11:17 AM, Jack Krupansky j...@basetechnology.com wrote: If you're indexing and querying on that many columns (dozens, or more than a handful), consider DSE/Solr, especially if you need to query on multiple columns in the same query. -- Jack Krupansky

From: Robert Coli Sent: Monday, September 15, 2014 11:07 AM To: user@cassandra.apache.org Subject: Re: C* 2.1

On Sat, Sep 13, 2014 at 3:49 PM, Ram N yrami...@gmail.com wrote: Is 2.1 a production-ready release? https://engineering.eventbrite.com/what-version-of-cassandra-should-i-run/

Datastax Java driver - I get too confused with CQL and the underlying storage model. I am also not clear on the indexing structure of columns. Do CQL indexes create a separate CF for the index table? How is that different from maintaining an inverted index? Internally, are both the same? Does the CQL statement to create an index create a separate CF, with an atomic way of updating/managing it? Which one scales better? (something like stargate-core or the ones done by usergrid? or the CQL approach?)

New projects should use CQL. Access to underlying storage via Thrift is likely to eventually be removed from Cassandra.

On a separate note, just curious: if I have 1000s of columns in a given row and a fixed set of indexed columns (say 30-50 columns), which approach should I be taking? Will Cassandra scale with this many indexed columns? Are there any limits? How much of an impact do CQL indexes create on the system? I am also not sure if these use cases are the right choice for Cassandra, but would really appreciate any response on these. Thanks.

Use of the Secondary Indexes feature is generally an anti-pattern in Cassandra. 30-50 indexed columns in a row sounds insane to me. However, 30-50 column families into which one manually denormalized does not sound too insane to me... =Rob http://twitter.com/rcolidba
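To make the two options in this thread concrete, a hedged CQL sketch (names hypothetical): the built-in secondary index, which Cassandra maintains as a hidden node-local structure, versus a hand-maintained inverted-index table of the kind Rob describes:

    cqlsh <<'CQL'
    -- built-in secondary index
    CREATE INDEX users_by_state_idx ON my_ks.users (state);

    -- manually denormalized inverted index, one CF per indexed column
    CREATE TABLE my_ks.users_by_state (
      state   text,
      user_id uuid,
      PRIMARY KEY (state, user_id)
    );
    CQL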