[Cassandra Wiki] Update of "FAQ" by JonathanEllis

Apache Wiki Mon, 15 Aug 2016 08:05:23 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "FAQ" page has been changed by JonathanEllis:
https://wiki.apache.org/cassandra/FAQ?action=diff&rev1=172&rev2=173

Comment:
Replaced by in-tree docs

+ [[http://cassandra.apache.org/doc/latest/faq/index.html|Superseded by the FAQ 
here].
- = Frequently asked questions =
-  * [[#cant_listen_on_ip_any|Why can't I make Cassandra listen on 0.0.0.0 (all 
my addresses)?]]
-  * [[#ports|What ports does Cassandra use?]]
-  * [[#existing_data_when_adding_new_nodes|What happens to existing data in my 
cluster when I add new nodes?]]
-  * [[#what_kind_of_hardware_should_i_use|What kind of hardware should I run 
Cassandra on?]]
-  * [[#architecture|What are SSTables and Memtables?]]
-  * [[#working_with_timeuuid_in_java|Why is it so hard to work with 
TimeUUIDType in Java?]]
-  * [[#i_deleted_what_gives|I delete data from Cassandra, but disk usage stays 
the same. What gives?]]
-  * [[#cloned|Why does nodetool ring only show one entry, even though my nodes 
logged that they see each other joining the ring?]]
-  * [[#change_replication|Can I change the ReplicationFactor on a live 
cluster?]]
-  * [[#large_file_and_blob_storage|Can I store large files or BLOBs in 
Cassandra?]]
-  * [[#jmx_localhost_refused|Nodetool says "Connection refused to host: 
127.0.1.1" for any remote host. What gives?]]
-  * [[#iter_world|How can I iterate over all the rows in a ColumnFamily?]]
-  * [[#gui|Is there a GUI admin tool for Cassandra?]]
-  * [[#clustername_mismatch|Cassandra says "ClusterName mismatch: 
oldClusterName != newClusterName" and refuses to start]]
-  * [[#batch_mutate_atomic|Are batch operations atomic?]]
-  * [[#batch_bulkload|Will batching my operations speed up my bulk load?]]
-  * [[#hadoop_support|Is Hadoop (i.e. Map/Reduce, Pig, Hive) supported?]]
-  * [[#multi_tenant|Can a Cassandra cluster be multi-tenant?]]
-  * [[#using_cassandra|Who is using Cassandra and for what?]]
-  * [[#what_about_the_obdc|Are there any OBDC drivers for Cassandra?]]
-  * [[#logging_using_cassandra|Are there ways to do logging directly to 
Cassandra?]]
-  * [[#rhel_selinux|Problems using on RHEL?]]
-  * [[#auth|Is there an authentication/authorization mechanism for Cassandra?]]
-  * [[#bulkloading|How do I bulk load data into Cassandra?]]
-  * [[#unsubscribe|How do I unsubscribe from the email list?]]
-  * [[#mmap|Why does top report that Cassandra is using a lot more memory than 
the Java heap max?]]
-  * [[#jna|I'm getting java.io.IOException: Cannot run program "ln" when 
trying to snapshot or update a keyspace]]
-  * [[#replicaplacement|How does Cassandra decide which nodes have what data?]]
-  * [[#cachehitrateunits|I have a row or key cache hit rate of 0.XX123456789.  
Is that XX% or 0.XX% ?]]
-  * [[#seed|What are seeds?]]
-  * [[#seed_spof|Does single seed mean single point of failure?]]
-  * [[#jconsole_array_arg|Why can't I call jmx method X on jconsole? (ex. 
getNaturalEndpoints)]]
-  * [[#max_key_size|What's the maximum key size permitted?]]
-  * [[#ubuntu_hangs|I'm using Ubuntu with JNA, and weird things keep hanging 
and stalling and printing scary tracebacks in dmesg!]]
-  * [[#schema_disagreement|What are schema disagreement errors and how do I 
fix them?]]
-  * [[#dropped_messages|Why do I see "... messages dropped.." in the logs?]]
-  * [[#memlock|Cassandra dies with "java.lang.OutOfMemoryError: Map failed"]]
-  * [[#opp|Why should I avoid order-preserving partitioners?]]
-  * [[#clocktie|What happens if two updates are made with the same timestamp?]]
-  * [[#bootstreamfail|Why bootstrapping a new node fails with a "Stream 
failed" error?]]
  
- <<Anchor(cant_listen_on_ip_any)>>
- 
- == Why can't I make Cassandra listen on 0.0.0.0 (all my addresses)? ==
- Cassandra is a gossip-based distributed system.  !ListenAddress is also 
"contact me here address," i.e., the address it tells other nodes to reach it 
at.  Telling other nodes "contact me on any of my addresses" is a bad idea; if 
different nodes in the cluster pick different addresses for you, Bad Things 
happen.
- 
- If you don't want to manually specify an IP to !ListenAddress for each node 
in your cluster (understandable!), leave it blank and Cassandra will use 
!InetAddress.getLocalHost() to pick an address.  Then it's up to you or your 
ops team to make things resolve correctly (/etc/hosts/, dns, etc).
- 
- One exception to this process is JMX, which by default binds to 0.0.0.0 (Java 
bug [[http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=6425769|6425769]]).
- 
- See [[https://issues.apache.org/jira/browse/CASSANDRA-256|CASSANDRA-256]] and 
[[https://issues.apache.org/jira/browse/CASSANDRA-43|CASSANDRA-43]] for more 
gory details.
- 
- <<Anchor(ports)>>
- 
- == What ports does Cassandra use? ==
- By default, Cassandra uses 7000 for cluster communication (7001 if SSL is 
enabled), 9160 for Thrift clients, 9042 for native protocol clients, and 7199 
for [[JmxInterface|JMX]].  The internode communication, Thrift, and native 
protocol ports are configurable in cassandra.yaml.  The JMX port is 
configurable in cassandra-env.sh (through JVM options). All ports are TCP. See 
also RunningCassandra.
- 
- <<Anchor(existing_data_when_adding_new_nodes)>>
- 
- == What happens to existing data in my cluster when I add new nodes? ==
- When a new nodes joins a cluster, it will automatically contact the other 
nodes in the cluster and copy the right data to itself.
- 
- <<Anchor(what_kind_of_hardware_should_i_use)>>
- 
- == What kind of hardware should I run Cassandra on? ==
- See [CassandraHardware].
- 
- <<Anchor(architecture)>>
- 
- == What are SSTables and Memtables? ==
- See [[MemtableSSTable]].
- 
- <<Anchor(working_with_timeuuid_in_java)>>
- 
- == Why is it so hard to work with TimeUUIDType in Java? ==
- TimeUUID's are difficult to use from java clients because java.util.UUID does 
not support generating version 1 (time-based) UUIDs.  Here is one way to work 
with them and Cassandra:
- 
- Use the UUID generator from: http://johannburkard.de/software/uuid/. See 
[[TimeBaseUUIDNotes | Time based UUID Notes]]
- 
- Below are three methods that are quite useful in working with the uuids as 
they come in and out of Cassandra.
- 
- Generate a new UUID to use in a TimeUUIDType sorted column family.
- 
- {{{
-         /**
-          * Gets a new time uuid.
-          *
-          * @return the time uuid
-          */
-         public static java.util.UUID getTimeUUID()
-         {
-                 return java.util.UUID.fromString(new 
com.eaio.uuid.UUID().toString());
-         }
- }}}
- When you read out of cassandra your getting a byte[] that needs to be 
converted into a TimeUUID and since the java.util.UUID doesn't seem to have a 
simple way of doing this, pass it through the eaio uuid dealio again.
- 
- {{{
-         /**
-          * Returns an instance of uuid.
-          *
-          * @param uuid the uuid
-          * @return the java.util. uuid
-          */
-         public static java.util.UUID toUUID( byte[] uuid )
-         {
-         long msb = 0;
-         long lsb = 0;
-         assert uuid.length == 16;
-         for (int i=0; i<8; i++)
-             msb = (msb << 8) | (uuid[i] & 0xff);
-         for (int i=8; i<16; i++)
-             lsb = (lsb << 8) | (uuid[i] & 0xff);
-         long mostSigBits = msb;
-         long leastSigBits = lsb;
- 
-         com.eaio.uuid.UUID u = new com.eaio.uuid.UUID(msb,lsb);
-         return java.util.UUID.fromString(u.toString());
-         }
- }}}
- When you want to actually place the UUID into the Column then you'll want to 
convert it like this. This method is often used in conjuntion with the 
getTimeUUID() mentioned above.
- 
- {{{
-         /**
-          * As byte array.
-          *
-          * @param uuid the uuid
-          *
-          * @return the byte[]
-          */
-         public static byte[] asByteArray(java.util.UUID uuid)
-         {
-             long msb = uuid.getMostSignificantBits();
-             long lsb = uuid.getLeastSignificantBits();
-             byte[] buffer = new byte[16];
- 
-             for (int i = 0; i < 8; i++) {
-                     buffer[i] = (byte) (msb >>> 8 * (7 - i));
-             }
-             for (int i = 8; i < 16; i++) {
-                     buffer[i] = (byte) (lsb >>> 8 * (7 - i));
-             }
- 
-             return buffer;
-         }
- }}}
- Further, it is often useful to create a TimeUUID object from some time other 
than the present: for example, to use as the lower bound in a !SlicePredicate 
to retrieve all columns whose TimeUUID comes after time ''X''. Most libraries 
don't provide this functionality, probably because this breaks the "Universal" 
part of UUID: '''this should give you pause'''! Never assume such a UUID is 
unique: use it only as a marker for a specific time.
- 
- With those disclaimers out of the way, if you feel a need to create a 
TimeUUID based on a specific date, here is some code that will work:
- 
- {{{
-         public static java.util.UUID uuidForDate(Date d)
-         {
- /*
-   Magic number obtained from #cassandra's thobbs, who
-   claims to have stolen it from a Python library.
- */
-             final long NUM_100NS_INTERVALS_SINCE_UUID_EPOCH = 
0x01b21dd213814000L;
- 
-             long origTime = d.getTime();
-             long time = origTime * 10000 + 
NUM_100NS_INTERVALS_SINCE_UUID_EPOCH;
-             long timeLow = time &       0xffffffffL;
-             long timeMid = time &   0xffff00000000L;
-             long timeHi = time & 0xfff000000000000L;
-             long upperLong = (timeLow << 32) | (timeMid >> 16) | (1 << 12) | 
(timeHi >> 48) ;
-             return new java.util.UUID(upperLong, 0xC000000000000000L);
-         }
- }}}
- <<Anchor(i_deleted_what_gives)>>
- 
- == I delete data from Cassandra, but disk usage stays the same. What gives? ==
- Data you write to Cassandra gets persisted to SSTables. Since SSTables are 
immutable, the data can't actually be removed when you perform a delete, 
instead, a marker (also called a "tombstone") is written to indicate the 
value's new status. Never fear though, on the first compaction that occurs 
between the data and the tombstone, the data will be expunged completely and 
the corresponding disk space recovered. See DistributedDeletes for more detail.
- 
- <<Anchor(cloned)>>
- 
- == Why does nodetool ring only show one entry, even though my nodes logged 
that they see each other joining the ring? ==
- This happens when you have the same token assigned to each node.  Don't do 
that.
- 
- Most often this bites people who deploy by installing Cassandra on a VM 
(especially when using the Debian package, which auto-starts Cassandra after 
installation, thus generating and saving a token), then cloning that VM to 
other nodes.
- 
- The easiest fix is to wipe the data and commitlog directories, thus making 
sure that each node will generate a random token on the next restart.
- 
- <<Anchor(change_replication)>>
- 
- == Can I change the ReplicationFactor on a live cluster? ==
- Yes, but it will require running repair to change the replica count of 
existing data.
- 
-  * Alter the !ReplicationFactor for the desired keyspace(s) using 
cassandra-cli.
- 
- If you're reducing the !ReplicationFactor:
- 
-  * Run "nodetool cleanup" on the cluster to remove surplus replicated data. 
Cleanup runs on a per-node basis.
- 
- If you're increasing the !ReplicationFactor:
- 
-  * Run "nodetool repair" to run an anti-entropy repair on the cluster. Repair 
runs on a per-replica set basis. This is an intensive process that may result 
in adverse cluster performance. It's highly recommended to do rolling repairs, 
as an attempt to repair the entire cluster at once will most likely swamp it.
- 
- <<Anchor(large_file_and_blob_storage)>>
- 
- == Can I Store BLOBs in Cassandra? ==
- Currently Cassandra isn't optimized specifically for large file or BLOB 
storage.   However, files of  around  64Mb and smaller can be easily stored in 
the database without splitting them into smaller chunks.  This is primarily due 
to the fact that Cassandra's public API is based on Thrift, which offers no 
streaming abilities;  any value written or fetched has to fit in to memory.  
Other non Thrift  interfaces may solve this problem in the future, but there 
are currently no plans to change Thrift's behavior.  When planning  
applications that require storing BLOBS, you should also consider these 
attributes of Cassandra as well:
- 
-  * The main limitation on a column and super column size is that all the data 
for a single key and column must fit (on disk) on a single machine(node) in the 
cluster.  Because keys alone are used to determine the nodes responsible for 
replicating their data, the amount of data associated with a single key has 
this upper bound. This is an inherent limitation of the distribution model.
- 
-  * When large columns are created and retrieved, that columns data is loaded 
into RAM which  can get resource intensive quickly.  Consider, loading  200 
rows with columns  that store 10Mb image files each into RAM.  That small 
result set would consume about 2Gb of RAM.  Clearly as more and more large 
columns are loaded,  RAM would start to get consumed quickly.  This can be 
worked around, but will take some upfront planning and testing to get a 
workable solution for most applications.  You can find more information 
regarding this behavior here: [[MemtableThresholds|memtables]], and a possible 
solution in 0.7 here: 
[[https://issues.apache.org/jira/browse/CASSANDRA-16|CASSANDRA-16]].
- 
-  * Please refer to the notes in the Cassandra limitations section for more 
information: [[CassandraLimitations|Cassandra Limitations]]
- 
- <<Anchor(jmx_localhost_refused)>>
- 
- == Nodetool says "Connection refused to host: 127.0.1.1" for any remote host. 
What gives? ==
- Nodetool relies on JMX, which in turn relies on RMI, which in turn sets up 
its own listeners and connectors as needed on each end of the exchange. 
Normally all of this happens behind the scenes transparently, but incorrect 
name resolution for either the host connecting, or the one being connected to, 
can result in crossed wires and confusing exceptions.
- 
- If you are not using DNS, then make sure that your `/etc/hosts` files are 
accurate on both ends. If that fails, try setting the 
`-Djava.rmi.server.hostname=<public name>` JVM option near the bottom of 
`cassandra-env.sh` to an interface that you can reach from the remote machine.
- 
- <<Anchor(iter_world)>>
- 
- == How can I iterate over all the rows in a ColumnFamily? ==
- Use a CQL client ([ClientOptions]) and Cassandra 2.0.  The cursor support in 
2.0 means you can just write "SELECT * FROM foo" and paging through the 
resultset will be handled automatically.
- 
- Alternatively, you may with to use HadoopSupport.
- 
- <<Anchor(gui)>>
- 
- == Is there a GUI admin tool for Cassandra? ==
-  * [[http://www.datastax.com/products/opscenter|DataStax Opscenter]], a 
management and monitoring tool for Cassandra with a web-based UI.
-  * [[https://github.com/sebgiroux/Cassandra-Cluster-Admin|Cassandra Cluster 
Admin]], a PHP-based web UI.
-  * [[http://toadforcloud.com | Toad for Cloud Databases]], a desktop 
application and Eclipse plugin which support Cassandra.
-  * [[http://dbeaver.jkiss.org | DBeaver]], a desktop application, support 
Cassandra using JDBC driver.
- 
- <<Anchor(clustername_mismatch)>>
- 
- == Cassandra says "ClusterName mismatch: oldClusterName != newClusterName" 
and refuses to start ==
- To prevent operator errors, Cassandra stores the name of the cluster in its 
system table.  If you need to rename a cluster for some reason, you can:
- 
- Perform these steps on each node:
-  1. Start the `cassandra-cli` connected locally to this node.
-  1. Run the following:
-    a. use system;
-    a. set !LocationInfo[utf8('L')][utf8('!ClusterName')]=utf8('<new cluster 
name>');
-    a. exit;
-  1. Run `nodetool flush` on this node.
-  1. Update the cassandra.yaml file for the cluster_name as the same as 2b).
-  1. Restart the node.
- 
- Once all nodes have been had this operation performed and restarted, 
`nodetool ring` should show all nodes as UP.
- 
- <<Anchor(batch_mutate_atomic)>>
- 
- == Are batchoperations atomic? ==
- 
- Since Cassandra 1.2, CQL batches are atomic by default 
(http://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2).  Thrift API 
users must call atomic_batch_mutate instead of batch_mutate if they want this 
behavior.
- 
- <<Anchor(batch_bulkload)>>
- 
- == Will batching my operations speed up my bulk load? ==
- 
- '''NO.'''  Using batches to load data will just add "spikes" of latency.  
Don't do this.  Use asynchronous INSERTs instead, or use true BulkLoading.
- 
- (Minor exception: batching updates to a single partition can be a Good Thing. 
 But never ever blindly batch everything!)
- 
- <<Anchor(hadoop_support)>>
- 
- == Is Hadoop (i.e. Map/Reduce, Pig, Hive) supported? ==
- For the latest on Hadoop-Cassandra integration, see HadoopSupport.
- 
- <<Anchor(multi_tenant)>>
- 
- == Can a Cassandra cluster be multi-tenant? ==
- Most users of Cassandra stand up a cluster for each application or related 
set of applications as it is much simpler to tune and troubleshoot.  There has 
been work done to support more multi-tenant capabilities such as scheduling and 
auth.  For more information, see MultiTenant.  However the well-trodden path is 
definitely single-tenant.
- 
- <<Anchor(using_cassandra)>>
- 
- == Who is using Cassandra and for what? ==
- See http://planetcassandra.org/Company/ViewCompany?IndustryId=-1.
- 
- <<Anchor(what_about_the_obdc)>>
- 
- == Are there any OBDC drivers for Cassandra? ==
- 
- Yes: 
http://www.datastax.com/dev/blog/using-the-datastax-odbc-driver-for-apache-cassandra
- 
- <<Anchor(logging_using_cassandra)>>
- 
- == Are there ways to do logging directly to Cassandra? ==
- For information on logging directly to Cassandra, see LoggingToCassandra.
- 
- <<Anchor(rhel_selinux)>>
- 
- == On RHEL nodes are unable to join the ring ==
- Check if selinux is on; if it is, turn it OFF.
- 
- <<Anchor(auth)>>
- 
- == Is there an authentication/authorization mechanism for Cassandra? ==
- Yes. For details, see ExtensibleAuth.
- 
- <<Anchor(bulkloading)>>
- 
- == How do I bulk load data into Cassandra? ==
- See BulkLoading
- 
- <<Anchor(unsubscribe)>>
- 
- == How do I unsubscribe from the email list? ==
- Send an email to user-unsubscr...@cassandra.apache.org
- 
- <<Anchor(mmap)>>
- 
- == Why does top report that Cassandra is using a lot more memory than the 
Java heap max? ==
- Cassandra uses mmap to do zero-copy reads. That is, we use the operating 
system's virtual memory system to map the sstable data files into the Cassandra 
process' address space. This will "use" virtual memory; i.e. address space, and 
will be reported by tools like top accordingly, but on 64 bit systems virtual 
address space is effectively unlimited so you should not worry about that.
- 
- What matters from the perspective of "memory use" in the sense as it is 
normally meant, is the amount of data allocated on brk() or mmap'd /dev/zero, 
which represent real memory used.  The key issue is that for a mmap'd file, 
there is never a need to retain the data resident in physical memory. Thus, 
whatever you do keep resident in physical memory is essentially just there as a 
cache, in the same way as normal I/O will cause the kernel page cache to retain 
data that you read/write.
- 
- The difference between normal I/O and mmap() is that in the mmap() case the 
memory is actually mapped to the process, thus affecting the virtual size as 
reported by top. The main argument for using mmap() instead of standard I/O is 
the fact that reading entails just touching memory - in the case of the memory 
being resident, you just read it - you don't even take a page fault (so no 
overhead in entering the kernel and doing a semi-context switch). This is 
covered in more detail 
[[http://www.varnish-cache.org/trac/wiki/ArchitectNotes|here]].
- 
- <<Anchor(jna)>>
- 
- == I'm getting java.io.IOException: Cannot run program "ln" when trying to 
snapshot or update a keyspace ==
- Updating a keyspace first takes a snapshot. This involves creating hardlinks 
to the existing SSTables, but Java has no native way to create hard links, so 
it must fork "ln". When forking, there must be as much memory free as the 
parent process, even though the child isn't going to use it all.  Because Java 
is a large process, this is problematic.  The solution is to install 
[[http://jna.java.net/|Java Native Access]] so it can create the hard links 
itself.
- 
- <<Anchor(replicaplacement)>>
- 
- == How does Cassandra decide which nodes have what data? ==
- The set of nodes (a single node, or several) responsible for any given piece 
of data is determined by:
- 
-  * The row key (data is partitioned on row key)
-  * The replication factor (decides ''how many'' nodes are in the replica set 
for a given row)
-  * The replication strategy (decides ''which'' nodes are part of said replica 
set)
- 
- In the case of the !SimpleStrategy, replicas are placed on succeeding nodes 
in the ring. The first node is determined by the partitioner and the row key, 
and the remainder are placed on succeeding node. In the case of 
!NetworkTopologyStrategy placement is affected by data-center and rack 
awareness, and the placement will depend on how nodes in different racks or 
data centers are placed in the ring.
- 
- It is important to understand that Cassandra ''does not'' alter the replica 
set for a given row key based on changing characteristics like current load, 
which nodes are up or down, or which node your client happens to talk to.
- 
- <<Anchor(cachehitrateunits)>>
- 
- == I have a row or key cache hit rate of 0.XX123456789 reported by JMX.  Is 
that XX% or 0.XX% ? ==
- XX%
- 
- <<Anchor(seed)>>
- 
- == What are seeds? ==
- Seeds are used during startup to discover the cluster
- 
- If you configure your nodes to refer some node as seed, nodes in your ring 
tend to send Gossip message to seeds more often ( Refer to ArchitectureGossip 
for details ) than to non-seeds. In other words, seeds are worked as hubs of 
Gossip network. With seeds, each node can detect status changes of other nodes 
quickly.
- 
- Seeds are also referred by new nodes on bootstrap to learn other nodes in 
ring. When you add a new node to ring, you need to specify at least one live 
seed to contact. Once a node join the ring, it learns about the other nodes, so 
it doesn't need seed on subsequent boot.
- 
- Newer versions of cassandra persist the cluster topology making seeds less 
important then they were in the 0.6.X series, where they were used every startup
- 
- You can make a seed a node at any time. There is nothing special about seed 
nodes. If you list the node in seed list it is a seed
- 
- Seeds do not auto bootstrap (ie if a node has itself in its seed list it will 
not automatically transfer data to itself) If you want a node to do that 
bootstrap it first and then add it to seeds later. If you have no data (new 
install) you do not have to worry about bootstrap or autobootstrap at all.
- 
- Recommended usage of seeds:
- 
-  * pick two (or more) nodes per data center as seed nodes.
-  * sync the seed list to all your nodes
- 
- <<Anchor(seed_spof)>>
- 
- == Does single seed mean single point of failure? ==
- The ring can operate or boot without a seed; however, you will not be able to 
add new nodes to the cluster. It is recommended to configure multiple seeds in 
production system.
- 
- <<Anchor(jconsole_array_arg)>>
- 
- == Why can't I call jmx method X on jconsole? (ex. getNaturalEndpoints) ==
- Some of JMX operations can't be called with jconsole because the buttons are 
inactive for them. Jconsole doesn't support array argument, so operations which 
need array as arugument can't be invoked on jconsole. You need to write a JMX 
client to call such operations or need array-capable JMX monitoring tool.
- 
- <<Anchor(max_key_size)>>
- 
- == What's the maximum key size permitted? ==
- The key (and column names) must be under 64K bytes.
- 
- Routing is O(N) of the key size and querying and updating are O(N log N). In 
practice these factors are usually dwarfed by other overhead, but some users 
with very large "natural" keys use their hashes instead to cut down the size.
- 
- <<Anchor(ubuntu_ec2_hangs)>> <<Anchor(ubuntu_hangs)>>
- 
- == I'm using Ubuntu with JNA, and weird things keep hanging and stalling and 
blocking and printing scary tracebacks in dmesg! ==
- We have come across several different, but similar, sets of symptoms that 
might match what you're seeing. They might all have the same root cause; it's 
not clear. One common piece is messages like this in dmesg:
- 
- {{{
- INFO: task (some_taskname):(some_pid) blocked for more than 120 seconds.
- "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
- }}}
- It does not seem that anyone has had the time to track this down to the real 
root cause, but it does seem that upgrading the linux-image package and 
rebooting your instances fixes it. There is likely some bug in several of the 
kernel builds distributed by Ubuntu which is fixed in later versions. Versions 
of linux-image-* which are known not to have this problem include:
- 
-  * linux-image-2.6.38-10-virtual (2.6.38-10.46) (Ubuntu 11.04/Natty Narwhal)
-  * linux-image-2.6.35-24-virtual (2.6.35-24.42) (Ubuntu 10.10/Maverick 
Meerkat)
- 
- Uninstalling libjna-java or recompiling Cassandra with 
CLibrary.tryMlockall()'s mlockall() call commented out also make at least some 
sorts of this problem go away, but that's a lot less desirable of a fix.
- 
- If you have more information on the problem and better ways to avoid it, 
please do update this space.
- 
- <<Anchor(schema_disagreement)>>
- 
- == What are schema disagreement errors and how do I fix them? ==
- Prior to Cassandra 1.1 and 1.2, Cassandra schema updates 
[[LiveSchemaUpdates|assume that schema changes are done one-at-a-time]].  If 
you make multiple changes at the same time, you can cause some nodes to end up 
with a different schema, than others.  (Before 0.7.6, this can also be caused 
by cluster system clocks being substantially out of sync with each other.)
- 
- To fix schema disagreements, you need to force the disagreeing nodes to 
rebuild their schema.  Here's how:
- 
- Open the cassandra-cli and run: 'connect localhost/9160;', then 'describe 
cluster;'.  You'll see something like this:
- 
- {{{
- [default@unknown] describe cluster;
- Cluster Information:
-    Snitch: org.apache.cassandra.locator.SimpleSnitch
-    Partitioner: org.apache.cassandra.dht.RandomPartitioner
-    Schema versions:
- 75eece10-bf48-11e0-0000-4d205df954a7: [192.168.1.9, 192.168.1.25]
- 5a54ebd0-bd90-11e0-0000-9510c23fceff: [192.168.1.27]
- }}}
- Note which schemas are in the minority and mark down those IPs -- in the 
above example, 192.168.1.27. Login to each of those machines and cleaninly stop 
the Cassandra service/process, typically by running:
- 
-  * nodetool disablethrift
-  * nodetool disablegossip
-  * nodetool drain
-  * 'sudo service cassandra stop' or 'kill <pid>'. 
- 
- At the end of this process the commit log directory 
(/var/lib/cassandra/commitlog) should contain only a single small file. 
-  
- Remove the Schema* and Migration* sstables inside of your system keyspace 
(/var/lib/cassandra/data/system, if you're using the defaults).
- 
- After starting Cassandra again, this node will notice the missing information 
and pull in the correct schema from one of the other nodes. In version 1.0.X 
and before the schema is applied one mutation at a time. While it is being 
applied the node may log messages, such as the one below, that a Column Family 
cannot be found. These messages can be ignored. 
- 
- {{{
- ERROR [MutationStage:1] 2012-05-18 16:23:15,664 RowMutationVerbHandler.java 
(line 61) Error in row mutation
- org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find cfId=1012
- }}}
- 
- To confirm everything is on the same schema, verify that 'describe cluster;' 
only returns one schema version.
- 
- <<Anchor(dropped_messages)>>
- 
- == Why do I see "... messages dropped.." in the logs? ==
- This is a symptom of load shedding -- Cassandra defending itself against more 
requests than it can handle.
- 
- Internode messages which are received by a node, but do not get not to be 
processed within rpc_timeout are dropped rather than processed. As the 
coordinator node will no longer be waiting for a response. If the Coordinator 
node does not receive Consistency Level responses before the rpc_timeout it 
will return a !TimedOutException to the client. If the coordinator receives 
Consistency Level responses it will return success to the client.
- 
- For MUTATION messages this means that the mutation was not applied to all 
replicas it was sent to. The inconsistency will be repaired by Read Repair or 
Anti Entropy Repair.
- 
- For READ messages this means a read request may not have completed.
- 
- Load shedding is part of the Cassandra architecture, if this is a persistent 
issue it is generally a sign of an overloaded node or cluster.
- 
- <<Anchor(memlock)>>
- == Cassandra dies with "java.lang.OutOfMemoryError: Map failed" ==
- '''IF''' Cassandra is dying specifically with the "Map failed" message it 
means the OS is denying java the ability to lock more memory.  In linux, this 
typically means memlock is limited.  Check /proc/<pid of cassandra>/limits to 
verify this and raise it (eg, via ulimit in bash.)  You may also need to 
increase  vm.max_map_count.  Note that the debian and redhat packages handle 
this for you automatically.
- 
- <<Anchor(opp)>>
- == Why should I avoid order-preserving partitioners? ==
- See Partitioners.
- 
- <<Anchor(clocktie)>>
- == What happens if two updates are made with the same timestamp? ==
- 
- Updates must be commutative, since they may arrive in different orders on 
different replicas.  As long as Cassandra has a deterministic way to pick the 
winner, the one selected is as valid as any other, and the specifics should be 
treated as an implementation detail.  That said, in the case of a timestamp 
tie, Cassandra follows two rules: first, deletes take precedence over 
inserts/updates.  Second, if there are two updates, the one with the lexically 
larger value is selected.
- 
- {{https://c.statcounter.com/9397521/0/fe557aad/1/|stats}}
- 
- <<Anchor(bootstreamfail)>>
- == Why bootstrapping a new node fails with a "Stream failed" error? ==
- 
- Two main possibilities:
-   1. the GC may be creating long pauses disrupting the streaming process
-   2. compactions happening in the background hold streaming long enough that 
the TCP connection fails
- 
- In the first case, regular GC tuning advices apply. In the second case, you 
need to set TCP keepalive to a lower value (default is very high on Linux). Try 
to just run the following:
- 
- $ sudo /sbin/sysctl -w net.ipv4.tcp_keepalive_time=60 
net.ipv4.tcp_keepalive_intvl=60 net.ipv4.tcp_keepalive_probes=5
- 
- To make those settings permanent, add them to your /etc/sysctl.conf file.
- 
- Note: GCE's firewall will always interrupt TCP connections that are inactive 
for more than 10 min. Running the above command is highly recommended in that 
environment.
-

[Cassandra Wiki] Update of "FAQ" by JonathanEllis

Reply via email to