[Cassandra Wiki] Update of "FAQ" by JonathanEllis

Apache Wiki Thu, 05 Sep 2013 19:53:32 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "FAQ" page has been changed by JonathanEllis:
https://wiki.apache.org/cassandra/FAQ?action=diff&rev1=162&rev2=163

Comment:
clean out obsolete questions

  = Frequently asked questions =
   * [[#cant_listen_on_ip_any|Why can't I make Cassandra listen on 0.0.0.0 (all 
my addresses)?]]
   * [[#ports|What ports does Cassandra use?]]
-  * [[#slows_down_after_lotso_inserts|Why does Cassandra slow down after doing 
a lot of inserts?]]
   * [[#existing_data_when_adding_new_nodes|What happens to existing data in my 
cluster when I add new nodes?]]
-  * [[#node_clients_connect_to|Does it matter which node a Thrift or 
higher-level client connects to?]]
   * [[#what_kind_of_hardware_should_i_use|What kind of hardware should I run 
Cassandra on?]]
   * [[#architecture|What are SSTables and Memtables?]]
   * [[#working_with_timeuuid_in_java|Why is it so hard to work with 
TimeUUIDType in Java?]]
   * [[#i_deleted_what_gives|I delete data from Cassandra, but disk usage stays 
the same. What gives?]]
   * [[#cloned|Why does nodetool ring only show one entry, even though my nodes 
logged that they see each other joining the ring?]]
-  * [[#range_ghosts|Why do deleted keys show up during range scans?]]
   * [[#change_replication|Can I change the ReplicationFactor on a live 
cluster?]]
   * [[#large_file_and_blob_storage|Can I store large files or BLOBs in 
Cassandra?]]
   * [[#jmx_localhost_refused|Nodetool says "Connection refused to host: 
127.0.1.1" for any remote host. What gives?]]
   * [[#iter_world|How can I iterate over all the rows in a ColumnFamily?]]
-  * [[#no_keyspaces|Why were none of the keyspaces described in 
storage-conf.xml loaded?]]
   * [[#gui|Is there a GUI admin tool for Cassandra?]]
   * [[#clustername_mismatch|Cassandra says "ClusterName mismatch: 
oldClusterName != newClusterName" and refuses to start]]
   * [[#batch_mutate_atomic|Are batch_mutate operations atomic?]]
@@ -28, +24 @@

   * [[#rhel_selinux|Problems using on RHEL?]]
   * [[#auth|Is there an authentication/authorization mechanism for Cassandra?]]
   * [[#bulkloading|How do I bulk load data into Cassandra?]]
-  * [[#range_rp|Why aren't range slices/sequential scans giving me the 
expected results?]]
   * [[#unsubscribe|How do I unsubscribe from the email list?]]
   * [[#mmap|Why does top report that Cassandra is using a lot more memory than 
the Java heap max?]]
   * [[#jna|I'm getting java.io.IOException: Cannot run program "ln" when 
trying to snapshot or update a keyspace]]
@@ -41, +36 @@

   * [[#ubuntu_hangs|I'm using Ubuntu with JNA, and holy crap weird things keep 
hanging and stalling and printing scary tracebacks in dmesg!]]
   * [[#schema_disagreement|What are schema disagreement errors and how do I 
fix them?]]
   * [[#dropped_messages|Why do I see "... messages dropped.." in the logs?]]
-  * [[#cli_keys|Why does the 0.8 cli not assume keys are strings anymore?]]
   * [[#memlock|Cassandra dies with "java.lang.OutOfMemoryError: Map failed"]]
   * [[#opp|Why should I avoid order-preserving partitioners?]]
   * [[#clocktie|What happens if two updates are made with the same timestamp?]]
@@ -62, +56 @@

  == What ports does Cassandra use? ==
  By default, Cassandra uses 7000 for cluster communication (7001 if SSL is 
enabled), 9160 for clients (Thrift), and 7199 for [[JmxInterface|JMX]].  The 
internode communication and Thrift ports are configurable in cassandra.yaml, 
and the JMX port is configurable in cassandra-env.sh (through JVM options). All 
ports are TCP. See also RunningCassandra.
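
As a rough illustration of the defaults listed above, the following Python sketch tries a TCP connection to each port. The port numbers come from the text; the function name `check_ports` and the dictionary keys are hypothetical names chosen for this example, and the result only says whether something accepts connections on a port, not that it is Cassandra.

```python
import socket

# Default ports named in the FAQ entry above; all are TCP.
# The keys are illustrative labels, not Cassandra configuration names.
DEFAULT_PORTS = {
    "internode": 7000,
    "internode_ssl": 7001,
    "thrift": 9160,
    "jmx": 7199,
}


def check_ports(host, ports=DEFAULT_PORTS, timeout=1.0):
    """Return {label: bool} saying whether a TCP connect to each port succeeded."""
    results = {}
    for label, port in ports.items():
        try:
            # create_connection raises OSError on refusal or timeout.
            with socket.create_connection((host, port), timeout=timeout):
                results[label] = True
        except OSError:
            results[label] = False
    return results


if __name__ == "__main__":
    for label, reachable in check_ports("127.0.0.1").items():
        print(label, "reachable" if reachable else "not reachable")
```

If the ports were changed in cassandra.yaml or cassandra-env.sh, pass a dictionary with the configured values instead of the defaults.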
  
- <<Anchor(slows_down_after_lotso_inserts)>>
- 
- == Why does Cassandra slow down after doing a lot of inserts? ==
- This is a symptom of memory pressure, resulting in a storm of GC operations 
as the JVM frantically tries to free enough heap to continue to operate.  
Eventually, the server will crash from OutOfMemory; usually, but not always, it 
will be able to log this final error before the JVM terminates.
- 
- You can increase the amount of memory the JVM uses, or decrease the insert 
threshold before Cassandra flushes its memtables.  See MemtableThresholds for 
details.
- 
- Setting your cache sizes too large can result in memory pressure.
- 
  <<Anchor(existing_data_when_adding_new_nodes)>>
  
  == What happens to existing data in my cluster when I add new nodes? ==
  When a new node joins a cluster, it automatically contacts the other 
nodes in the cluster and copies the right data to itself.
  
- In general, you should set the `initial_token` config option in 
cassandra.yaml before starting a new node. Otherwise, a suboptimal token may be 
selected automatically, leading to an unbalanced ring.  See 
[[Operations#Token_selection|token selection]] in the operations wiki.
- 
- <<Anchor(node_clients_connect_to)>>
- 
- == Does it matter which node a Thrift or higher-level client connects to? ==
- No, any node in the cluster will work; Cassandra nodes proxy your request as 
needed. This leaves the client with a number of options for end point selection:
- 
-  1. You can maintain a list of contact nodes (all or a subset of the nodes in 
the cluster), and configure your clients to choose among them.
-  1. Use round-robin DNS and create a record that points to a set of contact 
nodes (recommended).
-  1. Use the `describe_ring(keyspace)` Thrift RPC call to obtain an 
update-to-date list of the nodes in the cluster and cycle through them.
-  1. Deploy a load-balancer, proxy, etc.
- 
- When using a higher-level client you should investigate which, if any, 
options are implemented by your higher-level client to help you distribute your 
requests across nodes in a cluster.
- 
  <<Anchor(what_kind_of_hardware_should_i_use)>>
  
  == What kind of hardware should I run Cassandra on? ==
@@ -98, +69 @@

  <<Anchor(architecture)>>
  
  == What are SSTables and Memtables? ==
- See [[MemtableSSTable]] and MemtableThresholds.
+ See [[MemtableSSTable]].
  
  <<Anchor(working_with_timeuuid_in_java)>>
  
@@ -209, +180 @@

  
  The easiest fix is to wipe the data and commitlog directories, thus making 
sure that each node will generate a random token on the next restart.
  
- <<Anchor(range_ghosts)>>
- 
- == Why do deleted keys show up during range scans? ==
- Because get_range_slice says, "apply this predicate to the range of rows 
given," meaning, if the predicate result is empty, we have to include an empty 
result for that row key.  It is perfectly valid to perform such a query 
returning empty column lists for some or all keys, even if no deletions have 
been performed.
- 
- So to special case leaving out result entries for deletions, we would have to 
check the entire rest of the row to make sure there is no undeleted data 
anywhere else either (in which case leaving the key out would be an error).
- 
- This is what we used to do with the old get_key_range method, but the 
performance hit turned out to be unacceptable.
- 
- See DistributedDeletes for more on how deletes work in Cassandra.
- 
  <<Anchor(change_replication)>>
  
  == Can I change the ReplicationFactor on a live cluster? ==
@@ -256, +216 @@

  <<Anchor(iter_world)>>
  
  == How can I iterate over all the rows in a ColumnFamily? ==
- Simple but slow: Use get_range_slices, start with the empty string, and after 
each call use the last key read as the start key in the next iteration.
+ Use a CQL client ([ClientOptions]) and Cassandra 2.0.  The cursor support in 
2.0 means you can just write "SELECT * FROM foo" and paging through the 
resultset will be handled automatically.
  
+ Alternatively, you may wish to use HadoopSupport.
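
To picture what "handled automatically" means for a query such as "SELECT * FROM foo", here is a minimal Python model of result paging. It is a sketch, not driver code: `fetch_page` and `iterate_all` are hypothetical names, an in-memory list stands in for the server, and a real driver tracks an opaque paging state rather than an integer offset.

```python
def fetch_page(rows, start, fetch_size):
    """Stand-in for one server round trip.

    Returns up to fetch_size rows plus the offset of the next page,
    or None as the offset once the result set is exhausted.
    """
    page = rows[start:start + fetch_size]
    next_start = start + len(page)
    return page, (next_start if next_start < len(rows) else None)


def iterate_all(rows, fetch_size=100):
    """Yield every row, requesting the next page only when it is needed."""
    state = 0
    while state is not None:
        page, state = fetch_page(rows, state, fetch_size)
        yield from page


if __name__ == "__main__":
    table = [("key%d" % i, i) for i in range(250)]
    total = sum(1 for _ in iterate_all(table, fetch_size=100))
    print(total)
```

The caller only sees a flat iterator; the page boundaries stay inside `iterate_all`, which is the convenience the cursor support provides.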
- Most clients support an easy way to do this.  For example, 
[[http://pycassa.github.com/pycassa/api/pycassa/columnfamily.html#pycassa.columnfamily.ColumnFamily.get_range|pycassa's
 get_range()]], and 
[[http://thobbs.github.com/phpcassa/api/class-phpcassa.ColumnFamily.html#_get_range|phpcassa's
 get_range()]] return an iterator that fetches the next batch of rows 
automatically.  Hector has an 
[[https://github.com/zznate/hector-examples/blob/master/src/main/java/com/riptano/cassandra/hector/example/PaginateGetRangeSlices.java|example
 of how to do this]].
- 
- Better: use HadoopSupport.
- 
- <<Anchor(no_keyspaces)>>
- 
- == Why were none of the keyspaces described in storage-conf.xml loaded? ==
- Prior to 0.7, cassandra loaded a set of static keyspaces defined in a 
storage-conf.xml file.  
[[https://issues.apache.org/jira/browse/CASSANDRA-44|CASSANDRA-44]] added the 
ability to modify schema dynamically on a live cluster.  Part of this change 
required that we ignore the schema defined in storage-conf.xml.  Additionally, 
0.7 converts to YAML based configuration.
- 
- If you have an existing storage-conf.xml file, you will first need to convert 
it to YAML using the `bin/config-converter` tool, which can generate a 
cassandra.yaml file from a storage-conf.xml file.  Once you have a 
cassandra.yaml, it is possible to do a one-time load of the schema it defines.  
0.7 adds a `loadSchemaFromYAML` method to `StorageServiceMBean` (triggered via 
JMX: see https://issues.apache.org/jira/browse/CASSANDRA-1001 ) which will load 
the schema defined in cassandra.yaml, but this is a one-time operation.  A node 
that has had its schema defined via `loadSchemaFromYAML` will load its schema 
from the system table on subsequent restarts, which means that any further 
changes to the schema need to be made using the `system_*` thrift operations 
(see [[API]]).
- 
- It is recommended that you only perform schema updates on one node and let 
cassandra propagate changes to the rest of the cluster.  If you try to perform 
the same updates simultaneously on multiple nodes, you run the risk of 
introducing inconsistent migrations, which will lead to a confused cluster.
- 
- See LiveSchemaUpdates for more information.
  
  <<Anchor(gui)>>
  
  == Is there a GUI admin tool for Cassandra? ==
   * [[http://www.datastax.com/products/opscenter|DataStax Opscenter]], a 
management and monitoring tool for Cassandra with a web-based UI.
-  * [[http://code.google.com/p/cassandra-gui|cassandra-gui]], a Swing data 
browser.
   * [[https://github.com/sebgiroux/Cassandra-Cluster-Admin|Cassandra Cluster 
Admin]], a PHP-based web UI.
   * [[http://toadforcloud.com | Toad for Cloud Databases]], a desktop 
application and Eclipse plugin which support Cassandra.
   * [[http://dbeaver.jkiss.org | DBeaver]], a desktop application that 
supports Cassandra through a JDBC driver.
@@ -302, +248 @@

  <<Anchor(batch_mutate_atomic)>>
  
  == Are batch_mutate operations atomic? ==
- As a special case, mutations against a single key are atomic but not 
isolated. Reads which occur during such a mutation may see part of the write 
before they see the whole thing. More generally, batch_mutate operations are 
not atomic. [[API#batch_mutate|batch_mutate]] allows grouping operations on 
many keys into a single call in order to save on the cost of network 
round-trips. If `batch_mutate` fails in the middle of its list of mutations, no 
rollback occurs and the mutations that have already been applied stay applied. 
The client should typically retry the `batch_mutate` operation.
+ 
+ Since Cassandra 1.2, CQL batches are atomic by default 
(http://www.datastax.com/dev/blog/atomic-batches-in-cassandra-1-2).  Thrift API 
users must call atomic_batch_mutate instead of batch_mutate if they want this 
behavior.
  
  <<Anchor(hadoop_support)>>
  
@@ -317, +264 @@

  <<Anchor(using_cassandra)>>
  
  == Who is using Cassandra and for what? ==
- See CassandraUsers.
+ See http://planetcassandra.org/Company/ViewCompany?IndustryId=-1.
  
  <<Anchor(what_about_the_obdc)>>
  
  == Are there any ODBC drivers for Cassandra? ==
- No.
+ 
+ Yes: 
http://www.datastax.com/dev/blog/using-the-datastax-odbc-driver-for-apache-cassandra
  
  <<Anchor(logging_using_cassandra)>>
  
@@ -332, +280 @@

  <<Anchor(rhel_selinux)>>
  
  == On RHEL nodes are unable to join the ring ==
- Check if selinux is on, if it is turn it OFF
+ Check if selinux is on; if it is, turn it OFF.
  
  <<Anchor(auth)>>
  
@@ -343, +291 @@

  
  == How do I bulk load data into Cassandra? ==
  See BulkLoading
- 
- <<Anchor(range_rp)>>
- 
- == Why aren't range slices/sequential scans giving me the expected results? ==
- You're probably using the RandomPartitioner.  This is the default because it 
avoids hotspots, but it means your rows are ordered by the md5 of the row key 
rather than lexicographically by the raw key bytes.
- 
- You '''can''' start out with a start key and end key of [empty] and use the 
row count argument instead, if your goal is paging the rows.  To get the next 
page, start from the last key you got in the previous page. This is what the 
Cassandra Hadoop RecordReader does, for instance.
- 
- You can also use intra-row ordering of column names to get ordered results 
'''within''' a row; with appropriate row 'bucketing,' you often don't need the 
rows themselves to be ordered.
  
  <<Anchor(unsubscribe)>>
  
@@ -413, +352 @@

  <<Anchor(seed_spof)>>
  
  == Does single seed mean single point of failure? ==
- If you are using replicated CF on the ring, only one seed in the ring doesn't 
mean single point of failure. The ring can operate or boot without the seed. 
However, it will need more time to spread status changes of node over the ring. 
It is recommended to have multiple seeds in production system.
+ The ring can operate or boot without a seed; however, you will not be able to 
add new nodes to the cluster. It is recommended to configure multiple seeds in 
production systems.
  
  <<Anchor(jconsole_array_arg)>>
  
@@ -486, +425 @@

  <<Anchor(dropped_messages)>>
  
  == Why do I see "... messages dropped.." in the logs? ==
+ This is a symptom of load shedding -- Cassandra defending itself against more 
requests than it can handle.
+ 
  Internode messages which are received by a node but are not processed within 
rpc_timeout are dropped rather than processed, as the coordinator node will no 
longer be waiting for a response. If the coordinator node does not receive 
Consistency Level responses before the rpc_timeout, it will return a 
!TimedOutException to the client. If the coordinator receives Consistency Level 
responses, it will return success to the client.
  
  For MUTATION messages this means that the mutation was not applied to all 
replicas it was sent to. The inconsistency will be repaired by Read Repair or 
Anti Entropy Repair.
@@ -493, +434 @@

  For READ messages this means a read request may not have completed.
  
  Load shedding is part of the Cassandra architecture; if this is a persistent 
issue, it is generally a sign of an overloaded node or cluster.
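
The two rules described in this entry can be sketched in Python as follows. This is an illustrative model under stated assumptions, not Cassandra's implementation: `drop_expired` and `coordinator_result` are hypothetical helpers, times are plain numbers in seconds, and the `TimedOutException` class here is a local stand-in for the exception named in the text.

```python
class TimedOutException(Exception):
    """Local stand-in for the exception returned to the client."""


def drop_expired(queue, now, rpc_timeout):
    """Replica side: split (enqueued_at, message) pairs into kept and dropped.

    A message that has waited longer than rpc_timeout is dropped, because
    the coordinator is no longer waiting for its response.
    """
    kept, dropped = [], []
    for enqueued_at, message in queue:
        if now - enqueued_at > rpc_timeout:
            dropped.append(message)
        else:
            kept.append(message)
    return kept, dropped


def coordinator_result(responses_received, required_by_consistency_level):
    """Coordinator side: succeed only if enough replicas answered in time."""
    if responses_received < required_by_consistency_level:
        raise TimedOutException(
            "got %d of %d required responses"
            % (responses_received, required_by_consistency_level)
        )
    return "success"


if __name__ == "__main__":
    queue = [(0.0, "MUTATION a"), (9.5, "READ b")]
    print(drop_expired(queue, now=11.0, rpc_timeout=10.0))
    print(coordinator_result(2, 2))
```

Note that a dropped MUTATION on one replica does not by itself fail the request: the coordinator still reports success if the remaining replicas satisfy the consistency level.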
- 
- <<Anchor(cli_keys)>>
- 
- == Why does the 0.8 cli not assume keys are strings anymore? ==
- Prior to 0.8, there was no type metadata available for row keys, and the cli 
interface treated all keys as strings.  This made the cli unusable for the many 
applications whose rows were numeric, uuids, or other non-string data.
- 
- 0.8 added key_validation_class to the !ColumnFamily definition, similarly to 
the existing comparator for column names, and column_metadata validation_class 
for column values.  This both lets clients know the expected data type, and 
rejects updates with non-conformant values.
- 
- To preserve application compatibility, the default key_validation_class is 
BytesType, i.e., "anything goes."  The CLI expects bytes to be provided in hex.
- 
- If all your keys are of the same type, you should add information to the CF 
metadata, e.g., "update column family <cf> with key_validation_class = 
'UTF8Type'".  If you have heterogeneous keys, you can tell the cli what type to 
use on case-by-case basis, as in, "assume <cf> keys as utf8".
  
  <<Anchor(memlock)>>
  == Cassandra dies with "java.lang.OutOfMemoryError: Map failed" ==
