[Cassandra Wiki] Update of "StorageConfiguration" by JonHermes

Apache Wiki Wed, 25 Aug 2010 09:26:29 -0700

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Cassandra Wiki" for 
change notification.


The "StorageConfiguration" page has been changed by JonHermes.
http://wiki.apache.org/cassandra/StorageConfiguration?action=diff&rev1=35&rev2=36

--------------------------------------------------

  ||''-Dcassandra.config=http://www.example.com/remote-cassandra.yaml'' ||loads 
a configuration file from a remote host. ||
  ||''-Dcassandra.config=file:///home/me/external-local-cassandra.yaml'' 
||loads a local configuration file that is not located in the cassandra 
classpath. ||
  
+ 
+ 
+ 
  == "Where are my keyspaces?" ==
  LiveSchemaUpdates. You can load the schema once by using:
  
  {{{
  bin/schematool HOST PORT import
  }}}
- 
  = Config Overview =
  This overview does not cover every value, just the interesting ones. When in 
doubt, check the comments in the default cassandra.yaml, where the settings are 
well documented.
  
  == per-Cluster (Global) Settings ==
   * '''authenticator'''
+ 
  Allows for pluggable authentication of users, which defines whether it is 
necessary to call the Thrift 'login' method, and which parameters are required 
to log in. The default '!AllowAllAuthenticator' does not require users to call 
'login': any user can perform any operation. The other built-in option is 
'!SimpleAuthenticator', which requires users and passwords to be defined in 
property files, and for users to call login with a valid combination.
  
  Default is: 'org.apache.cassandra.auth.AllowAllAuthenticator', a no-op.
  
   * '''auto_bootstrap'''
+ 
  Set to 'true' to make new [non-seed] nodes automatically migrate the right 
data to themselves.  (If no InitialToken is specified, they will pick one such 
that they will get half the range of the most-loaded node.) If a node starts up 
without bootstrapping, it will mark itself bootstrapped so that you can't 
subsequently accidentally bootstrap a node with data on it.  (You can reset 
this by wiping your data and commitlog directories.)
  
  Default is: 'false', so that new clusters don't bootstrap immediately.  You 
should turn this on when you start adding new nodes to a cluster that already 
has data on it.
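
  As a rough illustration of "half the range of the most-loaded node", the 
sketch below computes the midpoint of a token range. It is not Cassandra's 
code: the function name is hypothetical, and the ring size of 2**127 is an 
assumption matching a RandomPartitioner-style token space.

```python
RING_SIZE = 2 ** 127  # assumption: RandomPartitioner-style token space


def midpoint_token(range_start, range_end, ring_size=RING_SIZE):
    """Return the token halfway through the ring range (range_start, range_end].

    Hypothetical helper for illustration only.
    """
    # The modulo handles ranges that wrap around past token zero.
    width = (range_end - range_start) % ring_size
    return (range_start + width // 2) % ring_size


# A new node taking this token would split the range (0, 100] in half.
print(midpoint_token(0, 100))
```

  A node given this token would take over roughly half of the range that the 
most-loaded node previously owned.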
  
   * '''cluster_name'''
+ 
  The name of this cluster.  This is mainly used to prevent machines in one 
logical cluster from joining another.
  
   * '''commitlog_directory and data_file_directories'''
+ 
  Be sure to separate your commitlog and data disks, as commitlog performance 
is reliant on its append-only nature, and seeking to random data at the same 
time will damage write speed.
  
  Defaults are: '/var/lib/cassandra/commitlog' and '/var/lib/cassandra/data'.
  
   * '''concurrent_reads''' and '''concurrent_writes''', '''commitlog_sync''' 
and '''commitlog_sync_period_in_ms'''
+ 
  Unlike most systems, in Cassandra writes are faster than reads, so you can 
afford more of those in parallel.  A good rule of thumb is 4 concurrent_reads 
per processor core.  Increase {{{concurrent_writes}}} to the number of clients 
writing at once if you use commitlog_sync.
  
  {{{CommitLogSync}}} may be either "periodic" or "batch."  When in batch mode, 
Cassandra won't ack writes until the commit log has been fsynced to disk.  It 
will wait up to {{{CommitLogSyncBatchWindowInMS}}} milliseconds for other 
writes, before performing the sync.
@@ -54, +61 @@

  Defaults are: '8' concurrent reads, '32' concurrent writes, 'periodic' sync, 
and '10000' ms between syncs.
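
  The rule of thumb of 4 concurrent_reads per processor core can be worked out 
with a few lines of Python. This is only a sketch: the function name is 
hypothetical, and the core count is taken from the machine running the script, 
which is assumed to be the Cassandra node itself.

```python
import os


def suggested_concurrent_reads(cores, reads_per_core=4):
    """Apply the '4 concurrent_reads per processor core' rule of thumb."""
    if cores < 1:
        raise ValueError("cores must be at least 1")
    return cores * reads_per_core


# os.cpu_count() can return None, so fall back to a single core.
cores = os.cpu_count() or 1
print("concurrent_reads:", suggested_concurrent_reads(cores))
```

  By this rule, the default of '8' corresponds to a two-core machine.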
  
   * '''disk_access_mode'''
- The options are: 'auto', 'mmap', 'mmap_index_only', and 'standard'.
+ 
- mmapped i/o is substantially faster, but only practical on a 64bit machine 
(which notably does not include EC2 "small" instances) or relatively small 
datasets.  "auto", the safe choice, will enable mmapping on a 64bit JVM.  Other 
values are "mmap", "mmap_index_only" (which may allow you to get part of the 
benefits of mmap on a 32bit machine by mmapping only index files) and 
"standard". (The buffer size settings that follow only apply to standard, 
non-mmapped i/o.)
+ The options are: 'auto', 'mmap', 'mmap_index_only', and 'standard'. mmapped 
i/o is substantially faster, but only practical on a 64bit machine (which 
notably does not include EC2 "small" instances) or relatively small datasets.  
"auto", the safe choice, will enable mmapping on a 64bit JVM.  Other values are 
"mmap", "mmap_index_only" (which may allow you to get part of the benefits of 
mmap on a 32bit machine by mmapping only index files) and "standard". (The 
buffer size settings that follow only apply to standard, non-mmapped i/o.)
  
  Default is: 'auto'.
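
  The way the four values map onto data and index files can be sketched as 
follows. This is not Cassandra's implementation: the function is hypothetical, 
and the fallback of 'auto' to 'standard' on a 32bit JVM is an assumption, since 
the text above only states what 'auto' does on a 64bit JVM.

```python
VALID_MODES = ("auto", "mmap", "mmap_index_only", "standard")


def resolve_access_mode(mode, is_64bit_jvm):
    """Return (data_file_mode, index_file_mode) for a disk_access_mode value.

    Hypothetical helper for illustration only.
    """
    if mode not in VALID_MODES:
        raise ValueError("unknown disk_access_mode: %r" % (mode,))
    if mode == "auto":
        # Assumption: 'auto' falls back to standard i/o on a 32bit JVM.
        mode = "mmap" if is_64bit_jvm else "standard"
    if mode == "mmap_index_only":
        # Only index files are mmapped; data files use standard i/o.
        return ("standard", "mmap")
    return (mode, mode)


print(resolve_access_mode("auto", is_64bit_jvm=True))
```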
  
   * '''dynamic_snitch''' and '''endpoint_snitch'''
+ 
  !EndPointSnitch: Set this to a class that implements 
{{{IEndPointSnitch}}}, which determines whether two endpoints are in the same 
data center or on the same rack. Out of the box, Cassandra provides 
{{{org.apache.cassandra.locator.RackInferringSnitch}}}.
  
  Note: this class will work on hosts' IPs only. There is no configuration 
parameter to tell Cassandra that a node is in rack ''R'' and in datacenter 
''D''. The current rules are based on the two methods:
@@ -73, +81 @@

  Defaults are: 'org.apache.cassandra.locator.SimpleSnitch' and 'false'.
  
   * '''listen_address'''
+ 
- Commenting out this property leaves it up to 
{{{InetAddress.getLocalHost()}}}. This will always do the Right Thing *if* the 
node is properly configured (hostname, name resolution, etc), and the Right 
Thing is to use the address associated with the hostname (it might not be).  
+ Commenting out this property leaves it up to 
{{{InetAddress.getLocalHost()}}}. This will always do the Right Thing *if* the 
node is properly configured (hostname, name resolution, etc), and the Right 
Thing is to use the address associated with the hostname (it might not be).
  
  Default is: 'localhost'. This must be changed for other nodes to contact this 
node.
  
   * '''memtable_flush_after_mins''', '''memtable_operations_in_millions''', 
and '''memtable_throughput_in_mb'''
+ 
  The maximum time to leave a dirty memtable unflushed. (While any affected 
columnfamilies have unflushed data from a commit log segment, that segment 
cannot be deleted.) This needs to be large enough that it won't cause a flush 
storm of all your memtables flushing at once because none has hit the size or 
count thresholds yet.  For production, a larger value such as 1440 is 
recommended.
  
  The maximum number of columns in millions to store in memory per ColumnFamily 
before flushing to disk.  This is also a per-memtable setting.  Use with 
{{{MemtableSizeInMB}}} to tune memory usage.
@@ -87, +97 @@

  Defaults are: '60' minutes, '0.3' millions, and '64' mb respectively.
  
   * '''partitioner'''
+ 
  Partitioner: any {{{IPartitioner}}} may be used, including your own as long 
as it is on the classpath.  Out of the box, Cassandra provides 
{{{org.apache.cassandra.dht.RandomPartitioner}}}, 
{{{org.apache.cassandra.dht.OrderPreservingPartitioner}}}, and 
{{{org.apache.cassandra.dht.CollatingOrderPreservingPartitioner}}}. 
(CollatingOPP collates according to EN,US rules, not naive byte ordering.  Use 
this as an example if you need locale-aware collation.) Range queries require 
using an order-preserving partitioner.
  
  Achtung!  Changing this parameter requires wiping your data directories, 
since the partitioner can modify the !sstable on-disk format.
@@ -102, +113 @@

  Default is: 'org.apache.cassandra.dht.RandomPartitioner'. Manually assigning 
tokens is highly recommended to guarantee even load distribution.
  
   * '''seeds'''
+ 
  Never use a node's own address as a seed if you are bootstrapping it by 
setting auto_bootstrap to true!
  
   * '''thrift_framed_transport_size_in_mb'''
+ 
  Set this to '0' to use unframed (buffered) transport.
  
  Default is: '15' mb.
  
  == per-Keyspace Settings ==
   * '''name'''
+ 
  Required field. Dashes are not allowed in the name.
+ 
   * '''replica_placement_strategy''' and '''replication_factor'''
+ 
  Strategy: Setting this to the class that implements 
{{{IReplicaPlacementStrategy}}} will change the way the node picker works. Out 
of the box, Cassandra provides 
{{{org.apache.cassandra.locator.RackUnawareStrategy}}} and 
{{{org.apache.cassandra.locator.RackAwareStrategy}}} (place one replica in a 
different datacenter, and the others on different racks in the same one.)
  
  Note that the replication factor (RF) is the ''total'' number of nodes onto 
which the data will be placed.  So, a replication factor of 1 means that only 1 
node will have the data.  It does '''not''' mean that one ''other'' node will 
have the data.
@@ -120, +136 @@

  Defaults are: 'org.apache.cassandra.locator.RackUnawareStrategy' and '1'. RF 
of at least 2 is highly recommended, keeping in mind that your effective number 
of nodes is (N total nodes / RF).
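
  The "effective number of nodes" arithmetic can be checked with a short 
sketch. The function name is hypothetical and is not part of Cassandra.

```python
def effective_nodes(total_nodes, replication_factor):
    """Return N total nodes / RF, the effective number of nodes."""
    if replication_factor < 1:
        raise ValueError("replication_factor must be at least 1")
    return total_nodes / replication_factor


# Six nodes with RF 2 hold as much distinct data as three nodes with RF 1.
print(effective_nodes(6, 2))
```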
  
  == per-ColumnFamily Settings ==
-   * '''comment''' and '''name'''
+  * '''comment''' and '''name'''
+ 
  You can describe a ColumnFamily in plain text by setting these properties.
  
-   * '''compare_with'''
+  * '''compare_with'''
+ 
  The {{{CompareWith}}} attribute tells Cassandra how to sort the columns for 
slicing operations.  The default is {{{BytesType}}}, which is a straightforward 
lexical comparison of the bytes in each column. Other options are 
{{{AsciiType}}}, {{{UTF8Type}}}, {{{LexicalUUIDType}}}, {{{TimeUUIDType}}}, and 
{{{LongType}}}.  You can also specify the fully-qualified class name of a class 
of your choice extending {{{org.apache.cassandra.db.marshal.AbstractType}}}.
  
   a. {{{SuperColumns}}} have a similar {{{CompareSubcolumnsWith}}} attribute.
@@ -134, +152 @@

   a. {{{LexicalUUIDType}}}: A 128bit UUID, compared lexically (by byte value)
   a. {{{TimeUUIDType}}}: a 128bit version 1 UUID, compared by timestamp
  
+  * '''gc_grace_seconds'''
  
-   * '''gc_grace_seconds'''
  Time to wait before garbage-collecting deletion markers.  Set this to a large 
enough value that you are confident that the deletion marker will be propagated 
to all replicas by the time this many seconds have elapsed, even in the face of 
hardware failures.  The default value is ten days.
  
  Default is: '864000' seconds, or 10 days.
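
  Since the setting is expressed in seconds, a small conversion helper can help 
when choosing a value. This is a sketch with a hypothetical function name, 
using only the Python standard library.

```python
from datetime import timedelta


def gc_grace_seconds(days):
    """Convert a grace period in days to a gc_grace_seconds value."""
    return int(timedelta(days=days).total_seconds())


# Ten days matches the default of 864000 seconds.
print(gc_grace_seconds(10))
```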
  
-   * '''keys_cached''' and '''rows_cached'''
+  * '''keys_cached''' and '''rows_cached'''
+ 
  Defaults are: '200000' keys cached, and '0', disabled row cache.
  
-   * '''preload_row_cache'''
+  * '''preload_row_cache'''
  
-   * '''read_repair_chance'''
+  * '''read_repair_chance'''
  
-   * '''default_validation_class'''
+  * '''default_validation_class'''
+ 
- Used in conjunction with the validation_class property in the per-column 
settings to guarantee the 
+ Used in conjunction with the validation_class property in the per-column 
settings to guarantee the
  
  Default is: 'BytesType', a no-op.
  
  == per-Column Settings ==
-   * '''validation_class'''
+  * '''validation_class'''
  
-   * '''index_type'''
+  * '''index_type'''
