The "UseCases" page has been changed by FlipKromer.
The comment on this change is: Cassandra to deduplicate a massive dataset.
http://wiki.apache.org/cassandra/UseCases

--------------------------------------------------

New page:
= Cassandra Use Cases =

Here are several use cases and example implementations in high-level code.

== Uniq a large dataset using simple key-value columns ==

We have to batch-process a massive dataset that contains frequent duplicates, and we'd like to skip them. Here is Ruby code that uses Cassandra as a simple key-value store to skip duplicates. You can find a real working version in the [[http://github.com/mrflip/wukong/blob/master/examples/keystore/conditional_outputter_example.rb|Wukong example code]]; it is used to batch-process terabyte-scale data on a 30-machine cluster using Hadoop and Cassandra.

{{{
require 'cassandra'   # the cassandra gem

class CassandraConditionalOutputter
  CASSANDRA_KEYSPACE = 'Foo'

  # Batch parse a raw stream into parsed objects. The parsed objects may have
  # many duplicates, which we'd like to reject.
  #
  # Records respond to #key (only one record for a given key will be output)
  # and #timestamp (which can be, say, '0' if the record has no meaningful timestamp).
  def process raw_records
    raw_records.parse do |record|
      if should_emit?(record)
        track! record
        puts record
      end
    end
  end

  # Emit if the record's key isn't already in the key column.
  # Note: #blank? is not core Ruby; it comes from a library such as ActiveSupport or extlib.
  def should_emit? record
    key_cache.get(key_column, record.key).blank?
  end

  # Register the key in the key_cache
  def track! record
    key_cache.insert(key_column, record.key, 't' => record.timestamp)
  end

  # Nuke the key from the key_cache
  def remove record
    key_cache.remove(key_column, record.key)
  end

  # The Cassandra keyspace for key lookup
  def key_cache
    @key_cache ||= Cassandra.new(CASSANDRA_KEYSPACE)
  end

  # Name the key column after the class
  def key_column
    self.class.to_s + 'Keys'
  end
end
}}}
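
To try the check-then-track flow without a running cluster, the sketch below swaps Cassandra for a small in-memory store. `InMemoryKeyStore`, `ConditionalOutputter`, and `Record` are hypothetical names made up for this illustration; the store only imitates the `get`/`insert`/`remove` calls used above and is not the real client API. It also uses core Ruby's `empty?` in place of `blank?` so that it needs no extra libraries.

```ruby
# Hypothetical in-memory stand-in for the Cassandra client. It mimics only the
# get/insert/remove calls used in the example above; it is NOT the real gem API.
class InMemoryKeyStore
  def initialize
    @rows = Hash.new { |hash, column_family| hash[column_family] = {} }
  end

  # Returns the stored columns for key, or an empty Hash if the key is absent
  def get(column_family, key)
    @rows[column_family].fetch(key, {})
  end

  # Merges the given columns into the row stored under key
  def insert(column_family, key, columns)
    @rows[column_family][key] = get(column_family, key).merge(columns)
  end

  def remove(column_family, key)
    @rows[column_family].delete(key)
  end
end

# Illustrative record type: anything responding to #key and #timestamp works
Record = Struct.new(:key, :timestamp, :body)

class ConditionalOutputter
  attr_reader :key_cache

  def initialize(key_cache)
    @key_cache = key_cache
  end

  # Returns only the first record seen for each key, in input order
  def process(records)
    records.select do |record|
      next false unless should_emit?(record)
      track!(record)
      true
    end
  end

  # Emit if the record's key has not been tracked yet
  def should_emit?(record)
    key_cache.get(key_column, record.key).empty?
  end

  # Register the key so later records with the same key are skipped
  def track!(record)
    key_cache.insert(key_column, record.key, 't' => record.timestamp)
  end

  def key_column
    "#{self.class}Keys"
  end
end

records = [
  Record.new('user-1', 0, 'first sighting of user-1'),
  Record.new('user-2', 0, 'first sighting of user-2'),
  Record.new('user-1', 5, 'duplicate of user-1'),
]
outputter = ConditionalOutputter.new(InMemoryKeyStore.new)
unique = outputter.process(records)
puts unique.map(&:body)
```

One design point worth keeping in mind: the lookup and the insert are two separate calls, so two workers that see the same key at nearly the same moment could both emit it. Whether that matters depends on how strictly the job needs exactly one record per key.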
