The "UseCases" page has been changed by FlipKromer.
The comment on this change is: Cassandra to deduplicate a massive dataset.
http://wiki.apache.org/cassandra/UseCases

--------------------------------------------------

New page:
= Cassandra Use Cases =

Here are several use cases and example implementations in high-level code.


== Uniq a large dataset using simple key-value columns ==

We have to batch-process a massive dataset with frequent duplicates that we'd 
like to skip.

Here is Ruby code that uses Cassandra as a simple key-value store to skip
duplicates. You can find a real working version in the
[[http://github.com/mrflip/wukong/blob/master/examples/keystore/conditional_outputter_example.rb|Wukong example code]]
-- it's used to batch process terabyte-scale data on a 30-machine cluster
using Hadoop and Cassandra.

{{{
    require 'cassandra'   # Ruby client gem that provides Cassandra.new

    class CassandraConditionalOutputter
      CASSANDRA_KEYSPACE = 'Foo'
    
      # Batch parse a raw stream into parsed objects. The parsed objects may
      # have many duplicates, which we'd like to reject.
      #
      # Records respond to #key (only one record for the given key will be
      # output) and #timestamp (which can be, say, '0' if the record has no
      # meaningful timestamp).
      def process raw_records
        raw_records.parse do |record|
          if should_emit?(record)
            track! record
            puts   record
          end
        end
      end
    
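      # Emit if record's key isn't already in the key column.
      # A missing row may come back as nil or as an empty hash, so check both
      # without relying on ActiveSupport's #blank?.
      def should_emit? record
        existing = key_cache.get(key_column, record.key)
        existing.nil? || existing.empty?
      end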
    
      # register key in the key_cache
      def track! record
        key_cache.insert(key_column, record.key, 't' => record.timestamp)
      end
    
      # nuke key from the key_cache
      def remove record
        key_cache.remove(key_column, record.key)
      end
    
      # The Cassandra keyspace for key lookup
      def key_cache
        @key_cache ||= Cassandra.new(CASSANDRA_KEYSPACE)
      end
    
      # Name the column family that holds the keys after this class
      def key_column
        self.class.to_s+'Keys'
      end
    end
}}}
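
To try the check-then-track logic without a running cluster, here is a minimal sketch that swaps the Cassandra client for an in-memory stand-in. `FakeKeyStore`, `Record`, `uniq_records` and the `'ExampleKeys'` column family name are hypothetical names made up for this illustration; they are not part of Wukong or the Cassandra gem. The stand-in mimics only the `get`, `insert` and `remove` calls used above.

```ruby
# In-memory stand-in for the Cassandra client (hypothetical, illustration only).
# It mimics just the three calls the outputter uses: get, insert and remove.
class FakeKeyStore
  def initialize
    @rows = Hash.new { |hash, column_family| hash[column_family] = {} }
  end

  # Returns the stored columns for the key, or an empty hash if absent
  def get(column_family, key)
    @rows[column_family].fetch(key, {})
  end

  def insert(column_family, key, columns)
    @rows[column_family][key] = columns
  end

  def remove(column_family, key)
    @rows[column_family].delete(key)
  end
end

# Hypothetical record type: anything responding to #key and #timestamp works
Record = Struct.new(:key, :timestamp)

# Same check-then-track logic as the outputter, but returns the kept records
# instead of printing them
def uniq_records(records, store, column_family = 'ExampleKeys')
  records.select do |record|
    seen = !store.get(column_family, record.key).empty?
    store.insert(column_family, record.key, 't' => record.timestamp) unless seen
    !seen
  end
end

records = [Record.new('a', '1'), Record.new('b', '2'), Record.new('a', '3')]
store   = FakeKeyStore.new
kept    = uniq_records(records, store)
puts kept.map(&:key).inspect
```

Only the first record seen for each key is kept. Because the store remembers keys between calls, feeding the same records through a second time emits nothing, which is the behaviour wanted when a batch job is re-run over overlapping input.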
