[Hadoop Wiki] Update of "Hbase/SecondaryIndexing" by jgray

Apache Wiki Mon, 28 Feb 2011 11:17:37 -0800

Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Hadoop Wiki" for change 
notification.


The "Hbase/SecondaryIndexing" page has been changed by jgray.
http://wiki.apache.org/hadoop/Hbase/SecondaryIndexing?action=diff&rev1=1&rev2=2

--------------------------------------------------

  
  == Eventually Consistent Secondary Indexes using Coprocessors ==
  
- The basic idea is to use an additional (secondary) table for each index on 
the main (primary) table.  A coprocessor binding to a family would be used to 
define a given secondary index on that family (or specific column(s) within it).
+ The basic idea is to use an additional (secondary) table for each index on 
the main (primary) table.  A coprocessor binding to a family would be used to 
define a given secondary index on that family (or specific column(s) within 
it).  The WAL would be used to ensure durability and a shared queue makes the 
secondary update async from the callers POV.  Normal HBase timestamps would be 
used for any conflict resolution and to make operations idempotent.
  
- When a Put comes in to the primary table, the following would happen:
+ When a Put comes in to the primary table, the following would happen 
(assuming a single index update to a single secondary table for now):
  
  1. Generate WALEdit for primary table
+ 
  2. Generate a new, special kind of WALEdit for secondary table update
- 3. 
+ 
+ 3. Append and sync both WALEdits to the HLog
+ 
+ 4. Apply primary table edits to MemStore and commit RWCC
+ 
+ 5. Add secondary table edit job to shared work queue
+ 
+ 6. Return to client
  
  
+ The shared queue would be a thread or threadpool that picks up these 
secondary table edit jobs and applies them using a normal Put operation to the 
secondary table.
  
- Open questions:
+ On failover of primary table, primary edits would be replayed normally, and 
secondary edits would be applied to the secondary table/server as is done with 
the shared queue.
  
- * How to deal with creation of secondary tables
- * 
+ The big open questions are around how to deal with the WAL and replay.
+ 
+ The secondary table could be offline because of another RS failure, so we may 
have long-waiting secondary updates.  How can we guarantee all secondary 
updates are applied when evicting an old HLog?  Ideally we want to avoid 
over-exploiting operations being idempotent and not just aggressively 
reapplying everything.  Do we need to keep track of each HLog and it's pending 
secondary updates and prevent log eviction until they are done?
+ 
+ Should the workers applying secondary edits write back into the WAL that the 
edit is complete (and thus durable on the other server so does not need to be 
replayed if this one fails over)?  Or we could tie secondary edits to each 
memstore, and the flushing of a memstore can only happen if its secondary edits 
have all been applied, which would tie in with the existing semantics around 
log eviction... but that has other implications.
+ 
+ 
+ Other open questions:
+ 
+ * Creation of secondary tables (auto-bootstrapped?  part of coprocessor init? 
 manual?)
  
  
  Future work:
  
  * Declaration of indexes via API or shell syntax rather than programatically 
with a coprocessor-per-index
  * Creation of indexes on existing tables (build of indexes based on current 
data and kept up to date)
+ * Option to apply secondary update in a synchronous fashion (if you want to 
take performance hit and have stronger consistency of the index)
  
  
  == Secondary Indexes using Optimistic Concurrency Control ==

[Hadoop Wiki] Update of "Hbase/SecondaryIndexing" by jgray

Reply via email to