Hi,

You can also read the following paper, in which we present an inverted index system based on HBase (both the index and the content are served through HBase, and indexing is performed through Hadoop MapReduce functions):
http://www.cslab.ntua.gr/~ikons/distributed_indexing_of_webscale_datasets_for_the_cloud_mdac_2010_cr.pdf

On 17/5/2010 6:44 PM, Jonathan Gray wrote:
Kevin,

You would want to make your row keys the words.

HBase defines its tablets (called Regions) by a startRow and an endRow.  So,
as you say, a given region may contain "ro to ru".  Looking up the word
"round" would use that region.  This is handled automatically through the
META table, which maps row-key ranges to regions.
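
To make the lookup concrete, here is a minimal sketch of how a row key maps to a region by its start key. It is not the HBase client code; the region names and the REGIONS list are hypothetical, and plain Python string comparison stands in for HBase's byte-wise key ordering (the two agree for ASCII words). Each region is assumed to cover the range from its start key up to, but not including, the next region's start key.

```python
from bisect import bisect_right

# Hypothetical region table, sorted by start key: (start_row, region_name).
# An empty start key stands for "beginning of the table".
REGIONS = [
    ("", "region-1"),
    ("ro", "region-2"),
    ("ru", "region-3"),
]

START_KEYS = [start for start, _ in REGIONS]


def find_region(row_key: str) -> str:
    """Return the region whose [start_row, next start_row) range holds row_key."""
    # bisect_right finds the first start key greater than row_key;
    # the region just before that position is the one that owns the key.
    idx = bisect_right(START_KEYS, row_key) - 1
    return REGIONS[idx][1]


if __name__ == "__main__":
    for word in ("apple", "round", "rust"):
        print(word, "->", find_region(word))
```

With words as row keys, a lookup for "round" needs only this one range search to pick the region, so there is no need to consult every region or its Bloom filters.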

For a refresher on these concepts, check out the BigTable paper.  There have 
also been some discussions about inverted word indexes on this mailing list 
though I don't have links.

JG

-----Original Message-----
From: Kevin Apte [mailto:technicalarchitect2...@gmail.com]
Sent: Monday, May 17, 2010 1:07 AM
To: hbase-user@hadoop.apache.org
Subject: Inverted word index...

Consider a search system with an inverted word index, in other words, an
index which points to document locations, with these columns: word,
document ID and possibly timestamp.

Given a word, how will I know which tablet to scan to find all document
IDs with the given word?

If you are indexing a large database, say 50 TB, then each word may be
split across multiple tablets. There may be hundreds of such tablets, each
with a large number of SSTables to store the index. How will I know which
tablet to search? Is there a master index that specifies which tablet
has words in the range, say, "ro to ru"? Or do I have to look up Bloom
filters for every tablet?

Kevin

--
Ioannis Konstantinou
Research Associate, Computing Systems Laboratory
National Technical University of Athens
phone: +30 2107721544(internal 421)
mobile: +30 6945992906
Web: http://www.cslab.ntua.gr/~ikons
