Hi,
You can also read the following paper:
http://www.cslab.ntua.gr/~ikons/distributed_indexing_of_webscale_datasets_for_the_cloud_mdac_2010_cr.pdf
where we present an inverted index system based on HBase (both the index
and the content are served through HBase, and indexing is performed
through Hadoop MapReduce functions).
On 17/5/2010 6:44 PM, Jonathan Gray wrote:
Kevin,
You would want to make your row keys the words.
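As a rough illustration of that layout (a sketch in plain Python, not HBase client code; the column family name "docs", the helper build_index and the sample documents are all made up for this example), each word becomes a row key and each document ID becomes a column qualifier in that row:

```python
from collections import defaultdict

def build_index(documents):
    """Return {word: {"docs:<doc_id>": timestamp}}, i.e. one row per word."""
    rows = defaultdict(dict)
    for doc_id, (text, timestamp) in documents.items():
        # Each distinct word in the document adds one cell to that word's row.
        for word in set(text.lower().split()):
            rows[word][f"docs:{doc_id}"] = timestamp
    return dict(rows)

# Hypothetical sample data: doc ID -> (text, timestamp).
documents = {
    "doc1": ("the round table", 1274054400),
    "doc2": ("a round of golf", 1274058000),
}
index = build_index(documents)
print(sorted(index["round"]))
```

With this shape, finding every document that contains a word is a single-row read on that word's key.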
HBase defines its tablets (called Regions) by a startRow and an endRow. So, as you say, a given
region may contain "ro" to "ru". Looking up the word "round" would use that
region. This is handled automatically by the META table.
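The lookup can be pictured with a minimal sketch, assuming a sorted list of region start keys (the key boundaries and region names below are invented for illustration; a real client resolves this through the META table rather than a local list). A region covers the keys from its start row (inclusive) up to the next region's start row (exclusive), so a binary search over the start keys finds the region:

```python
from bisect import bisect_right

# Hypothetical region start keys, sorted; the first region starts at the empty key.
REGION_START_KEYS = ["", "f", "m", "ro", "ru"]
REGION_NAMES = ["region-0", "region-1", "region-2", "region-3", "region-4"]

def find_region(row_key):
    """Return the region whose [start, next start) range contains row_key."""
    # bisect_right gives the index of the first start key greater than row_key;
    # the region just before that index is the one that holds the key.
    idx = bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[idx]

print(find_region("round"))  # falls in the "ro" to "ru" region
```

No per-region Bloom filter check is needed to pick the region; the key ranges alone decide it.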
For a refresher on these concepts, check out the BigTable paper. There have
also been some discussions about inverted word indexes on this mailing list
though I don't have links.
JG
-----Original Message-----
From: Kevin Apte [mailto:technicalarchitect2...@gmail.com]
Sent: Monday, May 17, 2010 1:07 AM
To: hbase-user@hadoop.apache.org
Subject: Inverted word index...
Consider a search system with an inverted word index, in other words an
index which points to document locations, with these columns: word, document
ID and possibly timestamp.
Given a word, how will I know which tablet to scan to find all document IDs
with the given word?
If you are indexing a large database, say 50 TB, then each word may be
split across multiple tablets. There may be hundreds of such tablets, each
with a large number of SSTables to store the index. How will I know which
tablet to search? Is there a master index that specifies which tablet
has words in the range, say, "ro to ru"? Or do I have to look up Bloom
filters for every tablet?
Kevin
--
Ioannis Konstantinou
Research Associate, Computing Systems Laboratory
National Technical University of Athens
phone: +30 2107721544(internal 421)
mobile: +30 6945992906
Web: http://www.cslab.ntua.gr/~ikons