Re: Inverted word index...

Ioannis Konstantinou Mon, 17 May 2010 10:22:47 -0700

Hi,

You can also read the following paper:
http://www.cslab.ntua.gr/~ikons/distributed_indexing_of_webscale_datasets_for_the_cloud_mdac_2010_cr.pdf
where we present an inverted index system based on HBase (both the index
and the content are served through HBase, and indexing is performed
through Hadoop MapReduce functions).


On 17/5/2010 6:44 PM, Jonathan Gray wrote:

Kevin,

You would want to make your row keys the words.
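
A minimal sketch of that layout, for illustration only. It is plain
Python rather than HBase client code, and the function name, document
IDs, and timestamps are all made up. Each word acts as the row key, and
the document IDs that contain it act as the columns of that row:

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each word (the row key) to {doc_id: timestamp} (the columns)."""
    index = defaultdict(dict)
    for doc_id, (timestamp, text) in docs.items():
        for word in text.lower().split():
            index[word][doc_id] = timestamp
    # Sort by word so rows appear in the key order that a
    # range-partitioned store would keep them in.
    return dict(sorted(index.items()))

# Hypothetical documents: doc_id -> (timestamp, text)
docs = {
    "doc1": (1274086000, "round table"),
    "doc2": (1274086100, "round trip"),
}
index = build_inverted_index(docs)
print(index["round"])
```

Looking up a word is then a single-row read by key, and all rows stay
sorted by word.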

HBase defines its tablets (called Regions) by their startRow and endRow.
So, as you say, a given region may contain "ro to ru".  Looking up the
word "round" would use that region.  This is handled automatically by
the META table.
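
To make the idea concrete, here is a rough sketch of how a sorted list
of region start keys can be searched to find the region that owns a row
key. This is not how the HBase client is implemented; the start keys and
region names are invented, and real HBase compares row keys as bytes
rather than Python strings.

```python
import bisect

# Hypothetical region boundaries: each region covers
# [start_key, next_start_key). The empty start key marks the first region.
REGION_START_KEYS = ["", "ca", "ha", "ro", "ru"]
REGION_NAMES = ["region-0", "region-1", "region-2", "region-3", "region-4"]

def find_region(row_key):
    """Return the name of the region whose key range contains row_key."""
    # bisect_right returns the index of the first start key greater than
    # row_key; the region just before that index owns the key.
    idx = bisect.bisect_right(REGION_START_KEYS, row_key) - 1
    return REGION_NAMES[idx]

print(find_region("round"))  # prints "region-3", the "ro" to "ru" region
```

Because the boundaries are sorted, one binary search is enough; there is
no need to probe every region or consult per-region Bloom filters to
locate a word.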

For a refresher on these concepts, check out the BigTable paper.  There have 
also been some discussions about inverted word indexes on this mailing list 
though I don't have links.

JG

-----Original Message-----
From: Kevin Apte [mailto:technicalarchitect2...@gmail.com]
Sent: Monday, May 17, 2010 1:07 AM
To: hbase-user@hadoop.apache.org
Subject: Inverted word index...

Consider a search system with an inverted word index, in other words an
index that points to document locations, with these columns: word,
document ID, and possibly timestamp.

Given a word, how will I know which tablet to scan to find all document
IDs containing that word?

If you are indexing a large database, say 50 TB, then each word may be
split across multiple tablets. There may be hundreds of such tablets,
each with a large number of SSTables to store the index. How will I know
which tablet to search? Is there a master index that specifies which
tablet has words in the range, say, "ro to ru"? Or do I have to look up
Bloom filters for every tablet?

Kevin


--
Ioannis Konstantinou
Research Associate, Computing Systems Laboratory
National Technical University of Athens
phone: +30 2107721544(internal 421)
mobile: +30 6945992906
Web: http://www.cslab.ntua.gr/~ikons
