Hi everyone,

I'm working on a project in which we need a distributed inverted index, and we are getting fair results using HBase and Hadoop (Crawlers -> Document Repository (HBase) --M/R-> Document Index (HBase) --M/R-> Inverted Index). However, we are also investigating more efficient ways to use this inverted index. After reading [1], we are wondering whether anyone has figured out a way to let an HBase cluster do document-based partitioning instead of term-based partitioning.
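To make the distinction concrete, here is a minimal in-memory sketch of the two layouts. It does not use the HBase API; a plain dict stands in for a table (row key -> columns). The shard count, the `NN|term` row-key format, and all function names are assumptions made up for illustration, not something HBase provides out of the box.

```python
import zlib
from collections import defaultdict

NUM_SHARDS = 4  # hypothetical number of document partitions


def shard_of(doc_id: str) -> int:
    # Stable hash, so a given document always maps to the same shard.
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def term_key(term: str, doc_id: str) -> str:
    # Term-based partitioning: one row per term holds all of its postings.
    return term


def doc_key(term: str, doc_id: str) -> str:
    # Document-based partitioning (assumed scheme): the shard prefix comes
    # first, so one term's postings are spread over up to NUM_SHARDS rows.
    return f"{shard_of(doc_id):02d}|{term}"


def build_index(docs, key_fn):
    # table maps row key -> {doc_id: term frequency}, mimicking row -> columns.
    table = defaultdict(dict)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            row = table[key_fn(term, doc_id)]
            row[doc_id] = row.get(doc_id, 0) + 1
    return table


def lookup_doc_partitioned(table, term):
    # The client fans out one lookup per shard and merges the partial postings.
    merged = {}
    for shard in range(NUM_SHARDS):
        merged.update(table.get(f"{shard:02d}|{term}", {}))
    return merged


docs = {
    "d1": "hbase hadoop hbase",
    "d2": "hadoop index",
    "d3": "hbase index",
}
term_table = build_index(docs, term_key)
doc_table = build_index(docs, doc_key)
```

With the term-based layout a query reads a single row per term; with the prefixed layout the client has to issue one lookup per shard and merge the results, which is the multi-region gather asked about below.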
Basically, the question boils down to: is there an easy way to distribute the columns of a row over multiple regions, and have a client (or HBase itself) scan those regions to gather the row and its columns? And if not, are people using HBase for search-system inverted indexes anyway, and how is it coping?

Greetings,
Menno Luiten

[1] B. Cambazoglu, et al. "Effects of Inverted Index Partitioning Schemes on Performance of Query Processing in Parallel Text Retrieval Systems"
