Agreed on the scan pattern of the inverted index: it is typically a keyword scan within a time range. In addition, the data should be sharded so that parallel scans across multiple regions can cut response time. The final rowkey may look like "shard + time frame + keyword".
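To make the "shard + time frame + keyword" layout concrete, here is a rough sketch in plain Python (no HBase client involved). The shard count, the NUL separator, the two-digit shard prefix, and the idea of deriving the shard from a document id are all assumptions for illustration, not something already decided in this thread.

```python
import zlib

NUM_SHARDS = 4   # hypothetical shard count
SEP = b"\x00"    # assumed separator; keywords must not contain NUL


def shard_of(doc_id: str) -> int:
    """Pick a shard from a stable hash of a (hypothetical) document id."""
    return zlib.crc32(doc_id.encode("utf-8")) % NUM_SHARDS


def make_rowkey(shard: int, time_frame: str, keyword: str) -> bytes:
    """Build "shard + time frame + keyword" as a byte string."""
    return SEP.join([
        b"%02d" % shard,              # fixed-width shard prefix
        time_frame.encode("utf-8"),   # e.g. "20150301"
        keyword.encode("utf-8"),
    ])


def keys_for_query(time_frames, keyword):
    """One rowkey per (shard, time frame) for a keyword query.

    Keys with different shard prefixes land in different key ranges,
    so each shard's subset can be read in parallel.
    """
    return [
        make_rowkey(shard, frame, keyword)
        for shard in range(NUM_SHARDS)
        for frame in time_frames
    ]


if __name__ == "__main__":
    for key in keys_for_query(["20150301", "20150302"], "abc"):
        print(key)
```

Note that with the time frame ahead of the keyword, a query over several time frames touches one key per frame in every shard, so the fan-out grows with both the shard count and the width of the time range.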
These ideas will go into the next version of the inverted index storage.

On Sun, Mar 1, 2015 at 9:23 PM, 蒋旭 <[email protected]> wrote:

> Basically, we have 2 ways to design the HBase rowkey for the inverted index:
> 1. "time + keyword":
> It splits the index by time, which can avoid HBase region merges. But one
> query may scan lots of scattered rows that are not sequential.
> 2. "keyword + time":
> It guarantees a sequential scan per keyword. But it may trigger
> HBase region merges, since one keyword may be scattered across many regions.
>
>
> So, we can merge these 2 solutions like this: "coarse granularity time +
> keyword + fine granularity time". For example, "20150215 + abc + 1130". In
> this way, we use the "coarse granularity time" to avoid HBase region merges and
> the "fine granularity time" to guarantee the sequential scan.
>
>
> Users can define a different "coarse granularity time" and "fine granularity
> time" for different cases. If the inverted index is only used in the real-time
> case, we can define a small "coarse granularity time" (e.g. 1 day). If the
> inverted index will cover the full data set, we can define a big "coarse
> granularity time" (e.g. 1 month).
>
>
> Thanks
> Jiang Xu
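
For reference, the quoted "coarse granularity time + keyword + fine granularity time" layout could be sketched as below, again in plain Python without any HBase client. The NUL separator, the `strftime` formats, and the function names are assumptions for illustration; only the "20150215 + abc + 1130" example comes from the message above.

```python
from datetime import datetime

SEP = b"\x00"  # assumed separator; keywords must not contain NUL


def make_rowkey(ts, keyword, coarse_fmt="%Y%m%d", fine_fmt="%H%M"):
    """Build "coarse time + keyword + fine time" as a byte string.

    The defaults give a 1-day coarse bucket with minute-level fine time.
    For a 1-month bucket, pass coarse_fmt="%Y%m" and fine_fmt="%d%H%M".
    """
    parts = (ts.strftime(coarse_fmt), keyword, ts.strftime(fine_fmt))
    return SEP.join(p.encode("utf-8") for p in parts)


def scan_range(coarse, keyword, fine_start, fine_stop):
    """Start (inclusive) and stop (exclusive) rows for one keyword
    inside one coarse bucket, bounded by fine-grained times."""
    prefix = SEP.join([coarse.encode("utf-8"), keyword.encode("utf-8")])
    return (prefix + SEP + fine_start.encode("utf-8"),
            prefix + SEP + fine_stop.encode("utf-8"))


if __name__ == "__main__":
    # Rows for the same keyword sort next to each other within a day,
    # ordered by fine-grained time.
    keys = sorted(
        make_rowkey(datetime(2015, 2, 15, hour, minute), kw)
        for kw in ("abd", "abc")
        for hour, minute in ((11, 30), (9, 5))
    )
    for key in keys:
        print(key)
    print(scan_range("20150215", "abc", "0900", "1200"))
```

Because the fine-grained time is the last key component, a keyword's rows inside one coarse bucket form a single contiguous range, which is what keeps the scan sequential.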
