Agreed on the scan pattern of the inverted index: it is typically a keyword
scan within a time range. In addition, the data should be sharded so that
parallel scans across multiple regions can cut down response time. The final
rowkey may look like "shard + time frame + keyword".

These ideas will go into the next version of the inverted index storage.
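
To make the "shard + time frame + keyword" layout concrete, here is a rough
sketch in Python. It is only an illustration, not the planned implementation:
the "|" separator, the two-digit shard prefix, and the function names
make_rowkey and scan_prefixes are all assumptions, and how a row gets its
shard at write time is left open.

```python
from typing import List

SEP = "|"  # assumed separator; keywords must not contain it


def make_rowkey(shard: int, time_frame: str, keyword: str) -> bytes:
    """Build "shard + time frame + keyword" as a byte string."""
    # Zero-padded shard and fixed-width time frame keep byte order meaningful.
    return f"{shard:02d}{SEP}{time_frame}{SEP}{keyword}{SEP}".encode("utf-8")


def scan_prefixes(keyword: str, time_frames: List[str], num_shards: int) -> List[bytes]:
    """One row prefix per (shard, time frame); each can be scanned in parallel."""
    return [
        make_rowkey(shard, frame, keyword)
        for shard in range(num_shards)
        for frame in time_frames
    ]


if __name__ == "__main__":
    for prefix in scan_prefixes("abc", ["20150215", "20150216"], num_shards=2):
        print(prefix)
```

A keyword query over a time range then becomes one prefix scan per shard and
time frame, so the number of scans grows with both.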

On Sun, Mar 1, 2015 at 9:23 PM, 蒋旭 <[email protected]> wrote:

> Basically, we have two ways to design the HBase rowkey for the inverted
> index:
> 1. "time + keyword":
> It splits the index by time, which avoids HBase region merges. But one
> query may scan many scattered rows that are not sequential.
> 2. "keyword + time":
> It guarantees a sequential scan per keyword. But it may trigger HBase
> region merges, since one keyword may be scattered across many regions.
>
>
> So we can combine these two solutions as follows: "coarse granularity time
> + keyword + fine granularity time". For example, "20150215 + abc + 1130".
> This way, the "coarse granularity time" avoids HBase region merges and the
> "fine granularity time" guarantees the sequential scan.
>
>
> Users can define different "coarse granularity time" and "fine granularity
> time" values for different cases. If the inverted index is only used for
> the real-time case, we can define a small "coarse granularity time" (e.g.
> 1 day). If the inverted index will cover the full data set, we can define
> a big "coarse granularity time" (e.g. 1 month).
>
>
> Thanks
> Jiang Xu
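
For reference, a small Python sketch of the "coarse granularity time +
keyword + fine granularity time" key quoted above. The "|" separator, the
strftime formats, and the helper names are assumptions for illustration
only; the thread does not fix an encoding.

```python
from datetime import datetime
from typing import List

SEP = "|"  # assumed separator; keywords must not contain it


def make_rowkey(event_time: datetime, keyword: str,
                coarse_fmt: str = "%Y%m%d", fine_fmt: str = "%H%M") -> str:
    """Build "coarse time + keyword + fine time" for one index entry."""
    coarse = event_time.strftime(coarse_fmt)  # e.g. the day: 20150215
    fine = event_time.strftime(fine_fmt)      # e.g. hour and minute: 1130
    return f"{coarse}{SEP}{keyword}{SEP}{fine}"


def rows_for(sorted_keys: List[str], coarse: str, keyword: str) -> List[str]:
    """Rows of one keyword inside one coarse bucket (a single prefix range)."""
    prefix = f"{coarse}{SEP}{keyword}{SEP}"
    return [key for key in sorted_keys if key.startswith(prefix)]


if __name__ == "__main__":
    keys = sorted([
        make_rowkey(datetime(2015, 2, 15, 11, 30), "abc"),
        make_rowkey(datetime(2015, 2, 15, 9, 5), "abc"),
        make_rowkey(datetime(2015, 2, 15, 10, 0), "xyz"),
        make_rowkey(datetime(2015, 2, 16, 8, 0), "abc"),
    ])
    # Within one day, the rows for "abc" sit together, ordered by fine time.
    print(rows_for(keys, "20150215", "abc"))
```

For a monthly coarse bucket, one could pass coarse_fmt="%Y%m" and
fine_fmt="%d%H%M" instead, so the fine part still orders rows within the
bucket.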
