The key point is that we can't simply put the time frame as a prefix before
the keyword.
For high-volume data the time frame is normally small, so if the query's time
range is large, the HBase scan range is large too, and the scan has to skip all
the other keywords in that range. Scan performance is then poor.
So I suggest using "coarse granularity time frame + keyword + fine
granularity time frame". With this schema, you can call HBase several times,
once per coarse granularity time frame, each with a small scan range over the
fine granularity time frame.
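
A rough sketch of how those per-bucket scans could be planned, in Python for
brevity. It only builds the (start row, stop row) pairs and does not call
HBase. The "|" separator, the daily coarse bucket, the HHmm fine part and the
function name scan_ranges are my assumptions for illustration, not anything
fixed in the design:

```python
from datetime import datetime, timedelta

SEP = "|"  # hypothetical field separator; assumed not to occur in keywords


def scan_ranges(keyword, start, end):
    """Return one (start_row, stop_row) pair per coarse (daily) bucket.

    Assumed rowkey layout: yyyyMMdd | keyword | HHmm.
    The stop row is exclusive, as in an HBase scan.
    start and end are assumed to be aligned to whole minutes.
    """
    ranges = []
    day = start.replace(hour=0, minute=0, second=0, microsecond=0)
    while day < end:
        next_day = day + timedelta(days=1)
        lo = max(start, day)      # clip the query range to this bucket
        hi = min(end, next_day)
        prefix = day.strftime("%Y%m%d") + SEP + keyword + SEP
        start_row = prefix + lo.strftime("%H%M")
        # "2400" sorts after every valid HHmm, so it closes a full-day bucket
        stop_row = prefix + ("2400" if hi == next_day else hi.strftime("%H%M"))
        ranges.append((start_row, stop_row))
        day = next_day
    return ranges


if __name__ == "__main__":
    for r in scan_ranges("abc", datetime(2015, 2, 15, 11, 30),
                         datetime(2015, 2, 17, 9, 0)):
        print(r)
```

Each pair covers only one keyword inside one coarse bucket, so no scan has to
step over rows of other keywords.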

------------------ Original Message ------------------
From: Li Yang <[email protected]>
Sent: 2015-03-03 15:17
To: dev <[email protected]>
Subject: Re: hbase rowkey design of inverted index



Agreed on the scan pattern of the inverted index: it is typically a keyword
scan within a time range. In addition, data should be sharded so that parallel
scans on multiple regions can cut down response time. The final rowkey may
look like "shard + time frame + keyword".
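
To make the layout concrete, here is a minimal sketch in Python that only
builds rowkeys and does not touch HBase. How a record is assigned to a shard
is not decided above; hashing a record id with CRC32, the shard count of 4,
the "|" separator and all function names are assumptions for illustration:

```python
import zlib

NUM_SHARDS = 4  # hypothetical shard count
SEP = "|"       # hypothetical field separator


def shard_of(record_id):
    """Stable shard number derived from a record identifier (assumed scheme)."""
    return zlib.crc32(record_id.encode("utf-8")) % NUM_SHARDS


def rowkey(shard, time_frame, keyword):
    """Build a 'shard + time frame + keyword' rowkey.

    The shard is zero-padded to a fixed width so that keys sort by shard first.
    """
    return f"{shard:02d}{SEP}{time_frame}{SEP}{keyword}"


def keys_per_shard(keyword, time_frames):
    """For one keyword, list the rowkeys to read in each shard.

    Each shard's list is independent of the others, so the shards can be
    read in parallel.
    """
    return {
        shard: [rowkey(shard, tf, keyword) for tf in time_frames]
        for shard in range(NUM_SHARDS)
    }


if __name__ == "__main__":
    print(rowkey(shard_of("doc-1"), "20150215", "abc"))
    for shard, keys in keys_per_shard("abc", ["20150215", "20150216"]).items():
        print(shard, keys)
```

Writes pick one shard per record, while a query fans out over all shards and
merges the results.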

These ideas will be put into the next version of the inverted index storage.

On Sun, Mar 1, 2015 at 9:23 PM, ???? <[email protected]> wrote:

> Basically, we have 2 ways to design the HBase rowkey for the inverted index:
> 1. "time + keyword":
> It splits the index by time, which can avoid HBase region merges. But one
> query may scan lots of scattered rows that are not sequential.
> 2. "keyword + time":
> It guarantees a sequential scan per keyword. But it may trigger
> HBase region merges, since one keyword may be scattered across many regions.
>
>
> So, we can merge these 2 solutions like this: "coarse granularity time +
> keyword + fine granularity time". For example, "20150215 + abc + 1130". In
> this way, we use the "coarse granularity time" to avoid HBase region merges
> and the "fine granularity time" to guarantee the sequential scan.
>
>
> Users can define different "coarse granularity time" & "fine granularity
> time" values for different cases. If the inverted index is only used in the
> real-time case, we can define a small "coarse granularity time" (e.g. 1 day).
> If the inverted index will cover the full data set, we can define a big
> "coarse granularity time" (e.g. 1 month).
>
>
> Thanks
> Jiang Xu
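
For reference, a small Python sketch of how the two basic layouts and the
merged one order the same rows. It sorts plain strings, which matches HBase's
byte-wise rowkey order for ASCII keys. The sample events, the "|" separator
and the function names are made up for illustration:

```python
# (keyword, coarse time yyyyMMdd, fine time HHmm); sample data only
EVENTS = [
    ("abc", "20150215", "1130"),
    ("xyz", "20150215", "1131"),
    ("abc", "20150215", "1132"),
    ("abc", "20150216", "0900"),
]


def time_keyword(kw, day, hhmm):
    """Layout 1: time + keyword."""
    return day + hhmm + "|" + kw


def keyword_time(kw, day, hhmm):
    """Layout 2: keyword + time."""
    return kw + "|" + day + hhmm


def combined(kw, day, hhmm):
    """Merged layout: coarse time + keyword + fine time."""
    return day + "|" + kw + "|" + hhmm


def sorted_keys(layout):
    """Rowkeys of all sample events in the order HBase would store them."""
    return sorted(layout(*event) for event in EVENTS)


if __name__ == "__main__":
    for layout in (time_keyword, keyword_time, combined):
        print(layout.__name__, sorted_keys(layout))
```

With time_keyword the "xyz" row sits between two "abc" rows; with combined,
the "abc" rows of one day are adjacent and each day starts a new key range.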
