The key point is that we can't simply put the time frame as a prefix before the keyword. For high-volume data the time frame is normally small, so when the query's time range is wide the HBase scan range becomes very large, and the scan has to skip over rows belonging to other keywords inside that range; scan performance suffers. So I suggest "coarse-granularity time frame + keyword + fine-granularity time frame". With this schema, you call HBase several times, once per coarse-granularity time frame, and each call scans only a small range over the fine-granularity time frame.
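The multi-scan strategy above can be sketched as follows. This is a minimal illustration, not Kylin code: it assumes a rowkey of `coarse_time (yyyyMMdd) + keyword + fine_time (HHmm)` with daily coarse buckets, and the function `scan_ranges` is a hypothetical helper that emits one `(start_key, stop_key)` pair per coarse bucket, so each HBase scan stays within a single keyword's contiguous rows.

```python
# Sketch (assumed rowkey layout, not the actual Kylin implementation):
# one HBase scan range per coarse-granularity (daily) bucket.
from datetime import date, timedelta

def scan_ranges(keyword, start, stop):
    """Yield (start_key, stop_key) pairs, one per daily coarse bucket.

    `start` / `stop` are (date, "HHmm") tuples; the fine-time bounds
    only constrain the first and last bucket.
    """
    day = start[0]
    while day <= stop[0]:
        coarse = day.strftime("%Y%m%d")
        fine_lo = start[1] if day == start[0] else "0000"
        fine_hi = stop[1] if day == stop[0] else "2400"  # exclusive upper bound
        yield (coarse + keyword + fine_lo, coarse + keyword + fine_hi)
        day += timedelta(days=1)

# Query keyword "abc" from 2015-02-15 11:30 to 2015-02-16 09:00:
# two scans, each small and sequential within one coarse bucket.
ranges = list(scan_ranges("abc", (date(2015, 2, 15), "1130"),
                                 (date(2015, 2, 16), "0900")))
```

Each pair would map to one HBase `Scan` with a start and stop row, and the scans can be issued in parallel since they touch disjoint key ranges.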
------------------ Original message ------------------
From: Li Yang <[email protected]>
Date: 2015-03-03 15:17
To: dev <[email protected]>
Subject: Re: hbase rowkey design of inverted index

Agree on the scan pattern of inverted index, that it is typically a keyword scan within a time range. In addition, data shall be sharded so parallel scans on multiple regions can cut down response time. The final rowkey may look like "shard + time frame + keyword". These ideas will be put into the next version of inverted index storage.

On Sun, Mar 1, 2015 at 9:23 PM, Jiang Xu <[email protected]> wrote:
> Basically, we have 2 ways to design the hbase rowkey for an inverted index:
> 1. "time + keyword":
>    It splits the index by time, which avoids hbase region merge. But one
>    query may scan lots of scattered rows that are not sequential.
> 2. "keyword + time":
>    It guarantees a sequential scan per keyword. But it may trigger hbase
>    region merge, since one keyword may be scattered across many regions.
>
> So, we can merge these 2 solutions as: "coarse granularity time +
> keyword + fine granularity time". For example, "20150215 + abc + 1130". In
> this way, we use the "coarse granularity time" to avoid hbase region merge
> and the "fine granularity time" to guarantee the sequential scan.
>
> Users can define a different "coarse granularity time" and "fine granularity
> time" for different cases. If the inverted index is only used in the real-time
> case, we can define a small "coarse granularity time" (e.g. 1 day). If the
> inverted index will cover the full data set, we can define a big "coarse
> granularity time" (e.g. 1 month).
>
> Thanks
> Jiang Xu
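Combining Li Yang's shard prefix with Jiang Xu's coarse/fine split, a rowkey could be built as sketched below. This is an assumption-laden illustration, not the actual storage format: the field widths, the shard count, and the use of an md5-derived shard byte are all hypothetical choices made here for the example.

```python
# Sketch of a combined rowkey: shard + coarse_time + keyword + fine_time.
# The shard byte is derived from the keyword, so all rows for one keyword
# within one coarse bucket land in the same shard and stay sequential,
# while different keywords spread across regions for parallel scans.
import hashlib

NUM_SHARDS = 16  # illustrative shard count, not a Kylin default

def rowkey(keyword, coarse_time, fine_time):
    # Stable shard: first byte of md5(keyword) modulo the shard count.
    shard = hashlib.md5(keyword.encode()).digest()[0] % NUM_SHARDS
    return bytes([shard]) + coarse_time.encode() + keyword.encode() + fine_time.encode()

# The example key from the thread: "20150215 + abc + 1130".
key = rowkey("abc", "20150215", "1130")
```

Because the shard byte comes first, a scan for one keyword and coarse bucket must target the single shard computed from the keyword; a time-range query fans out into one scan per coarse bucket, as in the earlier message.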
