Sure. :) Just a reminder: we still need the "fine granularity time frame" as a suffix to split the long inverted index.

------------------ Original Message ------------------
From: Li Yang <[email protected]>
Sent: 2015-03-03 16:47
To: dev <[email protected]>
Subject: Re: hbase rowkey design of inverted index

Hi Xu,

I understand the "coarse granularity time" concept. My "time frame" means exactly the same thing: not the most granular time, but a bigger range of time. I just think it's shorter and looks better. :-)

On Tue, Mar 3, 2015 at 3:52 PM, Jiang Xu <[email protected]> wrote:

> The key point is that we can't just put the time frame as a prefix before
> the keyword.
> Since the time frame is normally small for high-volume data, if the scan
> time range is big, the hbase scan range is too big and we have to skip
> other keywords in this range. So the scan performance is bad.
> So, I suggest using "coarse granularity time frame + keyword + fine
> granularity time frame". In this schema, you can call hbase several times,
> once per coarse granularity time frame, each with a small scan range on
> the fine granularity time frame.
>
> ------------------ Original Message ------------------
> From: Li Yang <[email protected]>
> Sent: 2015-03-03 15:17
> To: dev <[email protected]>
> Subject: Re: hbase rowkey design of inverted index
>
> Agree on the scan pattern of the inverted index: it is typically a keyword
> scan within a time range. In addition, data shall be sharded so parallel
> scans on multiple regions can cut down response time. The final rowkey may
> look like "shard + time frame + keyword".
>
> These ideas will be put into the next version of the inverted index storage.
>
> On Sun, Mar 1, 2015 at 9:23 PM, Jiang Xu <[email protected]> wrote:
>
> > Basically, we have 2 ways to design the hbase rowkey for the inverted index:
> > 1. "time + keyword":
> > It splits the index by time, which can avoid hbase region merge. But one
> > query may scan lots of scattered rows that are not sequential.
> > 2. "keyword + time":
> > It can guarantee the sequential scan of a keyword. But it may trigger the
> > hbase region merge since one keyword may be scattered in many regions.
> >
> > So, we can merge these 2 solutions like this: "coarse granularity time +
> > keyword + fine granularity time". For example, "20150215 + abc + 1130".
> > In this way, we use "coarse granularity time" to avoid hbase region merge
> > and "fine granularity time" to guarantee the sequential scan.
> >
> > Users can define different "coarse granularity time" & "fine granularity
> > time" for different cases. If the inverted index is only used in the
> > real-time case, we can define a small "coarse granularity time" (e.g. 1
> > day). If the inverted index will cover the full data set, we can define a
> > big "coarse granularity time" (e.g. 1 month).
> >
> > Thanks
> > Jiang Xu
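
To make the "coarse granularity time + keyword + fine granularity time" layout concrete, here is a minimal Python sketch of how such a rowkey and the per-frame scan ranges could be built. It is only an illustration under assumptions that are not in the thread: the coarse frame is one day (`YYYYMMDD`), the fine frame is the minute of the day (`HHMM`), the fields are joined with a NUL byte (`SEP`) that keywords never contain, the rowkey ends at the fine frame, and the function names `row_key` and `scan_ranges` are hypothetical. The shard prefix mentioned above is left out for brevity. It relies on the HBase scan convention that the start row is inclusive and the stop row is exclusive.

```python
from datetime import datetime, timedelta
from typing import Iterator, Tuple

# Hypothetical field separator; assumes keywords never contain a NUL byte.
SEP = b"\x00"


def row_key(ts: datetime, keyword: str) -> bytes:
    """Build 'coarse time + keyword + fine time', e.g. 20150215 + abc + 1130."""
    coarse = ts.strftime("%Y%m%d").encode()  # assumed coarse frame: one day
    fine = ts.strftime("%H%M").encode()      # assumed fine frame: minute of day
    return coarse + SEP + keyword.encode("utf-8") + SEP + fine


def scan_ranges(keyword: str, start: datetime,
                end: datetime) -> Iterator[Tuple[bytes, bytes]]:
    """Yield one (start_row, stop_row) pair per coarse frame in [start, end].

    Each pair is a small sequential range for a single keyword, so one query
    becomes several short scans instead of one wide scan across keywords.
    """
    kw = keyword.encode("utf-8")
    day = start.date()
    while day <= end.date():
        prefix = day.strftime("%Y%m%d").encode() + SEP + kw + SEP
        lo = start.strftime("%H%M") if day == start.date() else "0000"
        hi = end.strftime("%H%M") if day == end.date() else "2359"
        # The stop row is exclusive, so append one byte to still include `hi`.
        yield prefix + lo.encode(), prefix + hi.encode() + b"\x00"
        day += timedelta(days=1)
```

For a query on "abc" from 2015-02-15 23:00 to 2015-02-16 01:00, `scan_ranges` yields two ranges, one per day. Each range could be passed to a separate scan, and the scans could run in parallel.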
