Re: hbase rowkey design of inverted index

Jiang Xu Tue, 03 Mar 2015 01:34:45 -0800

Sure. :)
Just a reminder: we still need the "fine granularity time frame" as a suffix
to split the long inverted index.
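
A minimal sketch of what that split could look like. It assumes a one-day coarse frame, a one-minute fine frame, and a zero-byte separator, and it leaves out the shard prefix; `make_rowkey`, `split_postings` and `SEP` are made-up names for illustration, not existing code:

```python
from collections import defaultdict
from datetime import datetime

SEP = b"\x00"  # hypothetical separator; assumes keywords never contain this byte


def make_rowkey(keyword: str, ts: datetime) -> bytes:
    coarse = ts.strftime("%Y%m%d")  # coarse frame: one day (assumed)
    fine = ts.strftime("%H%M")      # fine frame: one minute (assumed)
    return (coarse.encode("ascii") + SEP
            + keyword.encode("utf-8") + SEP
            + fine.encode("ascii"))


def split_postings(keyword, postings):
    """Group (timestamp, doc_id) postings into one row per fine time frame."""
    rows = defaultdict(list)
    for ts, doc_id in postings:
        rows[make_rowkey(keyword, ts)].append(doc_id)
    return dict(rows)


postings = [
    (datetime(2015, 2, 15, 11, 30, 5), 1),
    (datetime(2015, 2, 15, 11, 30, 40), 2),
    (datetime(2015, 2, 15, 11, 31, 0), 3),
]
rows = split_postings("abc", postings)
for key in sorted(rows):
    print(key, rows[key])
```

Without the fine suffix, all three postings would land in a single row for "abc" on that day; with it, the row is split per minute and each piece stays small.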


------------------ Original Message ------------------
From: Li Yang <[email protected]>
Sent: 2015-03-03 16:47
To: dev <[email protected]>
Subject: Re: hbase rowkey design of inverted index



Hi Xu, I understand the "coarse granularity time" concept. My "time frame"
means exactly the same thing: not the most granular time, but a bigger range
of time. I just think it's shorter and looks better. :-)


On Tue, Mar 3, 2015 at 3:52 PM, Jiang Xu <[email protected]> wrote:

> The key point is that we can't just put the time frame as a prefix before
> the keyword.
> The time frame is normally small for high-volume data, so if the query's
> time range is big, the hbase scan range is also big and the scan has to
> skip over all the other keywords in that range. Scan performance is then
> bad.
> So I suggest using "coarse granularity time frame + keyword + fine
> granularity time frame". With this schema, you can call hbase several
> times, once per coarse granularity time frame, each with a small scan range
> on the fine granularity time frame.
>
> ------------------ Original Message ------------------
> From: Li Yang <[email protected]>
> Sent: 2015-03-03 15:17
> To: dev <[email protected]>
> Subject: Re: hbase rowkey design of inverted index
>
>
>
> Agreed on the scan pattern of the inverted index: it is typically a keyword
> scan within a time range. In addition, data should be sharded so that
> parallel scans on multiple regions can cut down response time. The final
> rowkey may look like "shard + time frame + keyword".
>
> These ideas will be put into the next version of the inverted index storage.
>
> On Sun, Mar 1, 2015 at 9:23 PM, Jiang Xu <[email protected]> wrote:
>
> > Basically, we have 2 ways to design the hbase rowkey for the inverted
> > index:
> > 1. "time + keyword":
> > It splits the index by time, which can avoid hbase region merges. But one
> > query may have to scan lots of scattered, non-sequential rows.
> > 2. "keyword + time":
> > It guarantees a sequential scan per keyword. But it may trigger hbase
> > region merges, since one keyword may be scattered across many regions.
> >
> >
> > So, we can combine these 2 solutions like this: "coarse granularity time +
> > keyword + fine granularity time". For example, "20150215 + abc + 1130". In
> > this way, we use "coarse granularity time" to avoid hbase region merges
> > and "fine granularity time" to guarantee the sequential scan.
> >
> >
> > Users can define different "coarse granularity time" & "fine granularity
> > time" for different cases. If the inverted index is only used in the
> > real-time case, we can define a small "coarse granularity time" (e.g. 1
> > day). If the inverted index will cover the full data set, we can define a
> > big "coarse granularity time" (e.g. 1 month).
> >
> >
> > Thanks
> > Jiang Xu
>
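
The scan pattern discussed in the thread (several hbase calls, one per coarse time frame, each with a small range on the fine time frame) can be sketched as follows. This is only an illustration under assumptions that are not in the thread: a one-day coarse frame, a one-minute fine frame, and a zero-byte separator so that one keyword can never be a prefix of another. `make_rowkey`, `plan_scans` and `SEP` are hypothetical names, and the code only computes the row ranges; it does not call hbase.

```python
from datetime import datetime, timedelta

SEP = b"\x00"  # hypothetical separator; assumes keywords never contain this byte


def make_rowkey(coarse: str, keyword: str, fine: str) -> bytes:
    # Layout from the thread: coarse time + keyword + fine time.
    return (coarse.encode("ascii") + SEP
            + keyword.encode("utf-8") + SEP
            + fine.encode("ascii"))


def plan_scans(keyword: str, start: datetime, end: datetime):
    """Return one (start_row, stop_row) pair per day touched by [start, end].

    start_row is inclusive and stop_row is exclusive. Both time bounds are
    treated as inclusive at minute resolution.
    """
    ranges = []
    day = start.date()
    while day <= end.date():
        coarse = day.strftime("%Y%m%d")
        fine_lo = start.strftime("%H%M") if day == start.date() else "0000"
        fine_hi = end.strftime("%H%M") if day == end.date() else "2359"
        start_row = make_rowkey(coarse, keyword, fine_lo)
        # Appending a zero byte gives the smallest key greater than the last
        # wanted row, which turns the inclusive bound into an exclusive one.
        stop_row = make_rowkey(coarse, keyword, fine_hi) + b"\x00"
        ranges.append((start_row, stop_row))
        day += timedelta(days=1)
    return ranges


scans = plan_scans("abc", datetime(2015, 2, 15, 11, 30),
                   datetime(2015, 2, 17, 8, 0))
for start_row, stop_row in scans:
    print(start_row, stop_row)
```

Each pair covers only the rows of "abc" inside one day, so a scan never has to skip over other keywords. The price is one scan per coarse frame, which is why a bigger coarse frame (e.g. 1 month) suits queries over the full data set.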
