Re: using combiner vs. building stats cache

Russ Weeks Wed, 26 Aug 2015 21:20:01 -0700

You only pay the cost at scan-time for values that haven't been compacted.
If you snapshot your stats table after a compaction and then do your scans
on the snapshot, then you would get the best of options 1 and 2. The
drawback is that the values returned by the scan would not be totally
up-to-date.
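
To make the cost model concrete, here is a rough sketch in plain Java. It is a toy in-memory model, not Accumulo code: the names CombinerCostSketch, StatsTable, pendingValues, compact and snapshot are all made up for illustration, and the snapshot is modeled as a simple deep copy. It only shows that a scan sums whatever values are still uncompacted, that a compaction folds them into one value, and that a snapshot taken afterwards stays cheap to scan but does not see later inserts.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class CombinerCostSketch {

    /** Toy stand-in for a stats table with a summing combiner (hypothetical, not an Accumulo API). */
    static class StatsTable {
        // word -> values that have not yet been folded together
        private final Map<String, List<Long>> entries = new TreeMap<>();

        /** Insert-time work is just appending one more value, e.g. ['foo', 1]. */
        void insert(String word, long count) {
            entries.computeIfAbsent(word, k -> new ArrayList<>()).add(count);
        }

        /** Scan-time work: sum every value still stored for the word. */
        long scan(String word) {
            long sum = 0;
            for (long v : entries.getOrDefault(word, Collections.emptyList())) {
                sum += v;
            }
            return sum;
        }

        /** Number of values a scan of this word has to combine. */
        int pendingValues(String word) {
            return entries.getOrDefault(word, Collections.emptyList()).size();
        }

        /** Compaction folds all values for each word into a single summed value. */
        void compact() {
            for (Map.Entry<String, List<Long>> e : entries.entrySet()) {
                long sum = 0;
                for (long v : e.getValue()) {
                    sum += v;
                }
                List<Long> folded = new ArrayList<>();
                folded.add(sum);
                e.setValue(folded);
            }
        }

        /** Snapshot modeled as a deep copy; later inserts into the live table are not visible in it. */
        StatsTable snapshot() {
            StatsTable copy = new StatsTable();
            for (Map.Entry<String, List<Long>> e : entries.entrySet()) {
                copy.entries.put(e.getKey(), new ArrayList<>(e.getValue()));
            }
            return copy;
        }
    }

    public static void main(String[] args) {
        StatsTable live = new StatsTable();
        for (int i = 0; i < 3; i++) {
            live.insert("foo", 1);
        }
        System.out.println("before compaction: values to combine = " + live.pendingValues("foo"));

        live.compact();
        StatsTable snap = live.snapshot();
        live.insert("foo", 1); // arrives after the snapshot was taken

        System.out.println("live: count = " + live.scan("foo")
                + ", values to combine = " + live.pendingValues("foo"));
        System.out.println("snapshot: count = " + snap.scan("foo")
                + ", values to combine = " + snap.pendingValues("foo"));
    }
}
```

In this model the snapshot always answers from a single precomputed value per word (like option 1), while the live table keeps taking cheap inserts (like option 2); the price is that the snapshot's count lags behind the live one.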


Regards,
-Russ

On Wed, Aug 26, 2015 at 9:06 PM Dylan Hutchison <[email protected]> wrote:

> Go for option #2 and use the combiners.  It's one of the core features of
> Accumulo and the overhead at insert-time is minimal.  Developer time
> overhead is also minimal-- add a couple lines next to where you make your
> mutations and you're done.
>
> Regards, Dylan
>
> On Wed, Aug 26, 2015 at 6:11 PM, z11373 <[email protected]> wrote:
>
> > Hi,
> > Apologies if this question has been asked before (I am fairly sure it
> > has).
> > I am building a triple store and need to build a stats table which will
> > be used for query optimization (i.e. re-ordering the query triple patterns).
> > There may be more than two solutions for this, but the two I know are:
> > 1. Manually rebuild the whole stats table, e.g. once per day.
> > This option would be expensive because we recalculate all rows in the
> > master table, but the end result is no further computation when we
> > retrieve the stat info. For example, we'll just query the stats table for
> > the word 'foo', and it'll return a single row with the total item count
> > for that word.
> >
> > 2. Use Accumulo combiner
> > With this option, we could simply add a counter to the stats table (i.e.
> > insert ['foo', 1]) whenever we insert 'foo' into the master table. When we
> > want to get the stat info at query time, Accumulo will aggregate all
> > the counts for the word 'foo' in map-reduce fashion.
> > For #2, we pay the cost at scan time, but if the rows containing the word
> > 'foo' number only in the hundreds, I guess it won't be so bad, because the
> > aggregation is done on the server side (and it would be optimized by
> > Accumulo's design).
> >
> > I prefer option #2, but I am not sure how expensive it is in Accumulo,
> > especially since we'll run a large number of queries per day, whereas the
> > stats recalculation process runs only once per day. Any comments on this?
> > Please let me know if my problem statement or the question is unclear.
> >
> >
> > Thanks,
> > Z
> >
> >
> >
> > --
> > View this message in context:
> > http://apache-accumulo.1065345.n5.nabble.com/using-combiner-vs-building-stats-cache-tp14979.html
> > Sent from the Developers mailing list archive at Nabble.com.
> >
>
