You only pay the cost at scan-time for values that haven't been compacted yet. If you snapshot your stats table after a compaction and then run your scans against the snapshot, you get the best of options 1 and 2. The drawback is that the values returned by the scan would not be fully up to date.
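To make the trade-off concrete, here is a minimal sketch of the idea as a toy in-memory model. It is not the Accumulo API: the class `StatsTableSketch` and all of its methods are hypothetical names made up for illustration. It only assumes that a summing combiner adds up whatever entries are still stored for a key at scan time, and that a compaction folds them into a single entry.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Toy model of a summing combiner on a stats table. Hypothetical, NOT the Accumulo API. */
final class StatsTableSketch {
    // For each word, the count entries that are stored but not yet combined.
    private final Map<String, List<Long>> entries = new HashMap<>();

    /** Insert-time work is a plain append, e.g. add("foo", 1). */
    void add(String word, long count) {
        entries.computeIfAbsent(word, k -> new ArrayList<>()).add(count);
    }

    /** Scan-time work: sum every entry still stored for the word. */
    long scan(String word) {
        long sum = 0;
        for (long c : entries.getOrDefault(word, Collections.<Long>emptyList())) {
            sum += c;
        }
        return sum;
    }

    /** How many entries a scan of this word has to combine. */
    int storedEntries(String word) {
        return entries.getOrDefault(word, Collections.<Long>emptyList()).size();
    }

    /** Compaction: fold each word's entries into a single summed entry. */
    void compact() {
        for (Map.Entry<String, List<Long>> e : entries.entrySet()) {
            long sum = 0;
            for (long c : e.getValue()) {
                sum += c;
            }
            List<Long> folded = new ArrayList<>();
            folded.add(sum);
            e.setValue(folded);
        }
    }

    /** Snapshot: an independent copy that does not see later inserts. */
    StatsTableSketch snapshot() {
        StatsTableSketch copy = new StatsTableSketch();
        for (Map.Entry<String, List<Long>> e : entries.entrySet()) {
            copy.entries.put(e.getKey(), new ArrayList<>(e.getValue()));
        }
        return copy;
    }

    public static void main(String[] args) {
        StatsTableSketch live = new StatsTableSketch();
        live.add("foo", 1);
        live.add("foo", 1);
        live.add("foo", 1);
        System.out.println(live.storedEntries("foo")); // 3 entries to combine per scan

        live.compact();
        StatsTableSketch snap = live.snapshot();
        live.add("foo", 1); // arrives after the snapshot was taken

        System.out.println(snap.storedEntries("foo")); // 1: scans are as cheap as option 1
        System.out.println(snap.scan("foo"));          // 3: slightly stale
        System.out.println(live.scan("foo"));          // 4: current, combines 2 entries
    }
}
```

In this model the snapshot always answers from a single pre-summed entry, while the live table has to combine only what arrived since the last compaction.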

Regards,
-Russ

On Wed, Aug 26, 2015 at 9:06 PM Dylan Hutchison <[email protected]> wrote:

> Go for option #2 and use the combiners. It's one of the core features
> of Accumulo and the overhead at insert-time is minimal. Developer time
> overhead is also minimal: add a couple of lines next to where you make
> your mutations and you're done.
>
> Regards, Dylan
>
> On Wed, Aug 26, 2015 at 6:11 PM, z11373 <[email protected]> wrote:
>
> > Hi,
> > Apologies if this question has been asked before (which I am fairly
> > certain it has).
> > I am building a triple store, and need to build the stats table which
> > will be used for query optimization (i.e. to re-order the query
> > triple patterns).
> > There may be more than two solutions for this, but the two I know are:
> >
> > 1. Manually rebuild the whole stats table; this could be run once per
> > day, for example.
> > This option would be expensive because we are re-calculating all rows
> > in the master table, but the end result is no more computation when
> > we retrieve the stat info. For example, we'll just query the stats
> > table for the word 'foo', and it'll return a single row with the
> > total items for that word.
> >
> > 2. Use an Accumulo combiner.
> > With this option, we could simply add the counter to the stats table
> > (i.e. insert ['foo', 1]) whenever we insert 'foo' into the master
> > table. When we want to get the stat info at query time, Accumulo will
> > aggregate all the counts for the word 'foo' in map-reduce fashion.
> > For #2, we pay the cost at scan time, but if the rows that have the
> > word 'foo' only number in the hundreds, I guess it won't be so bad,
> > because that aggregation will be done on the server side (and it'd be
> > optimized due to Accumulo's design).
> >
> > I prefer option #2, but I'm not sure how expensive it is on Accumulo,
> > especially since we'll run a large number of queries per day,
> > compared with the stats re-calculation process, which runs once per
> > day. Any comments on this?
> > Please let me know if my problem statement or the question is
> > unclear.
> >
> > Thanks,
> > Z
> >
> > --
> > View this message in context:
> > http://apache-accumulo.1065345.n5.nabble.com/using-combiner-vs-building-stats-cache-tp14979.html
> > Sent from the Developers mailing list archive at Nabble.com.
