Sounds like you have the idea now, Z. There are three places an iterator can be applied: scan time, minor compaction time, and major compaction time. Minor compactions help your case a lot -- when enough entries have been written to a tablet server that it needs to dump them to a new RFile in HDFS, the minor compaction iterators run on the entries as they stream into the RFile. This means each RFile holds only one entry per unique (row, column family, column qualifier) tuple.
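For reference, here's roughly what attaching a SummingCombiner at all three scopes looks like (the table name "stats" and column family "count" are just placeholders for your setup, and "conn" is an existing Connector):

    import java.util.Collections;
    import java.util.EnumSet;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.IteratorSetting;
    import org.apache.accumulo.core.iterators.IteratorUtil.IteratorScope;
    import org.apache.accumulo.core.iterators.user.SummingCombiner;

    IteratorSetting is = new IteratorSetting(10, "sum", SummingCombiner.class);
    // Values are longs encoded as strings, e.g. "1"
    SummingCombiner.setEncodingType(is, SummingCombiner.Type.STRING);
    SummingCombiner.setColumns(is,
        Collections.singletonList(new IteratorSetting.Column("count")));

    // Attach at all three scopes so entries are combined as they are
    // flushed (minc), rewritten (majc), and read (scan).
    conn.tableOperations().attachIterator("stats", is,
        EnumSet.of(IteratorScope.minc, IteratorScope.majc, IteratorScope.scan));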
Entries with the same (row, column family, column qualifier) in distinct RFiles will get combined at the next major compaction, or on the fly during the next scan.

> For example, let say there are 100 rows of [foo, 1], it will actually be
> 'combined' to a single row [foo, 100]?

Careful -- Accumulo's combiners combine Keys with identical row, column family, and column qualifier. You'd have to write a fancier iterator if you want to combine all the entries that share the same row; there's a rough sketch at the end of this message. Let us know if you need help doing that.

On Thu, Aug 27, 2015 at 3:09 PM, z11373 <[email protected]> wrote:

> Thanks again Russ!
>
> "but it might not be in this case if most of the data has already been
> combined"
> Does this mean Accumulo actually combine and persist the combined result
> after the scan/compaction (depending on which op the combiner is applied)?
> For example, let say there are 100 rows of [foo, 1], it will actually be
> 'combined' to a single row [foo, 100]? If that is the case, then combiner
> is not expensive.
>
> Wow! that's brilliant using -1 approach, I didn't even think about it
> before. Yes, this will work for my case because i only need to know the
> count.
>
> Thanks,
> Z
>
>
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/using-combiner-vs-building-stats-cache-tp14979p14988.html
> Sent from the Developers mailing list archive at Nabble.com.
>
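Here's the row-combining sketch mentioned above. It is untested, the class name is made up, and it assumes values are longs encoded as strings (the STRING encoding). A production version would also need to handle seeks that land in the middle of a row (see how WholeRowIterator re-seeks to row boundaries) and implement deepCopy:

    import java.io.IOException;
    import java.nio.charset.StandardCharsets;
    import java.util.Collection;
    import org.apache.accumulo.core.data.ByteSequence;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.PartialKey;
    import org.apache.accumulo.core.data.Range;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.iterators.WrappingIterator;

    /**
     * Sketch: sums string-encoded long values across ALL columns in a
     * row, emitting one entry per row (keyed by the row's first Key).
     */
    public class RowSummingIterator extends WrappingIterator {

      private Key topKey;
      private Value topValue;

      @Override
      public boolean hasTop() {
        return topKey != null;
      }

      @Override
      public Key getTopKey() {
        return topKey;
      }

      @Override
      public Value getTopValue() {
        return topValue;
      }

      @Override
      public void next() throws IOException {
        // The source is already positioned at the first entry of the
        // next row, so just combine that row.
        combineRow();
      }

      @Override
      public void seek(Range range, Collection<ByteSequence> columnFamilies,
          boolean inclusive) throws IOException {
        super.seek(range, columnFamilies, inclusive);
        combineRow();
      }

      // Consume every entry in the current row and emit a single summed entry.
      private void combineRow() throws IOException {
        topKey = null;
        topValue = null;
        if (!getSource().hasTop())
          return;
        Key rowKey = new Key(getSource().getTopKey());
        long sum = 0;
        while (getSource().hasTop()
            && getSource().getTopKey().equals(rowKey, PartialKey.ROW)) {
          sum += Long.parseLong(getSource().getTopValue().toString());
          getSource().next();
        }
        topKey = rowKey;
        topValue = new Value(Long.toString(sum).getBytes(StandardCharsets.UTF_8));
      }
    }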
