On Thu, Aug 27, 2015 at 9:33 AM z11373 <[email protected]> wrote:

> Russ: I like your idea (indeed the best of both worlds), so during
> compaction time we can store that stats info in another table (this time
> it will be only a single row, so it won't affect query time). Can I add
> the code that inserts into the other table in the reduce() of my custom
> combiner, or is there a better way?
No, don't trigger the table snapshot or compaction from inside your
combiner. I'd run it as a scheduled task via cron or something like that. A
full major compaction is generally seen as a big job, but it might not be in
this case if most of the data has already been combined. Alternatively, if
you can isolate a range of rows to be compacted, you can pass that range to
TableOperations.compact to speed things up.

I think the only way to guarantee that your scans of the snapshot are
dealing with totally compacted data is to compact after the snapshot. But if
you want both the original table and the snapshot to get the benefit of the
compaction, you'd want to compact before the snapshot and accept the risk
that there might be a little bit of uncompacted data in the snapshot.
Honestly, this is how I *think* it should all work, but there are probably
people on this list who are more familiar with combiners, snapshots, and
compactions than I am.

> Let's say we have a table called TEMP_STATS to which we apply the custom
> combiner. During ingestion we simply insert a row, i.e. ['foo', 1], into
> the table. Next time we insert another ['foo', 1], and so on. Say we have
> 10 rows of 'foo'; reading that word would return ['foo', 10] (thanks to
> the combiner). Now I want to delete only one row, so that it returns
> ['foo', 9] instead. What is the best way to do this?

If all you're doing in your stats table is tracking counts, then you can
insert 'foo':-1 and the count will be adjusted correctly. If you're also
tracking mins and maxes, you'll need a different approach... which I would
be fascinated to understand, because it seems like a very hard problem.

-Russ

> One option I can think of is to add another identifier, i.e. a sequence
> number, so it would insert ['foo', 1, 1], ['foo', 2, 1], and so on (the
> second number is the seq# and can be stored as the column qualifier).
> Then I'd have to modify the combiner so it also returns the highest seq#
> (i.e. ['foo', 10, 10]).
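[Editor's note: Russ's negative-delta suggestion works because a summing combiner simply folds every stored version of a cell with addition, so a -1 mutation acts as a decrement at read time. A minimal plain-Java sketch of that folding, with illustrative names — this models the combiner's reduce step, not the actual Accumulo API:]

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class SummingSketch {
    // Model of a summing combiner's reduce: all deltas written for a key
    // are added together at scan/compaction time.
    static long reduce(List<Long> deltas) {
        return deltas.stream().mapToLong(Long::longValue).sum();
    }

    public static void main(String[] args) {
        Map<String, List<Long>> table = new HashMap<>();
        // Ten ingests of ['foo', 1] ...
        for (int i = 0; i < 10; i++) {
            table.computeIfAbsent("foo", k -> new ArrayList<>()).add(1L);
        }
        // ... followed by one corrective ['foo', -1] mutation.
        table.get("foo").add(-1L);
        System.out.println(reduce(table.get("foo"))); // prints 9
    }
}
```

[The key point: no cell is ever rewritten or deleted; the "delete one" is just another insert, which is why it plays well with combiners.]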
> When deleting only one item, I could just put a delete for 'foo', :10, and
> it would mark only that row as deleted. Any other better approach?
>
> Thanks,
> Z
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/using-combiner-vs-building-stats-cache-tp14979p14984.html
> Sent from the Developers mailing list archive at Nabble.com.
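[Editor's note: for concreteness, the sequence-number scheme z11373 describes above can be modeled in plain Java. Names are illustrative and this simulates per-cell deletes in a sorted map rather than using the Accumulo API; the point is that a delete targeted at one (row, seq#) cell removes exactly one increment:]

```java
import java.util.TreeMap;
import java.util.stream.IntStream;

public class SeqDeleteSketch {
    // Combiner-style read over one row's cells, where the column
    // qualifier is the seq# and the value is the increment:
    // count = sum of values; the highest seq# is the last qualifier.
    static int count(TreeMap<Integer, Integer> cells) {
        return cells.values().stream().mapToInt(Integer::intValue).sum();
    }

    public static void main(String[] args) {
        TreeMap<Integer, Integer> fooCells = new TreeMap<>();
        // Insert ['foo', 1, 1], ['foo', 2, 1], ... ['foo', 10, 1].
        IntStream.rangeClosed(1, 10).forEach(seq -> fooCells.put(seq, 1));
        System.out.println(count(fooCells) + " " + fooCells.lastKey()); // 10 10

        // Delete only the cell at the highest seq# (one tombstone).
        fooCells.remove(fooCells.lastKey());
        System.out.println(count(fooCells)); // 9
    }
}
```

[One observation on the design: after removing seq# 10, the highest remaining seq# is 9, so the same "delete at highest seq#" step can be repeated for further decrements.]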
