Thanks Dylan and Russ!

Dylan: I guess that option is OK if the word has only a few hundred entries in total, but if the word 'foo' has a million, Accumulo still has to go through all one million items to sum the count. So I think it will be expensive in that case, even though it doesn't have to return those rows to the client.
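To make the cost concrete, here is a toy, self-contained Java model of what a scan-time combiner does for one word (this is not the Accumulo API, just a sketch of the access pattern): the server must iterate every partial count before it can emit the single summed value, so the work is proportional to the number of un-compacted entries even though only one cell comes back.

```java
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

public class ScanTimeSum {
    // Model of scan-time combining: sum every partial count for a row.
    // The loop body runs once per stored entry, which is the cost the
    // client never sees but the tablet server still pays.
    static long combine(Iterator<Long> partialCounts) {
        long sum = 0;
        while (partialCounts.hasNext()) {
            sum += partialCounts.next();
        }
        return sum;
    }

    public static void main(String[] args) {
        // A million partial counts of 1 for the word 'foo'.
        List<Long> partials = Collections.nCopies(1_000_000, 1L);
        long sum = combine(partials.iterator());
        System.out.println("sum=" + sum); // one value out, a million reads in
    }
}
```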
Russ: I like your idea (indeed the best of both worlds). So at compaction time we can store the stats in another table (but this time it will be a single row, so it won't affect query time). Can I add the code that inserts into the other table in the reduce() of my custom combiner, or is there a better way?

Another question: I'd think a combiner would also be perfect for the delete scenario, since it doesn't need to recalculate the whole thing. However, how do I actually delete only one row from the rows that would be combined? Let me give an example to be clear.

Say we have a table called TEMP_STATS with the custom combiner applied. During ingestion we simply insert a row, i.e. ['foo', 1]. The next time we insert another ['foo', 1], and so on. Say we have 10 rows of 'foo', so reading that word returns ('foo', 10), thanks to the combiner. Now I want to delete only one row, so that it returns ('foo', 9) instead. What is the best way to do this?

One option I can think of is to add another identifier, i.e. a sequence number, so the inserts become ['foo', 1, 1], ['foo', 2, 1], and so on (the second number is the seq# and can be stored as the column qualifier). Then I'd modify the combiner so it also returns the highest seq# (i.e. ('foo', 10, 10)). When deleting one item only, I could just put a delete for 'foo' at seq# 10, and it would mark only that row as deleted. Is there a better approach?

Thanks,
Z

--
View this message in context: http://apache-accumulo.1065345.n5.nabble.com/using-combiner-vs-building-stats-cache-tp14979p14984.html
Sent from the Developers mailing list archive at Nabble.com.
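P.S. A quick self-contained Java model of the seq# idea above, in case it helps the discussion. This is not Accumulo code: a TreeMap stands in for the sorted cells of one row ('foo'), with the seq# playing the role of the column qualifier, and combine() plays the role of the combiner returning (sum, highest seq#). Deleting the cell with the highest seq# drops the visible count by exactly one.

```java
import java.util.NavigableMap;
import java.util.TreeMap;

public class SeqCountSketch {
    // Stand-in for the cells of row 'foo' in TEMP_STATS:
    // key = seq# (modeled as the column qualifier), value = partial count.
    static final NavigableMap<Integer, Long> fooCells = new TreeMap<>();

    // Ingest one partial count under the next sequence number.
    static void insert(long count) {
        int nextSeq = fooCells.isEmpty() ? 1 : fooCells.lastKey() + 1;
        fooCells.put(nextSeq, count);
    }

    // Combiner model: returns { sum of counts, highest seq# seen }.
    static long[] combine() {
        long sum = 0;
        for (long v : fooCells.values()) sum += v;
        long maxSeq = fooCells.isEmpty() ? 0 : fooCells.lastKey();
        return new long[] { sum, maxSeq };
    }

    // Delete one unit: remove only the cell with the highest seq#.
    static void deleteLatest() {
        if (!fooCells.isEmpty()) fooCells.remove(fooCells.lastKey());
    }

    public static void main(String[] args) {
        for (int i = 0; i < 10; i++) insert(1); // ['foo', 1, 1] .. ['foo', 10, 1]
        long[] before = combine();
        System.out.println("sum=" + before[0] + " maxSeq=" + before[1]);
        deleteLatest(); // delete 'foo' at seq# 10
        long[] after = combine();
        System.out.println("sum=" + after[0] + " maxSeq=" + after[1]);
    }
}
```

One caveat the model makes visible: this only supports deleting the most recently inserted unit (the highest seq#); deleting an arbitrary earlier cell would also work, but then the "highest seq#" returned by the combiner no longer tells you how many cells exist.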
