If you were running a batch job just to recompute the stats, I'd probably
build a new table and then rename it into place, replacing your old stats
table. The hard part is making sure clients that are still writing data
end up writing to the new table. Can you quiesce ingest temporarily?
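
A rough sketch of the rebuild-and-rename idea, using plain Python dicts as
stand-ins for tables. Nothing here is Accumulo API: `rebuild_stats`,
`swap_in`, the `_new`/`_old` suffixes, and the notion that each main-table
row contributes one stats key are all assumptions made for illustration.

```python
def rebuild_stats(main_rows):
    """Recount from scratch. main_rows is an iterable of stats keys,
    one per row in the (hypothetical) main table."""
    fresh = {}
    for key in main_rows:
        fresh[key] = fresh.get(key, 0) + 1
    return fresh


def swap_in(tables, fresh, name="stats"):
    """Put the rebuilt table under a temporary name, then rename it
    over the old one. The old table is kept under '<name>_old'."""
    tmp = name + "_new"
    tables[tmp] = fresh
    tables[name + "_old"] = tables.pop(name)
    tables[name] = tables.pop(tmp)


# Toy data: the stale stats table and the rows actually in the main table.
tables = {"stats": {"foo": 2, "bar": 3, "test": 1}}
main_rows = ["foo", "bar", "bar", "bar", "bar", "test"]

swap_in(tables, rebuild_stats(main_rows))
```

Any write that lands in the old table after the scan but before the rename
is lost in this model, which is why quiescing ingest matters.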
In short, this is hard to do correctly (and there are low-probability edge
cases that could still leave the table inaccurate). Have you considered
just running the system for a while and seeing how skewed your stats
actually are?
It sounds like the easier problem to solve is determining whether a record
already exists in your system. Then you would know definitively whether
you need to process that record again at all (much less update the stats
table).
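
A minimal sketch of that existence check, again with dicts standing in for
the tables. The names `put` and `delete`, the row-id to term mapping, and
the assumption that an update never changes a row's term are all
hypothetical; a real system would also need the check and the write to be
atomic, which this single-threaded model ignores.

```python
def put(main, stats, row_id, term):
    """Insert or update a row; only a genuinely new row bumps the count."""
    is_new = row_id not in main
    main[row_id] = term
    if is_new:
        stats[term] = stats.get(term, 0) + 1


def delete(main, stats, row_id):
    """Delete a row; only decrement if the row actually existed."""
    if row_id in main:
        term = main.pop(row_id)
        stats[term] = stats.get(term, 0) - 1


main, stats = {}, {}
put(main, stats, "r1", "foo")
put(main, stats, "r1", "foo")   # an 'update': count must not change
delete(main, stats, "r2")       # row never existed: no-op for stats too
count_after_updates = stats["foo"]
delete(main, stats, "r1")       # real delete: count goes back down
```

With the check in place, repeated updates and deletes of missing rows no
longer push the counts off, so there is less for a batch job to repair.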
z11373 wrote:
Revisiting this topic: if I go with option #2, i.e. having a batch job fix
the stats table, I am no longer sure it will work. The stats table already
has the summing combiner enabled, so the batch job can't just write the
correct value; the combiner would add it to the existing value and the
result would be incorrect.
For example:
Current stats table contains:
foo | 2
bar | 3
test | 1
The batch job scans the main table and then updates the stats table. Let's
say the actual stats are foo=1, bar=4, test=1. The final stats table
would then become:
foo | 3
bar | 7
test | 2
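
The arithmetic above can be reproduced with a toy model of a summing
combiner. `summing_combine` is a hypothetical stand-in that just adds up
all values written for a key; it is not the Accumulo class.

```python
def summing_combine(values):
    """Toy summing combiner: the visible value is the sum of all writes."""
    return sum(values)


current = {"foo": 2, "bar": 3, "test": 1}   # what the stats table holds now
actual = {"foo": 1, "bar": 4, "test": 1}    # what the batch job computed

# The batch job writes the actual counts, but the combiner adds them
# to the existing values instead of replacing them.
after = {k: summing_combine([current[k], actual[k]]) for k in current}
print(after)  # {'foo': 3, 'bar': 7, 'test': 2}
```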
It would be correct if the batch job removed the summing combiner from the
table, but then another process (not the batch job) might update a
particular key, overwriting the correct value written by the batch job. We
can't tolerate the system being offline; otherwise we could refresh the
stats during that downtime. Any ideas on how to solve this problem?
Unfortunately there is an inherent problem with how I use the summing
combiner. Adding a key that already exists in the main table behaves just
like an 'update', but my current logic still adds <key>|1 to the stats
table, so if we have many 'updates', some values in the stats table will be
far off. Deleting is similar: it is a no-op for the main table if the key
doesn't exist, but the app logic still adds <key>|-1 to the stats table.
This is the reason we're thinking of having a batch job 'fix' the stats
table, but that has its own problem :-(
Thanks,
Z
--
View this message in context:
http://apache-accumulo.1065345.n5.nabble.com/another-question-on-summing-combiner-tp15238p15351.html
Sent from the Developers mailing list archive at Nabble.com.