That's a good point, Ed, and there hasn't been any other discussion (on
the mailing lists) so you did the right thing bringing this up here.
There is no user administration or monitoring support that would allow
user intervention (aside from restarting a tserver, which is a no-go). If
we're going to include it, as appears to be the case, we need to both
make sure that the cache is bounded in size and have as many people as
possible look at it (since it's such a late addition to the release --
it's common for us to only notice subtleties weeks to months after a
change is made during normal development cycles).
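As a rough illustration of what "bounded in size" could mean here, below
is a generic sketch of an entry-count-bounded LRU cache built on
java.util.LinkedHashMap. It is not the ACCUMULO-3547 or ACCUMULO-3549
code; the class name BoundedCache and the maxEntries limit are made up
for the example, and a real tserver cache would also need to be
thread-safe and would probably want to bound bytes rather than entries.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical size-bounded LRU cache (illustration only, not Accumulo code).
 * Once more than maxEntries entries are present, the least recently
 * accessed entry is evicted on the next insert.
 */
class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private static final long serialVersionUID = 1L;

    // Assumed limit on the number of entries; chosen by the caller.
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder=true makes iteration order least-recently-accessed first,
        // which is what turns the map into an LRU structure.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by LinkedHashMap after each put; returning true drops the
        // eldest (least recently accessed) entry.
        return size() > maxEntries;
    }
}
```

With a limit like this, memory use stays roughly proportional to
maxEntries times the per-entry size no matter how long the tserver runs,
so the worst case no longer depends on uptime.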
Ed Coleman wrote:
Eric commented on the vote for RC3:
- - - -
It would be nice to have
ACCUMULO-3547<https://issues.apache.org/jira/browse/ACCUMULO-3549> in 1.6.2.
We are running at scale with it at the moment, and it has made a huge
improvement. I hate to hold up 1.6.2, though. If it doesn't make it, please
update the ticket to point to 1.6.3.
- - - -
I generally agree with this and it seems that ACCUMULO-3547 will make it into
1.6.2 - which I think is the preferable option. My concerns deal with not
having ACCUMULO-3549 included in 1.6.2 too.
In ACCUMULO-3549 Keith made the assumption that end rows are 10 bytes - I'm not
sure this is a good assumption. If end rows are larger than 10 bytes, then how
much more memory will be required over time? How much faster will it grow?
Without ACCUMULO-3549, what are my options for monitoring / correcting the
situation if the cache grows too large? Will tablet server performance slowly
degrade over time because the cache keeps growing? What will users need to do
to monitor and then correct this? Will we be in a situation where tservers will
start to run out of memory, we will increase the memory allocation if we can,
and just kick the can down the road a little further and performance will just
keep degrading?
Is there a way to trigger the cache to clear short of restarting a tserver?
While not optimal, having a utility / script that slowly walks across the
tservers and clears the cache so that each tserver cache is cleared every 12,
24, 48,... hours may be a bridge until ACCUMULO-3549 is resolved. If this is
the case, it would seem that having the fix in 1.6.3 would also be a priority.
Maybe this has been discussed and resolved, but I want to bring this up to
ensure that the ramifications have been considered and that there is a viable
mitigation strategy that is communicated to the users. Sorry for the doom and
end-of-the-world tone; I was just trying to emphasize the worst-case scenarios
that I could envision. I think ACCUMULO-3547 is an important (even necessary)
improvement and I'm not suggesting that it be removed - I just want to make
sure that I understand the other side effects and know our options.
Ed Coleman