[ 
https://issues.apache.org/jira/browse/ACCUMULO-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15264212#comment-15264212
 ] 

ASF GitHub Bot commented on ACCUMULO-4229:
------------------------------------------

Github user ShawnWalker commented on the pull request:

    https://github.com/apache/accumulo/pull/96#issuecomment-215765483
  
    Instead of trying to wrap things up so that these classes can pseudo-safely 
hold references to instances of `TabletLocator`, I feel that it would be much 
simpler to establish the rule "thou shalt not hold a reference to this for an 
extended period", and make changes appropriately.  
    
    By which I mean:
    1. Refactor `ConditionalWriterImpl` to obtain a temporary `TabletLocator` 
reference at the beginning of each method that needs it (`queue(...)`, 
`sendToServer(...)`, `invalidateSession(...)`), instead of obtaining a single 
reference in the constructor.
    2. Modify `TimeoutTabletLocator` so that its constructor takes a 
`ClientContext` and a tableId, like your current pull request already does.  
But have each of it's methods invoke `TabletLocator.getLocator(...)` anew.
    
    While I'm dubious that it'd be a real concern, if we felt that this would 
have significant performance impact (due to lock contention), then it should be 
possible to partially ameliorate the hit by using a Guava Cache or even a 
`ConcurrentHashMap`.


> BatchWriter Locator cache out-of-sync when shared with tserver
> --------------------------------------------------------------
>
>                 Key: ACCUMULO-4229
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4229
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Dylan Hutchison
>            Assignee: Dylan Hutchison
>
> BatchWriters that run a long time have write rates that sometimes 
> mysteriously decrease after the table it is writing to goes through a major 
> compaction or a split.  The decrease can be as bad as reducing throughput to 
> 0.
> This was first first mentioned in this [email 
> thread|https://mail-archives.apache.org/mod_mbox/accumulo-user/201406.mbox/%3ccamz+duvmmhegon9ejehr9h_rrpp50l2qz53bbdruvo0pira...@mail.gmail.com%3E]
>  for major compactions.  
> I discovered this in this [email 
> thread|https://mail-archives.apache.org/mod_mbox/accumulo-dev/201604.mbox/%3CCAPx%3DJkaY7fVh-U0O%2Bysx2d98LOGMcA4oEQOYgoPxR-0em4hdvg%40mail.gmail.com%3E]
>  for splits.  See the thread for some log messages.
> I turned on TRACE logs and I think I pinned it down: the TabletLocator cached 
> by a BatchWriter gets out of sync with the static cache of TabletLocators.
> # The TabletServerBatchWriter caches a TabletLocator from the static 
> collection of TabletLocators when it starts writing.  Suppose it is writing 
> to tablet T1.
> # The TabletServerBatchWriter uses its locally cached TabletLocator inside 
> its `binMutations` method for its entire lifespan; this cache is never 
> refreshed or updated to sync up with the static collection of TabletLocators.
> # Every hour, the static collection of TabletLocators clears itself.  The 
> next call to get a TabletLocator from the static collection allocates a new 
> TabletLocator.  Unfortunately, the TabletServerBatchWriter does not reflect 
> this change and continues to use the old, locally cached TabletLocator.
> # Tablet T1 splits into T2 and T3, which closes T1.  As such, it no longer 
> exists and the tablet server that receives the entries meant to go to T1 all 
> fail to write because T1 is closed.
> # The TabletServerBatchWriter receives the response from the tablet server 
> that all entries failed to write.  It invalidates the cache of the *new* 
> TabletLocator obtained from the static collection of TabletLocators.  The old 
> TabletLocator that is cached locally does not get invalidated.
> # The TabletServerBatchWriter re-queues the failed entries and tries to write 
> them to the same closed tablet T1, because it is still looking up tablets 
> using the old TabletLocator.
> This behavior subsumes the circumstances William wrote about in the thread he 
> mentioned.  The problem would occur as a result of either splits or major 
> compactions.  It would only stop the BatchWriter if its entire memory filled 
> up with writes to the same tablet that was closed as a result of a majc or 
> split; otherwise it would just slow down the BatchWriter by failing to write 
> some number of entries with every RPC.
> There are a few solutions we can think of.  
> # Not have the MutationWriter inside the TabletServerBatchWriter locally 
> cache TabletLocators.  I suspect this was done for performance reasons, so 
> it's probably not a good solution. 
> # Have all the MutationWriters clear their cache at the same time the static 
> TabletLocator cache clears.  We could store a reference to the Map that each 
> MutationWriter has inside a static synchronized WeakHashMap.  The only time 
> the weak map needs to be accessed is:
> ## When a MutationWriter is constructed (from constructing a 
> TabletServerBatchWriter), add its new local TabletLocator cache to the weak 
> map.
> ## When the static TabletLocator cache is cleared, also clear every map in 
> the weak map.
> # Another solution is to make the invalidate calls on the local TabletLocator 
> cache rather than the global static one.  If we go this route we should 
> double check the idea to make sure it does not impact the correctness of any 
> other pieces of code that use the cache. 
> # Perhaps the simplest solution is to put an extra Boolean variable inside 
> the Locators indicating whether they are valid.  When they are cleared, their 
> Boolean variables set to false.  Before a client uses a locator from cache, 
> it checks its Boolean indicator.
> The TimeoutTabletLocator does not help when no timeout is set on the 
> BatchWriter (the default behavior).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to