[jira] [Commented] (ACCUMULO-4229) BatchWriter Locator cache out-of-sync when shared with tserver

ASF GitHub Bot (JIRA) Thu, 28 Apr 2016 12:24:23 -0700

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15262793#comment-15262793
 ]


ASF GitHub Bot commented on ACCUMULO-4229:
------------------------------------------

Github user dhutchis commented on the pull request:

    https://github.com/apache/accumulo/pull/96#issuecomment-215535249
  
    > An alternative solution to introducing SyncingTabletLocator could be to 
change the implementation of clearLocators to the following. This clears all of 
the caches while still maintaining the references to in-use locators.
    > 
    >      public static synchronized void clearLocators() {
    >         for(TabletLocator tl : locators.values()) {
    >           tl.invalidateCache();
    >         }
    >      }
    > If worried about empty tablet location caches being left around in the 
locators map, that could possibly be changed to a weak hash map.
    
    You could be on to something @keith-turner if we give it more thought.  The 
original reason why the `locators` cache is cleared every hour is to remove 
TabletLocators that refer to tablets that don't exist or are rarely accessed, 
ya?
    
    A WeakHashMap wouldn't work because it uses weak references for the keys. 
This would mean that the gc could reclaim keys as soon as the `getLocator` 
method returns, which is inefficient for a cache and has correctness issues if 
the static `invalidate` methods are called.
    
    What do you think of switching locators to a [Guava 
cache](https://github.com/google/guava/wiki/CachesExplained) with a 1 hour 
timeout on entries since their last access?
    
    I think this would let us remove the `synchronized` from `getLocator` 
because Guava caches are thread-safe and designed for concurrency.  It would 
eliminate the need for any `clearLocators` call (and `SyncingTabletLocator`).  
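
A minimal sketch of the expire-after-access behavior proposed above, written against the JDK only so that it runs without extra dependencies. It is not Guava and not Accumulo code: `AccessExpiringCache`, its injectable clock, and the loader function are hypothetical names for illustration. With Guava itself, `CacheBuilder.newBuilder().expireAfterAccess(1, TimeUnit.HOURS)` provides this kind of expiry.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeUnit;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical stand-in for a cache whose entries expire a fixed time after
// their last access. Illustration only; not Guava's implementation.
final class AccessExpiringCache<K, V> {
    private static final class Entry<V> {
        final V value;
        volatile long lastAccessNanos;

        Entry(V value, long now) {
            this.value = value;
            this.lastAccessNanos = now;
        }
    }

    private final ConcurrentHashMap<K, Entry<V>> map = new ConcurrentHashMap<>();
    private final long ttlNanos;
    private final LongSupplier clock;   // injectable so expiry can be tested
    private final Function<K, V> loader;

    AccessExpiringCache(long ttl, TimeUnit unit, LongSupplier clock, Function<K, V> loader) {
        this.ttlNanos = unit.toNanos(ttl);
        this.clock = clock;
        this.loader = loader;
    }

    // Returns the cached value, reloading it if the entry was last accessed
    // ttl or more ago. compute() is atomic per key, so no outer lock is needed.
    V get(K key) {
        long now = clock.getAsLong();
        return map.compute(key, (k, old) -> {
            if (old == null || now - old.lastAccessNanos >= ttlNanos) {
                return new Entry<>(loader.apply(k), now);
            }
            old.lastAccessNanos = now;
            return old;
        }).value;
    }

    // Drops entries that have not been accessed within the ttl.
    void evictExpired() {
        long now = clock.getAsLong();
        map.entrySet().removeIf(e -> now - e.getValue().lastAccessNanos >= ttlNanos);
    }
}
```

In this sketch only calls to `get` refresh an entry's access time. A caller that keeps using a returned value without calling `get` again does not keep the entry alive, so it can still end up holding an instance that the cache has since replaced.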


> BatchWriter Locator cache out-of-sync when shared with tserver
> --------------------------------------------------------------
>
>                 Key: ACCUMULO-4229
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-4229
>             Project: Accumulo
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 1.6.5, 1.7.1
>            Reporter: Dylan Hutchison
>            Assignee: Dylan Hutchison
>
> BatchWriters that run a long time have write rates that sometimes 
> mysteriously decrease after the table they are writing to goes through a major 
> compaction or a split.  The decrease can be as bad as reducing throughput to 
> 0.
> This was first mentioned in this [email 
> thread|https://mail-archives.apache.org/mod_mbox/accumulo-user/201406.mbox/%3ccamz+duvmmhegon9ejehr9h_rrpp50l2qz53bbdruvo0pira...@mail.gmail.com%3E]
>  for major compactions.  
> I discovered this in this [email 
> thread|https://mail-archives.apache.org/mod_mbox/accumulo-dev/201604.mbox/%3CCAPx%3DJkaY7fVh-U0O%2Bysx2d98LOGMcA4oEQOYgoPxR-0em4hdvg%40mail.gmail.com%3E]
>  for splits.  See the thread for some log messages.
> I turned on TRACE logs and I think I pinned it down: the TabletLocator cached 
> by a BatchWriter gets out of sync with the static cache of TabletLocators.
> # The TabletServerBatchWriter caches a TabletLocator from the static 
> collection of TabletLocators when it starts writing.  Suppose it is writing 
> to tablet T1.
> # The TabletServerBatchWriter uses its locally cached TabletLocator inside 
> its `binMutations` method for its entire lifespan; this cache is never 
> refreshed or updated to sync up with the static collection of TabletLocators.
> # Every hour, the static collection of TabletLocators clears itself.  The 
> next call to get a TabletLocator from the static collection allocates a new 
> TabletLocator.  Unfortunately, the TabletServerBatchWriter does not reflect 
> this change and continues to use the old, locally cached TabletLocator.
> # Tablet T1 splits into T2 and T3, which closes T1.  Because T1 no longer 
> exists, all entries meant for T1 fail to write when they reach the tablet 
> server.
> # The TabletServerBatchWriter receives the response from the tablet server 
> that all entries failed to write.  It invalidates the cache of the *new* 
> TabletLocator obtained from the static collection of TabletLocators.  The old 
> TabletLocator that is cached locally does not get invalidated.
> # The TabletServerBatchWriter re-queues the failed entries and tries to write 
> them to the same closed tablet T1, because it is still looking up tablets 
> using the old TabletLocator.
> This behavior subsumes the circumstances William wrote about in the thread he 
> mentioned.  The problem would occur as a result of either splits or major 
> compactions.  It would only stop the BatchWriter if its entire memory filled 
> up with writes to the same tablet that was closed as a result of a majc or 
> split; otherwise it would just slow down the BatchWriter by failing to write 
> some number of entries with every RPC.
> There are a few solutions we can think of.  
> # Not have the MutationWriter inside the TabletServerBatchWriter locally 
> cache TabletLocators.  I suspect this was done for performance reasons, so 
> it's probably not a good solution. 
> # Have all the MutationWriters clear their cache at the same time the static 
> TabletLocator cache clears.  We could store a reference to the Map that each 
> MutationWriter has inside a static synchronized WeakHashMap.  The only time 
> the weak map needs to be accessed is:
> ## When a MutationWriter is constructed (from constructing a 
> TabletServerBatchWriter), add its new local TabletLocator cache to the weak 
> map.
> ## When the static TabletLocator cache is cleared, also clear every map in 
> the weak map.
> # Another solution is to make the invalidate calls on the local TabletLocator 
> cache rather than the global static one.  If we go this route we should 
> double check the idea to make sure it does not impact the correctness of any 
> other pieces of code that use the cache. 
> # Perhaps the simplest solution is to put an extra Boolean variable inside 
> the Locators indicating whether they are valid.  When they are cleared, their 
> Boolean variables are set to false.  Before a client uses a locator from cache, 
> it checks its Boolean indicator.
> The TimeoutTabletLocator does not help when no timeout is set on the 
> BatchWriter (the default behavior).
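
The sequence described in the report, and the alternative of invalidating each locator in place instead of dropping it, can be sketched with a toy model. Every class here (`ToyLocator`, `ToyLocators`, `ToyWriter`, `StaleLocatorDemo`) is a hypothetical stand-in, not Accumulo code, and it assumes the simplified behavior laid out in the numbered steps: the writer keeps one locator reference for its whole life, while failures are invalidated through the static collection.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a TabletLocator: it only remembers tablet names.
final class ToyLocator {
    private final Set<String> cachedTablets = new HashSet<>();

    void cache(String tablet) { cachedTablets.add(tablet); }
    boolean hasCached(String tablet) { return cachedTablets.contains(tablet); }
    void invalidateCache() { cachedTablets.clear(); }
}

// Hypothetical stand-in for the static collection of TabletLocators.
final class ToyLocators {
    private static final Map<String, ToyLocator> locators = new HashMap<>();

    static synchronized ToyLocator getLocator(String table) {
        return locators.computeIfAbsent(table, t -> new ToyLocator());
    }

    // Behavior described in the report: drop every locator instance.
    static synchronized void clearLocators() { locators.clear(); }

    // Alternative from the comment: keep the instances, empty their caches.
    static synchronized void invalidateAll() {
        for (ToyLocator tl : locators.values()) {
            tl.invalidateCache();
        }
    }
}

// Hypothetical stand-in for the writer, which holds one locator for life.
final class ToyWriter {
    private final String table;
    private final ToyLocator local;

    ToyWriter(String table) {
        this.table = table;
        this.local = ToyLocators.getLocator(table);
    }

    // The writer looks up tablet T1 through its locally held locator.
    void locateT1() { local.cache("T1"); }

    // On a failed write, invalidation goes through the static collection.
    void onWriteFailure() { ToyLocators.getLocator(table).invalidateCache(); }

    boolean stillTargetsT1() { return local.hasCached("T1"); }
}

final class StaleLocatorDemo {
    // Returns true if the writer still targets the closed tablet T1 after
    // the hourly event, a split, and a failed write.
    static boolean staleAfterHourlyEvent(boolean dropInstances) {
        ToyLocators.clearLocators();            // start from an empty collection
        ToyWriter writer = new ToyWriter("demo");
        writer.locateT1();
        if (dropInstances) {
            ToyLocators.clearLocators();        // hourly clear drops the instance
        } else {
            ToyLocators.invalidateAll();        // hourly clear keeps the instance
        }
        writer.locateT1();                      // writer keeps writing to T1
        writer.onWriteFailure();                // T1 has split; the write fails
        return writer.stillTargetsT1();
    }

    public static void main(String[] args) {
        System.out.println("drop instances:      stale = " + staleAfterHourlyEvent(true));
        System.out.println("invalidate in place: stale = " + staleAfterHourlyEvent(false));
    }
}
```

In this model, dropping the instances leaves the writer pointing at T1 because the invalidation lands on a freshly allocated locator. Invalidating in place keeps the writer and the static collection on the same instance, so the failed write clears the stale entry.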



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
