[ https://issues.apache.org/jira/browse/ACCUMULO-4229?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15258562#comment-15258562 ]
ASF GitHub Bot commented on ACCUMULO-4229: ------------------------------------------ Github user joshelser commented on a diff in the pull request: https://github.com/apache/accumulo/pull/96#discussion_r61133856 --- Diff: core/src/main/java/org/apache/accumulo/core/client/impl/SyncingTabletLocator.java --- @@ -0,0 +1,118 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.accumulo.core.client.impl; + +import java.util.Collection; +import java.util.List; +import java.util.Map; + +import org.apache.accumulo.core.client.AccumuloException; +import org.apache.accumulo.core.client.AccumuloSecurityException; +import org.apache.accumulo.core.client.Instance; +import org.apache.accumulo.core.client.TableNotFoundException; +import org.apache.accumulo.core.data.KeyExtent; +import org.apache.accumulo.core.data.Mutation; +import org.apache.accumulo.core.data.Range; +import org.apache.accumulo.core.security.Credentials; +import org.apache.hadoop.io.Text; + +/** + * Syncs itself with the static collection of TabletLocators, so that when the server clears it, it will automatically get the most up-to-date version. Caching + * TabletLocators locally is safe when using SyncingTabletLocator. + */ +public class SyncingTabletLocator extends TabletLocator { + /* + * Implementation: the invalidateCache calls do not need to call syncLocator because it is okay if an outdated locator is invalidated. In this case, on the + * next call to a meaningful function like binRanges, the new locator will replace the current one and be used instead. The new locator is fresh; it has no + * cache at all when created. + */ + + /** + * Used to tell SyncingTabletLocator how to fetch a new locator when the current one is outdated. For example, + * <code>TabletLocator.getLocator(instance, tableId);</code> + */ + public interface GetLocatorFunction { + TabletLocator getLocator(); + } + + private volatile TabletLocator locator; + private final GetLocatorFunction getLocatorFunction; + + public SyncingTabletLocator(GetLocatorFunction getLocatorFunction) { + this.getLocatorFunction = getLocatorFunction; + this.locator = getLocatorFunction.getLocator(); + } + + public SyncingTabletLocator(final Instance instance, final Text tableId) { + this(new GetLocatorFunction() { + @Override + public TabletLocator getLocator() { + return TabletLocator.getLocator(instance, tableId); + } + }); + } + + private void syncLocator() { + if (!locator.isValid()) --- End diff -- > trying to reason about partial memory barriers makes my head hurt. I hear ya :) > given that TabletLocator.isValid is also volatile, and TabletLocator.getLocator(...) is synchronized. But given all that, taking out the lock on this here is effectively a no-op Ah, yeah, I see what you mean. > It's difficult to guess what the consequences would be for an arbitrary GetLocatorFunction Agreed. We lose the context of what that "Callable" (which is all that GetLocatorFunction really is -- why do we need a custom interface for that @dhutchis ?) and the locking already in place. > BatchWriter Locator cache out-of-sync when shared with tserver > -------------------------------------------------------------- > > Key: ACCUMULO-4229 > URL: https://issues.apache.org/jira/browse/ACCUMULO-4229 > Project: Accumulo > Issue Type: Bug > Components: client > Affects Versions: 1.6.5, 1.7.1 > Reporter: Dylan Hutchison > Assignee: Dylan Hutchison > > BatchWriters that run a long time have write rates that sometimes > mysteriously decrease after the table it is writing to goes through a major > compaction or a split. The decrease can be as bad as reducing throughput to > 0. > This was first first mentioned in this [email > thread|https://mail-archives.apache.org/mod_mbox/accumulo-user/201406.mbox/%3ccamz+duvmmhegon9ejehr9h_rrpp50l2qz53bbdruvo0pira...@mail.gmail.com%3E] > for major compactions. > I discovered this in this [email > thread|https://mail-archives.apache.org/mod_mbox/accumulo-dev/201604.mbox/%3CCAPx%3DJkaY7fVh-U0O%2Bysx2d98LOGMcA4oEQOYgoPxR-0em4hdvg%40mail.gmail.com%3E] > for splits. See the thread for some log messages. > I turned on TRACE logs and I think I pinned it down: the TabletLocator cached > by a BatchWriter gets out of sync with the static cache of TabletLocators. > # The TabletServerBatchWriter caches a TabletLocator from the static > collection of TabletLocators when it starts writing. Suppose it is writing > to tablet T1. > # The TabletServerBatchWriter uses its locally cached TabletLocator inside > its `binMutations` method for its entire lifespan; this cache is never > refreshed or updated to sync up with the static collection of TabletLocators. > # Every hour, the static collection of TabletLocators clears itself. The > next call to get a TabletLocator from the static collection allocates a new > TabletLocator. Unfortunately, the TabletServerBatchWriter does not reflect > this change and continues to use the old, locally cached TabletLocator. > # Tablet T1 splits into T2 and T3, which closes T1. As such, it no longer > exists and the tablet server that receives the entries meant to go to T1 all > fail to write because T1 is closed. > # The TabletServerBatchWriter receives the response from the tablet server > that all entries failed to write. It invalidates the cache of the *new* > TabletLocator obtained from the static collection of TabletLocators. The old > TabletLocator that is cached locally does not get invalidated. > # The TabletServerBatchWriter re-queues the failed entries and tries to write > them to the same closed tablet T1, because it is still looking up tablets > using the old TabletLocator. > This behavior subsumes the circumstances William wrote about in the thread he > mentioned. The problem would occur as a result of either splits or major > compactions. It would only stop the BatchWriter if its entire memory filled > up with writes to the same tablet that was closed as a result of a majc or > split; otherwise it would just slow down the BatchWriter by failing to write > some number of entries with every RPC. > There are a few solutions we can think of. > # Not have the MutationWriter inside the TabletServerBatchWriter locally > cache TabletLocators. I suspect this was done for performance reasons, so > it's probably not a good solution. > # Have all the MutationWriters clear their cache at the same time the static > TabletLocator cache clears. We could store a reference to the Map that each > MutationWriter has inside a static synchronized WeakHashMap. The only time > the weak map needs to be accessed is: > ## When a MutationWriter is constructed (from constructing a > TabletServerBatchWriter), add its new local TabletLocator cache to the weak > map. > ## When the static TabletLocator cache is cleared, also clear every map in > the weak map. > # Another solution is to make the invalidate calls on the local TabletLocator > cache rather than the global static one. If we go this route we should > double check the idea to make sure it does not impact the correctness of any > other pieces of code that use the cache. > # Perhaps the simplest solution is to put an extra Boolean variable inside > the Locators indicating whether they are valid. When they are cleared, their > Boolean variables set to false. Before a client uses a locator from cache, > it checks its Boolean indicator. > The TimeoutTabletLocator does not help when no timeout is set on the > BatchWriter (the default behavior). -- This message was sent by Atlassian JIRA (v6.3.4#6332)