[
https://issues.apache.org/jira/browse/HBASE-12728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263110#comment-14263110
]
Carter commented on HBASE-12728:
--------------------------------
Okay, here's another pass, scratching out the HTableMultiplexer idea. Instead
we'll create a new class called {{AsyncPutter}}. (Not a huge fan of the name,
so if you have a better one, please share.)
First off, here are our basic requirements in this refactor:
# Handle the M/R case where a user wants to batch and flush in a single thread
# Handle the case Aaron described where we batch across multiple threads
# Provide a way to do this through the new Table interface for convenience
# Buffering/batching limits based on size in bytes, not queue length
# Move towards [~lhofhansl]'s suggestion of "HTable as cheap proxies to tables
only"
# While durability can't be guaranteed in case of a crash, avoid losing data
otherwise.
So here are our classes:
{code:java}
// BufferedTable is lightweight and single-threaded. Many of them can share a
// single AsyncPutter.
public class BufferedTable implements Table {
  public BufferedTable(Table t, AsyncPutter ap);
  public void flush();
}

// Thread-safe handler of puts for one or more BufferedTable instances.
public class AsyncPutter implements Closeable {
  public AsyncPutter(Connection c, ExecutorService pool, ExceptionListener e,
      PutBuffer pb);
  // Synchronization adds nanoseconds in the single-threaded case. No biggie.
  public synchronized void add(Put put);
  public synchronized void flush();
  public synchronized void close();
}

// Simple single-threaded data holder.
public class PutBuffer {
  // maxBufferSize is in bytes, which makes more sense than queue length for
  // memory management: maxBufferSize = totalBufferMem / numberOfExecutorPoolThreads
  public PutBuffer(long maxBufferSize);
  public void add(Put p);
  public boolean isBatchAvailable();
  public List<Put> removeBatch();
}

// To make sure exceptions don't get swallowed.
public interface ExceptionListener {
  void onException(RetriesExhaustedWithDetailsException e);
}
{code}
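To make the wiring concrete, here's a rough usage sketch against the signatures above. The pool size, the 2 MB batch size, the table/column names, and the {{connection}} variable are all just placeholders for illustration, not part of the proposal:
{code:java}
// Hypothetical wiring only; assumes the constructors sketched above.
ExecutorService pool = Executors.newFixedThreadPool(4);
PutBuffer buffer = new PutBuffer(2L * 1024 * 1024); // ~2 MB of queued Puts
ExceptionListener listener = new ExceptionListener() {
  @Override
  public void onException(RetriesExhaustedWithDetailsException e) {
    // Don't let async failures get swallowed.
    e.printStackTrace();
  }
};

try (AsyncPutter putter = new AsyncPutter(connection, pool, listener, buffer)) {
  // Many lightweight, single-threaded BufferedTables can share the one putter.
  BufferedTable table =
      new BufferedTable(connection.getTable(TableName.valueOf("my_table")), putter);
  table.put(new Put(Bytes.toBytes("row1"))
      .addColumn(Bytes.toBytes("cf"), Bytes.toBytes("q"), Bytes.toBytes("v")));
  table.flush(); // or just let AsyncPutter#close flush on the way out
}
{code}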
We also proposed a {{BufferedConnection}} factory, simply to make it easier to
switch between Table and BufferedTable implementations without much refactoring.
When used, it would own the AsyncPutter. Pros/cons for this idea? It's not
essential.
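If we do go that route, I'm picturing something about this thin (sketch only, names not final):
{code:java}
// Sketch only. Owns the shared AsyncPutter and hands out BufferedTable proxies.
public class BufferedConnection implements Closeable {
  private final Connection connection;
  private final AsyncPutter putter;

  public BufferedConnection(Connection connection, AsyncPutter putter) {
    this.connection = connection;
    this.putter = putter;
  }

  public Table getTable(TableName name) throws IOException {
    return new BufferedTable(connection.getTable(name), putter);
  }

  @Override
  public void close() throws IOException {
    putter.close(); // flushes, per the close semantics described below
  }
}
{code}
Closing the BufferedConnection would close (and therefore flush) the AsyncPutter it owns, which lines up with the close semantics further down.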
Asynchronous exception handling takes place through an {{ExceptionListener}}
observer provided by the user. This means that exceptions are not thrown for
simple put failures; they are passed to the listener. The reasoning here is that
the current behavior is non-deterministic:
{code:java}
table.put(put1); // This put causes an exception
table.put(put2); // But we don't see the exception until we get here ...
table.put(put3); // ... or maybe(?) here. put3 succeeded, but I got an
                 // exception thrown. That's counter-intuitive.
{code}
An ExceptionListener is a pretty standard pattern for asynchronous error
handling. M/R or other cases might rely on an exception being thrown
synchronously to roll back appropriately, but it's easy enough to mimic that
behavior with the listener approach; see the sketch below.
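For example, an M/R writer that wants the old synchronous failure semantics could plug in a listener that stashes the exception and rethrows it right after a flush. This adapter is purely illustrative, not part of the proposal:
{code:java}
// Illustrative adapter: remember the async failure and rethrow it at a
// deterministic point chosen by the caller.
public class RethrowingListener implements ExceptionListener {
  private volatile RetriesExhaustedWithDetailsException pending;

  @Override
  public void onException(RetriesExhaustedWithDetailsException e) {
    pending = e;
  }

  // Call right after AsyncPutter#flush to get synchronous-looking failures.
  public void rethrowIfFailed() throws RetriesExhaustedWithDetailsException {
    RetriesExhaustedWithDetailsException e = pending;
    if (e != null) {
      pending = null;
      throw e;
    }
  }
}
{code}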
{{BufferedTable#close}} does not flush since we need to support batching across
multiple threads. {{AsyncPutter#close}} does flush. (Will JavaDoc this.) If
we decide to provide a BufferedConnection, then closing that would also flush,
since it owns the AsyncPutter.
Do we need a timeout-based flush? I don't see one in the current HTable
implementation, but if it's important we could add it to the AsyncPutter.
It seems like a good way to limit how many mutations could be lost when writes
trickle slowly into a big buffer.
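If we decide it's worth it, the simplest version I can think of is an external scheduler poking {{AsyncPutter#flush}} periodically; something like the following (the interval is made up, and this assumes flush() stays cheap when the buffer is empty):
{code:java}
// Illustration only: flush the shared AsyncPutter every few seconds so a slow
// trickle of writes can't sit in a big buffer indefinitely.
ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
scheduler.scheduleAtFixedRate(new Runnable() {
  @Override
  public void run() {
    putter.flush(); // flush() is synchronized, so this is safe alongside add()
  }
}, 5, 5, TimeUnit.SECONDS);
{code}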
> buffered writes substantially less useful after removal of HTablePool
> ---------------------------------------------------------------------
>
> Key: HBASE-12728
> URL: https://issues.apache.org/jira/browse/HBASE-12728
> Project: HBase
> Issue Type: Bug
> Components: hbase
> Affects Versions: 0.98.0
> Reporter: Aaron Beppu
>
> In previous versions of HBase, when use of HTablePool was encouraged, HTable
> instances were long-lived in that pool, and for that reason, if autoFlush was
> set to false, the table instance could accumulate a full buffer of writes
> before a flush was triggered. Writes from the client to the cluster could
> then be substantially larger and less frequent than without buffering.
> However, when HTablePool was deprecated, the primary justification seems to
> have been that creating HTable instances is cheap, so long as the connection
> and executor service being passed to it are pre-provided. A use pattern was
> encouraged where users should create a new HTable instance for every
> operation, using an existing connection and executor service, and then close
> the table. In this pattern, buffered writes are substantially less useful;
> writes are as small and as frequent as they would have been with
> autoflush=true, except the synchronous write is moved from the operation
> itself to the table close call which immediately follows.
> More concretely:
> ```
> // Given these two helpers ...
> private HTableInterface getAutoFlushTable(String tableName) throws IOException {
>   // (autoflush is true by default)
>   return storedConnection.getTable(tableName, executorService);
> }
>
> private HTableInterface getBufferedTable(String tableName) throws IOException {
>   HTableInterface table = getAutoFlushTable(tableName);
>   table.setAutoFlush(false);
>   return table;
> }
>
> // It's my contention that these two methods would behave almost identically,
> // except the first will hit a synchronous flush during the put call, and the
> // second will flush during the (hidden) close call on table.
> private void writeAutoFlushed(Put somePut) throws IOException {
>   try (HTableInterface table = getAutoFlushTable(tableName)) {
>     table.put(somePut); // will do synchronous flush
>   }
> }
>
> private void writeBuffered(Put somePut) throws IOException {
>   try (HTableInterface table = getBufferedTable(tableName)) {
>     table.put(somePut);
>   } // auto-close will trigger synchronous flush
> }
> ```
> For buffered writes to actually provide a performance benefit to users, one
> of two things must happen:
> - The writeBuffer itself shouldn't live, flush, and die with the lifecycle of
> its HTable instance. If the writeBuffer were managed elsewhere and had a long
> lifespan, this could cease to be an issue. However, if the same writeBuffer
> is appended to by multiple tables, then some additional concurrency control
> will be needed around it.
> - Alternatively, there should be some pattern for having long-lived HTable
> instances. However, since HTable is not thread-safe, we'd need multiple
> instances, and a mechanism for leasing them out safely -- which sure sounds a
> lot like the old HTablePool to me.
> See discussion on mailing list here :
> http://mail-archives.apache.org/mod_mbox/hbase-user/201412.mbox/%3CCAPdJLkEzmUQZ_kvD%3D8mrxi4V%3DhCmUp3g9MUZsddD%2Bmon%2BAvNtg%40mail.gmail.com%3E
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)