dlmarion commented on PR #2665:
URL: https://github.com/apache/accumulo/pull/2665#issuecomment-1113302935

   ## Background
   
   Accumulo TabletServers are responsible for:
   
     1. ingesting new data
     2. compacting (merging) new and old data into files
     3. reading data from files to support system and user scans
     4. performing maintenance on Tablets (assignments, merging, splitting, 
bulk importing, etc).
   
   To support these activities newly ingested data is hosted in memory 
(in-memory maps) until it's written to a file, and blocks of accessed files may 
be cached within the TabletServer for better performance. The TabletServer has 
configuration properties to control the amout of memory available to the heap, 
in-memory maps, and block caches, and the size of the various thread pools that 
perform these activities. For example:
   
       tserver.assignment.concurrent.max
       tserver.bulk.process.threads
       tserver.cache.data.size
       tserver.cache.index.size
       tserver.cache.summary.size
       tserver.compaction.major.concurrent.max
       tserver.compaction.minor.concurrent.max
       tserver.memory.maps.max
       tserver.migrations.concurrent.max
       tserver.recovery.concurrent.max
       tserver.scan.executors.default.threads
       tserver.scan.executors.meta.threads
       tserver.scan.files.open.max
       tserver.server.threads.minimum
       tserver.sort.buffer.size
       tserver.summary.partition.threads
       tserver.summary.remote.threads
       tserver.total.mutation.queue.max
       tserver.workq.threads
   
   When a TabletServer exhausts available memory, for whatever reason, an 
OutOfMemoryError will be raised and the TabletServer will be terminated. When 
this happens all of the running scans on that TabletServer are paused while the 
Tablets are re-hosted and then the scans continue on the new TabletServers once 
the re-hosting process is complete. If the cause of the TabletServer failure 
was due to scans on a particular Tablet, then this process will repeat until 
there are no TabletServers remaining or the pattern is identified by a 
user/admin and the scan process is terminated.
   
   ## Objective
   
   Provide Accumulo users with the ability to run scans without terminating the 
TabletServer.
   
   ## Possible approaches
   
     1. Run the scan in a separate process
     2. Restrict memory usage on a per-scan basis
     3. Read directly from files in client side scan code.  This approach does 
not allow a small number of clients to scale out a large number of expensive 
queries to tablet and/or scan servers.  It also may lead to an OOM killing a 
client process that may be executing multiple concurrent scans.  It also does 
not allow client to leverage cache of data and metadata on a scan server or 
tablet server.
   
   ## This approach
   
   Create a separate server process that is used to run user scans and give the 
user the option whether or not to use the new server process on a per-scan 
basis. Provide the user with the ability to control how many scans will be 
affected if this new process dies and how many of these new processes to use 
for a single scan.
   
   ## Implementation
   
   This PR includes:
   
     1. a new server process called the ScanServer.
     2. changes to the Accumulo client
     3. changes to the GarbageCollector
     4. Ancillary changes
   
   ### Scan Server
   
   The ScanServer is a TabletHostingServer that hosts SnapshotTablets and 
implements the TabletScanClientService Thrift API. When the ScanServer receives 
a request via the scan API, it creates a SnapshotTablet object from the Tablet 
metadata (which may be cached), and then uses the ThriftScanClientHandler to 
complete the scan operations. The user scan is run using the same code that the 
TabletServer uses; the ScanServer is just responsible for ensuring that the 
Tablet exists for the scan code. The Tablet hosted within the ScanServer may 
not contain the exact same data as the corresponding Tablet hosted by the 
TabletServer. The ScanServer does not have any of the Tablet data that may 
reside within the in-memory maps and the Tablet may reference files that have 
been compacted as Tablet metadata can be cached within the ScanServer (see 
Property.SSERV_CACHED_TABLET_METADATA_EXPIRATION). The number of concurrent 
scans that the ScanServer will run is configurable (Property.SSERV_SCAN_EXECU
 TORS_DEFAULT_THREADS and Property.SSERV_SCAN_EXECUTORS_PREFIX). The ScanServer 
has other configuration properties that can be set to allow it to have 
different settings than the TabletServer (Thrift, block caches, etc). It is 
also possible that a ScanServer may be hosting multiple versions of a 
SnapshotTablet in the case where scans are in progress, the TabletMetadata has 
expires, and a new scan request arrives.
   
   Scan servers implement a busy timeout parameter on their scan RPCs.  The 
busytimeout allows a client to specify a configurable time during which the 
scan must either start running or throw a busy thrift exception.  On the client 
side this busy exception can be detected and a different scan server selected.
   
   ### Client changes
   
   A new method has been added to the client (ScannerBase.setConsistencyLevel) 
to configure the client to use IMMEDIATE (default) or EVENTUAL consistency for 
scans. IMMEDIATE means that the user wants to scan all data related to the 
Tablet at the time of the scan. To accomplish this the client will send the 
scan request to the TabletServer that is hosting the Tablet. This is the 
current behavior and is the default configuration, so no code change is 
required to have the same behavior. The other possible value, EVENTUAL, means 
that the user is willing to relax the data freshness guarantee that the 
TabletServer provides and instead potentially improve the chances of their scan 
completing when their scan is known to take a long time or require a lot of 
memory. When the consistency level is set to EVENTUAL the client uses a 
ScanServerDispatcher class to determine which ScanServers to use. The user can 
supply their own ScanServerDispatcher implementation 
(ClientProperty.SCAN_SERVER_DISPAT
 CHER) if they don't want to use the DefaultScanServerDispatcher (see class 
javadoc for a description of the behavior). Scans will be sent to the 
TabletServer in the event that EVENTUAL consistency is selected for the client 
and no ScanServers are running.
   
   #### Default scan server dispatcher
   
   The default scan server dispatcher that executes on the client side has the 
following strategy for selecting a scan server.
   
    * It hashes a tablets tableId, end row, and prev endrow.  This hash is used 
to consitently map the tablet to one of three random scan servers.  So for a 
given tablet the same three random scan servers are used by different tablets.
    * The client sends a request to one of the three scan servers with a small 
busytimeout.
    * If a busytimeout exception happens, then the default scan server 
dispatcher will notice this and it will choose from a larger set of scan 
servers.
    * The default scan server dispatcher will expand rapidly to randomly 
selecting from all scan servers after which point it will start exponentially 
increasing the busy timeout.
   
   For example if there are 1000 scan servers and a lot of them are busy, the 
default scan dispatcher might do something like the following.  This example 
shows how it will rapidly increase the set of servers chosen from and then 
start rapidly increasing the busy timeout.  The reason to start increasing the 
busy timeout after observing a lot busy exceptions is that those provide 
evidence that the entire cluster of scan servers may be busy. So eventually its 
better to just go to a scan server and queue up rather look for a non-busy scan 
server.
   
    1. Choose scan server S1 from 3 random scan servers with a busy timeout of 
33ms.
    2. If a busy exceptions happens. Choose scan server S2 from 21 random scan 
servers with a busy timeout of 33ms.
    3. If a busy exceptions happens. Choose scan server S3 from 147 random scan 
servers with a busy timeout of 33ms.
    4. If a busy exceptions happens. Choose scan server S4 from 1000 random 
scan servers with a busy timeout of 33ms.
    5. If a busy exceptions happens. Choose scan server S5 from 1000 random 
scan servers with a busy timeout of 66ms.
    6. If a busy exceptions happens. Choose scan server S6 from 1000 random 
scan servers with a busy timeout of 132ms.
   
   This default behavior makes tablets sticky to scan servers which is good for 
cache utilization and reusing cached tablet metadata. In the case where those 
few scan servers are busy the client starts searching for other places to run.
   
   ### Garbage Collector changes
   
   The ScanServer inserts entries into a new section (~sserv) of the metadata 
table to place a reservation on the file so that the GarbageCollector process 
does not remove the files that are being used for the scan. Accordingly 
GCEnv.getReferences has been modified to include these file reservations in the 
list of active file references. The ScanServer has a background thread that 
removes the file reservations from the metadata table after some period of time 
after the file is no longer used (see 
Property.SSERVER_SCAN_REFERENCE_EXPIRATION_TIME). The Manager has a new 
background thread that calls the ScanServerMetadataEntries.clean method on a 
periodic basis. Users can use the ScanServerMetadataEntries utility to remove 
file reservations that exist in the metadata table with no corresponding 
running ScanServer.
   
   In order to avoid race conditions with the Accumulo GC, Scan servers use the 
following algorithm when first reading a tablets metadata.
   
    1. Read metadata for tablet
    2. Write an ~sserv entries for the tablets files to the metadata table to 
prevent GC
    3. Read the meadata again and see if it changed.  If it did changes delete 
the entries from step 2 and go back to step 1.
   
   The above algorithm may be a bit expensive the first time a tablet is 
scanned on scan server.  However subsequent scans of the same tablet will use 
cached tablet metadata for a configurable time and not repeate the above steps. 
 In the future we may want to look into faster ways of preventing GC of files 
used by scan servers.
   
   ### Ancillary changes
   
     1. Modifications to scripts (accumulo-cluster, accumulo-service and 
accumulo-env.sh) have been made to start/stop one or more ScanServers per host. 
     2. The shell commands `grep` and `scan` have been modified to accept a 
consistency level (`cl`) argument
     3. The shell command `listscans` has been modified to include scans 
running on ScanServers
     4. ZooZap has been modified to remove ScanServer entries in ZooKeeper
     5. MiniAccumuloCluster has been modified to include the ability to 
start/stop ScanServers (used by the ITs)
     6. A new utility (ScanServerMetadataEntries) has been created to cleanup 
any dangling scan server file references in the metadata table.
   
   ## Shell Example
   
   Below is an example of how this works using the `scan` command in the shell.
   
   ```
   root@test> createtable test (1)
   root@test test> insert a b c d (2)
   root@test test> scan (3)
   a b:c []     d
   root@test test> scan -cl immediate (4)
   a b:c []     d
   root@test test> scan -cl eventual (5)
   root@test test> flush (6)
   2022-01-28T18:58:10,693 [shell.Shell] INFO : Flush of table test  
initiated...
   root@test test> scan (7)
   a b:c []     d
   root@test test> scan -cl eventual (8)
   a b:c []     d
   ```
   
   In this example, I create a table (1) and insert some data (2). When I run a 
scan (3,4) with the immediate consistency level, which happens to be the 
default, the client uses the normal code path and issues the scan command 
against the Tablet Server. Data is returned because the Tablet Server code path 
also returns data that is in the in-memory map. When I scan with the eventual 
consistency level (5) no data is returned because the Scan Server only uses the 
data in the Tablet's files. When I flush (6) the data to write a file in HDFS, 
the subsequent scans with immediate (7) and eventual (8) consistency level 
return the data.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to