cshannon commented on PR #4715:
URL: https://github.com/apache/accumulo/pull/4715#issuecomment-2241699447

   Here is an update to the current state of things. This PR has gotten quite 
large now so I think it's about time to wrap it up and look at follow on tasks 
for more work to be done. Below is where things are at for completed, still to 
be done in this PR, and future work.
   
   
   #### Finished:
   * Added a new gRPC module that contains all the necessary protobuf 
definitions for Compaction Coordinator Service (including common objects like 
KeyExtent. Updated the maven build to generate the protobuf objects and 
`CompactionCoordinator` service and to be proper dependencies of other modules.
   * Created a utiity class to convert between Thrift and Protobuf equivalent 
objects. This should hopefully be temporary until everything is converted.
   * Created another utility class to share a `ManagedChannel` for the gRPC 
connection across clients. This needs to be investigated more to see if pooling 
will be required (more listed under future work).
   * Completely replaced the Thrift version of the `CompactionCoordinator` 
service with a gRPC version. The Compactor and Monitor use this version now and 
create gRPC clients for the Compaction coordinator service. The thrift 
interface has been removed from the CompactionCoordinator. The Manager starts 
up the gRPC service and stops it on shutdown.
   * Replaced the thrift version of the compaction related thrift objects with 
protobuf versions, where possible, throughout the CompactionCoordinator code 
and related code.
   * getCompactionJob() RPC uses long polling now and is processed async by 
using the CompleteableFuture from the job queue. This unblocks the main io 
threads but there is still work to do with adding more thread pools for 
processing.
   * Updated `ClientContext` to contain a new credentials holder to hold both 
the grpc and thrift versions of credentials. 
   * Created a property for defining the gRPC port for the Manager. The default 
port is used for testing for now.
   * Fixed some tests and `ExternalCompaction_1_IT` is now passing. (more work 
is needed for tests still)
   
   #### Todo in this PR:
   * Make sure the tests all pass and the build is in a good state. Some tests 
like `ExternalCompaction_2_IT` are hanging and require more work to be done 
like a timeout feature (listed as follow on work). So we may need to do 
something as a quick solution for those tests for now or just disable them.
   * Address any concerns or issues found during PR reviews.
   
   
   #### Follow on Work:
   * Update ServiceDescriptor to support gRPC and add the gRPC port for the 
manager to Zookeeper. 
   * Add support for using a random port with gRPC when setting the port to 0.
   * Update MiniAccumulo cluster and tests to use a random port. This will 
require the random port feature to be implemented to find an open port and also 
ServiceDescriptor to work correcty with finding the gRPC service so clients can 
use it.
   * Add a thread pool for async rpc request response processing. We can use 
the async methods on CompletableFuture to take advantage of the thread pool.
   * Add accumulo config for thread pools for processing grpc request
   * Add the needed grpc jars and related jars (protobuf, etc) to tarball
   * Explore supporting kerberos and ssl w/ grpc
   * Look into pooling channels/connections for gRPC. ManagedChannel can be 
shared by multiple threads but it's still a bit unclear how they manage 
connections and if we need to implement our own custom pooling like we did for 
Thrift.
   * Investigate any other properties that need to be customized for the gRPC 
server and clients. It's netty based and there are a lot of different config 
options so we should look into what we can set (and their defaults) vs exposing 
as properties.
   * Convert more RPC layers, such as the Compactor service, to gRPC. When all 
of the compaction related items are changed over we can delete the Thrift 
related generated code (objects and compaction services) entirely.
   * Implement a timeout for the Compaction queue job CompletableFutures. This 
would be good to have a reasonable default in case the connections from 
Compactors associated with the future go away. 
   * Investigate error and exception handling with gRPC. We need to make sure 
when there are exceptions that the client and server behave correctly and don't 
kill the connection unless it needs to and it does there is reconnection.
   * Finish fixing the other External compaction tests. A timeout is necessary 
to fix ExternalCompaction_2_IT test as tests hang for now. The issue is when 
the test kill the original compactors and starts up a do nothing compator. The 
issue is because of the long polling change, there is a race condition where 
the job is assigned to the future from the original compactor. Then when the 
compactor dies, the next one won't get a job assigned as the future never times 
out from the dead compactor even though the job was cleaned up by the dead 
compaction detector. 
   * Explore how to implement tracing with gRPC. With Thrift we use TraceUtil 
and pass the TraceInfo object everywhere. There is code that was written to 
wrap all of these calls with a proxy. The proxy will look for the first 
argument and expect it to be trace info can can use that for debugging and 
metrics. With gRPC, there are no longer multiple arguments for RPC calls. Every 
call has exactly 1 object and that object contains all the fields necessary for 
the request so we would need to create a new version of this feature that would 
inspect the object being passed and look for the trace info.
   * Explore a common API for the RPC layer or at least minimizing what code 
touches the RPC objects. This needs to be talked about much more and a scope 
needs to be defined. It would be nice to minimiize the leakage of RPC objects 
but it's also tricky because we don't want to have to copy objects constantly. 
For example, we already convert between TKeyExent (and now PKeyExtent) and 
KeyExtent but I would not want to have to add a second conversion such as 
PKeyExtent<->Generic RPC KeyExtent<->KeyExtent. The RPC features and behavior 
are also very different between different RPC implementations such as gRPC 
being entirely async natively. Thrift also supports nulls where as Protobuf 
doesn't support null values.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to